* [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
@ 2026-01-26  6:52 Youngjun Park
  2026-01-26  6:52 ` [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure Youngjun Park
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Youngjun Park @ 2026-01-26  6:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, gunho.lee, taejoon.song, austin.kim,
	youngjun.park

This is the second version of the RFC for the "Swap Tiers" concept.
Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

This version incorporates feedback received during LPC 2025 and addresses
comments from the previous review. We have also included experimental
results based on usage scenarios intended for our internal platforms.

Motivation & Concept recap
==========================
Current Linux swap allocation is global, limiting the ability to assign
faster devices to specific cgroups. Our initial attempt at per-cgroup
priorities proved over-engineered and caused LRU inversion.

Following Chris Li's suggestion, we pivoted to "Swap Tiers." A tier is
simply a user-named group of swap devices sharing the same priority range.
This abstraction facilitates swap device selection based on speed, allowing
users to configure specific tiers for cgroups.

For more details, please refer to the LPC 2025 presentation
https://lpc.events/event/19/contributions/2141/attachments/1857/3998/LPC2025Finalss.pdf
or v1 patch.

Changes in v2
=============
1. Respect cgroup hierarchy principle (LPC 2025 feedback)
- The logic now strictly follows standard cgroup hierarchy principles.

Previous: Children could select any tier using "+" regardless of the
parent's configuration. A tier selected with "+" was referenced, so it
could not silently disappear.

Current: The explicit selection ("+") concept is removed. By default,
all tiers are selected, and users use "-" to exclude specific tiers. An
excluded tier may disappear silently.
A child cgroup is always a subset of its parent. Even if a child
re-enables a tier with "+" that was excluded by the parent, the effective
tier list is limited to the parent's allowed subset.

Example:
Global Tiers: SSD, HDD, NET
Parent: SSD, NET (HDD excluded)
Child: HDD, NET (SSD excluded)
-> Effective Child Tier: NET (Intersection of Parent and Child)
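
As an illustration, the example above could be configured roughly as
follows, assuming a cgroup hierarchy at /sys/fs/cgroup/parent/child (tier
names, priorities and paths are examples only):

  # echo "+SSD:100 +HDD:50 +NET:-1" > /sys/kernel/mm/swap/tiers
  # echo "-HDD" > /sys/fs/cgroup/parent/memory.swap.tiers
  # echo "-SSD" > /sys/fs/cgroup/parent/child/memory.swap.tiers
  # cat /sys/fs/cgroup/parent/child/memory.swap.tiers
    (the second, effective line is expected to show only NET)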

2. Simplified swap_tier structure (Chris Li)
- Replaced 'end prio' and priority lists with standard list_head.

3. Reference counting removed
- Removed refcount for swap_tiers. Liveness is now guaranteed by checking
if the swap device is in use.
- Since the default selection is "ALL", holding references for disappearing
tiers is unnecessary; only exclusions ("-") matter.

4. Support mixed operation (/sys/kernel/mm/swap/tiers) (Chris Li)
- Supports mixed use of "+" and "-" operations.

5. Add modify operation
- Introduced an operation to modify the priority of existing tiers.
Format: "tier_name:priority"
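
As an illustration (tier names and priorities below are examples only), a
batch write mixing add, modify and remove on /sys/kernel/mm/swap/tiers
could look like:

  # echo "+HDD:50 +NET:-1" > /sys/kernel/mm/swap/tiers    (add two tiers)
  # echo "HDD:80 +SSD:100" > /sys/kernel/mm/swap/tiers    (modify HDD, add SSD)
  # echo "-SSD,-HDD,-NET" > /sys/kernel/mm/swap/tiers     (remove all three)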

6. Restore the cluster allocation rule for swap devices sharing the same priority (Kairui Song, Baoquan He)
- Preserve the existing fastpath and slowpath swap allocation logic using a
percpu swap device cache.
- As a result, the cluster allocation rule for swap devices of the same
priority is preserved.

7. Remove compile time selection (Chris Li)
- Removed CONFIG_SWAP_TIER; this is now a base kernel feature.

8. Cgroup tier calculation logic update
- The effective swap tier for a cgroup is now calculated at the time of
configuration (writing to swap.tiers), rather than at the time of swap
allocation.

9. Commit reorganization (Chris Li)
- Commit order reorganized for clarity.

10. Documentation (Chris Li)
- Added documentation for the Swap Tiers concept and usage, as explained in
the RFC.

Apply and Benchmark
===================
1. Real-world Scenario: App Preloading
We applied this patchset to our embedded platform to enable application
preloading for faster launch times. Since the platform uses flash storage
as the default swap device, we aim to minimize swap usage on it to extend
its lifespan.

To achieve this, we utilized an idle device (configured via NBD) as a
separate swap tier specifically for these preloaded applications.
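
A rough sketch of such a setup (device path, tier names, priorities and
cgroup paths below are purely illustrative):

  # swapon -p 100 /dev/nbd0                                   (idle NBD device)
  # echo "+NBD:100 +FLASH:-1" > /sys/kernel/mm/swap/tiers
  # echo "-NBD" > /sys/fs/cgroup/system/memory.swap.tiers     (regular work stays on flash)
  # echo "-FLASH" > /sys/fs/cgroup/preload/memory.swap.tiers  (preloaded apps use NBD only)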

While it is self-evident that restoring from swap (warm launch) is faster
than a cold launch, the data below demonstrates the latency reduction
achieved in this environment.

Streaming App A:
Before (Cold Launch): 13.17s
After (Preloaded): 4.18s (68% reduction)

Streaming App B:
Before (Cold Launch): 5.60s
After (Preloaded): 1.12s (80% reduction)

E-commerce App C:
Before (Cold Launch): 10.25s
After (Preloaded): 2.00s (80% reduction)

We plan to solidify this use case and expand it further.

2. Microbenchmarks
In response to feedback regarding potential regressions in the swap
fastpath (specifically, the overhead of a global percpu cluster vs. a
per-device percpu cache), we addressed this concern in RFC v2.

By preserving the existing fastpath and slowpath mechanisms via the per-cpu
swap device cache, we ensured that the performance characteristics remain
unchanged. The simple benchmark results below confirm there is no
significant difference.

A. Kernel build test:
Test using a 128GB swapfile on a simulated SSD, QEMU VM with 4 CPUs and
4GB RAM, avg of 5 test runs:

              Before        After
System time:  1584.20s      1590.74s (+0.41%)

Considering the deviation between max/min values, there seems to be no
significant difference.

B. vm-scalability
usemem --init-time -O -y -x -n 32 256M (qemu, 16G memory, global pressure,
simulated SSD as swap), avg of 5 test runs:

                           Before          After
System time:               588.48 s        592.15 s
Sum Throughput:            16.65 MB/s      15.95 MB/s
Single process Throughput: 0.52 MB/s       0.50 MB/s
Avg Free latency:          1098422.97 us   1106388.97 us

The results indicate that the performance remains stable with negligible
variance.

Any feedback is welcome.

Thanks,
Youngjun Park

Youngjun Park (5):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interface for swap tier selection
  mm, swap: change back to use each swap device's percpu cluster
  mm, swap: introduce percpu swap device cache to avoid fragmentation

 Documentation/admin-guide/cgroup-v2.rst |  27 ++
 Documentation/mm/swap-tier.rst          | 109 ++++++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   3 +-
 include/linux/swap.h                    |  17 +-
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  80 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  70 ++++
 mm/swap_tier.c                          | 452 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  70 ++++
 mm/swapfile.c                           | 132 +++----
 12 files changed, 900 insertions(+), 68 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
-- 
2.34.1



* [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure
  2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
@ 2026-01-26  6:52 ` Youngjun Park
  2026-02-12  9:07   ` Chris Li
  2026-01-26  6:52 ` [RFC PATCH v2 v2 2/5] mm: swap: associate swap devices with tiers Youngjun Park
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Youngjun Park @ 2026-01-26  6:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, gunho.lee, taejoon.song, austin.kim,
	youngjun.park

This patch introduces the "Swap tier" concept, which serves as an
abstraction layer for managing swap devices based on their performance
characteristics (e.g., NVMe, HDD, Network swap).

Swap tiers are user-named groups representing priority ranges.
These tiers collectively cover the entire priority
space from -1 (`DEF_SWAP_PRIO`) to `SHRT_MAX`.

To configure tiers, a new sysfs interface is exposed at
`/sys/kernel/mm/swap/tiers`. The input parser evaluates commands from
left to right and supports batch input, allowing users to add, remove or
modify multiple tiers in a single write operation.

Tier management enforces continuous priority ranges anchored by start
priorities. Operations trigger range splitting or merging, but overwriting
an existing start priority is forbidden. Merging expands lower tiers
upwards to preserve configured start priorities, except when the tier
starting at `DEF_SWAP_PRIO` is removed, in which case the merge goes
downwards.
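
For illustration (tier names and priorities are examples only), the
behaviour expected from the rules above is:

  # echo "+SSD:100 +HDD:50 +NET:-1" > /sys/kernel/mm/swap/tiers
  # echo "-HDD" > /sys/kernel/mm/swap/tiers
    (NET expands upwards to cover -1..99, SSD keeps 100..SHRT_MAX)
  # echo "-NET" > /sys/kernel/mm/swap/tiers
    (the DEF_SWAP_PRIO tier is removed, so SSD's start merges down to -1)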

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 MAINTAINERS     |   2 +
 mm/Makefile     |   2 +-
 mm/swap.h       |   4 +
 mm/swap_state.c |  70 +++++++++++
 mm/swap_tier.c  | 304 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_tier.h  |  38 ++++++
 mm/swapfile.c   |   7 +-
 7 files changed, 423 insertions(+), 4 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 18d1ebf053db..501bf46adfb4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16743,6 +16743,8 @@ F:	mm/swap.c
 F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
+F:	mm/swap_tier.c
+F:	mm/swap_tier.h
 F:	mm/swapfile.c
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
diff --git a/mm/Makefile b/mm/Makefile
index 53ca5d4b1929..3b3de2de7285 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_tier.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap.h b/mm/swap.h
index bfafa637c458..55f230cbe4e7 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -16,6 +16,10 @@ extern int page_cluster;
 #define swap_entry_order(order)	0
 #endif
 
+#define DEF_SWAP_PRIO  -1
+
+extern spinlock_t swap_lock;
+extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6d0eef7470be..f1a7d9cdc648 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -947,8 +948,77 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 }
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
 
+static ssize_t tiers_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return swap_tiers_sysfs_show(buf);
+}
+
+static ssize_t tiers_store(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			const char *buf, size_t count)
+{
+	char *p, *token, *name, *tmp;
+	int ret = 0;
+	short prio;
+	DEFINE_SWAP_TIER_SAVE_CTX(ctx);
+
+	tmp = kstrdup(buf, GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_tier_lock);
+
+	p = tmp;
+	swap_tiers_save(ctx);
+
+	while (!ret && (token = strsep(&p, ", \t\n")) != NULL) {
+		if (!*token)
+			continue;
+
+		if (token[0] == '-') {
+			ret = swap_tiers_remove(token + 1);
+		} else {
+
+			name = strsep(&token, ":");
+			if (!token || kstrtos16(token, 10, &prio)) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			if (name[0] == '+')
+				ret = swap_tiers_add(name + 1, prio);
+			else
+				ret = swap_tiers_modify(name, prio);
+		}
+
+		if (ret)
+			goto restore;
+	}
+
+	if (!swap_tiers_validate()) {
+		ret = -EINVAL;
+		goto restore;
+	}
+
+out:
+	spin_unlock(&swap_tier_lock);
+	spin_unlock(&swap_lock);
+
+	kfree(tmp);
+	return ret ? ret : count;
+
+restore:
+	swap_tiers_restore(ctx);
+	goto out;
+}
+
+static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
+
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+	&tier_attr.attr,
 	NULL,
 };
 
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
new file mode 100644
index 000000000000..87882272eec8
--- /dev/null
+++ b/mm/swap_tier.c
@@ -0,0 +1,304 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/swap.h>
+#include <linux/memcontrol.h>
+#include "memcontrol-v1.h"
+#include <linux/sysfs.h>
+#include <linux/plist.h>
+
+#include "swap.h"
+#include "swap_tier.h"
+
+/*
+ * struct swap_tier - structure representing a swap tier.
+ *
+ * @name: name of the swap_tier.
+ * @prio: starting value of priority.
+ * @list: linked list of tiers.
+*/
+static struct swap_tier {
+	char name[MAX_TIERNAME];
+	short prio;
+	struct list_head list;
+} swap_tiers[MAX_SWAPTIER];
+
+DEFINE_SPINLOCK(swap_tier_lock);
+/* active swap priority list, sorted in descending order */
+static LIST_HEAD(swap_tier_active_list);
+/* unused swap_tier object */
+static LIST_HEAD(swap_tier_inactive_list);
+
+#define TIER_IDX(tier)	((tier) - swap_tiers)
+#define TIER_MASK(tier)	(1 << TIER_IDX(tier))
+#define TIER_INVALID_PRIO (DEF_SWAP_PRIO - 1)
+#define TIER_END_PRIO(tier) \
+	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
+	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
+
+#define for_each_tier(tier, idx) \
+	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
+		idx++, tier = &swap_tiers[idx])
+
+#define for_each_active_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_active_list, list)
+
+#define for_each_inactive_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_inactive_list, list)
+
+/*
+ * Naming Convention:
+ *   swap_tiers_*() - Public/exported functions
+ *   swap_tier_*()  - Private/internal functions
+ */
+
+static bool swap_tier_is_active(void)
+{
+	return !list_empty(&swap_tier_active_list) ? true : false;
+}
+
+static struct swap_tier *swap_tier_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (!strcmp(tier->name, name))
+			return tier;
+	}
+
+	return NULL;
+}
+
+void swap_tiers_init(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+		list_add_tail(&tier->list, &swap_tier_inactive_list);
+	}
+}
+
+ssize_t swap_tiers_sysfs_show(char *buf)
+{
+	struct swap_tier *tier;
+	ssize_t len = 0;
+
+	len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
+			 "Name", "Idx", "PrioStart", "PrioEnd");
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
+				     tier->name,
+				     TIER_IDX(tier),
+				     tier->prio,
+				     TIER_END_PRIO(tier));
+		if (len >= PAGE_SIZE)
+			break;
+	}
+	spin_unlock(&swap_tier_lock);
+
+	return len;
+}
+
+static void swap_tier_insert_by_prio(struct swap_tier *new)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio > new->prio)
+			continue;
+
+		list_add_tail(&new->list, &tier->list);
+		return;
+	}
+	/* First addition, or becomes the first tier */
+	list_add_tail(&new->list, &swap_tier_active_list);
+}
+
+static void __swap_tier_prepare(struct swap_tier *tier, const char *name,
+	short prio)
+{
+	list_del_init(&tier->list);
+	strscpy(tier->name, name, MAX_TIERNAME);
+	tier->prio = prio;
+}
+
+static struct swap_tier *swap_tier_prepare(const char *name, short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (prio < DEF_SWAP_PRIO)
+		return NULL;
+
+	if (list_empty(&swap_tier_inactive_list))
+		return ERR_PTR(-EPERM);
+
+	tier = list_first_entry(&swap_tier_inactive_list,
+		struct swap_tier, list);
+
+	__swap_tier_prepare(tier, name, prio);
+	return tier;
+}
+
+static int swap_tier_check_range(short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		/* No overwrite */
+		if (tier->prio == prio)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+int swap_tiers_add(const char *name, int prio)
+{
+	int ret;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	/* Duplicate check */
+	if (swap_tier_lookup(name))
+		return -EPERM;
+
+	ret = swap_tier_check_range(prio);
+	if (ret)
+		return ret;
+
+	tier = swap_tier_prepare(name, prio);
+	if (IS_ERR(tier)) {
+		ret = PTR_ERR(tier);
+		return ret;
+	}
+
+
+	swap_tier_insert_by_prio(tier);
+	return ret;
+}
+
+int swap_tiers_remove(const char *name)
+{
+	int ret = 0;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	tier = swap_tier_lookup(name);
+	if (!tier)
+		return -EINVAL;
+
+	list_move(&tier->list, &swap_tier_inactive_list);
+
+	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
+	if (swap_tier_is_active() && tier->prio == DEF_SWAP_PRIO)
+		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
+
+	return ret;
+}
+
+int swap_tiers_modify(const char *name, int prio)
+{
+	int ret;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	tier = swap_tier_lookup(name);
+	if (!tier)
+		return -EINVAL;
+
+	/* No need to modify */
+	if (tier->prio == prio)
+		return 0;
+
+	ret = swap_tier_check_range(prio);
+	if (ret)
+		return ret;
+
+	list_del_init(&tier->list);
+	tier->prio = prio;
+	swap_tier_insert_by_prio(tier);
+
+	return ret;
+}
+
+/*
+ * XXX: Reverting individual operations becomes complex as the number of
+ * operations grows. Instead, we save the original state beforehand and
+ * fully restore it if any operation fails.
+ */
+void swap_tiers_save(struct swap_tier_save_ctx ctx[])
+{
+	struct swap_tier *tier;
+	int idx;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		idx = TIER_IDX(tier);
+		strcpy(ctx[idx].name, tier->name);
+		ctx[idx].prio = tier->prio;
+	}
+
+	for_each_inactive_tier(tier) {
+		idx = TIER_IDX(tier);
+		/* Indicator of inactive */
+		ctx[idx].prio = TIER_INVALID_PRIO;
+	}
+}
+
+void swap_tiers_restore(struct swap_tier_save_ctx ctx[])
+{
+	struct swap_tier *tier;
+	int idx;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	/* Invalidate active list */
+	list_splice_tail_init(&swap_tier_active_list,
+			&swap_tier_inactive_list);
+
+	for_each_tier(tier, idx) {
+		if (ctx[idx].prio != TIER_INVALID_PRIO) {
+			/* Preserve idx(mask) */
+			__swap_tier_prepare(tier, ctx[idx].name, ctx[idx].prio);
+			swap_tier_insert_by_prio(tier);
+		}
+	}
+}
+
+bool swap_tiers_validate(void)
+{
+	struct swap_tier *tier;
+
+	/*
+	 * Initial setting might not cover DEF_SWAP_PRIO.
+	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
+	 * Also, modify operation can change only one remaining priority.
+	 */
+	if (swap_tier_is_active()) {
+		tier = list_last_entry(&swap_tier_active_list,
+			struct swap_tier, list);
+
+		if (tier->prio != DEF_SWAP_PRIO)
+			return false;
+	}
+
+	return true;
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
new file mode 100644
index 000000000000..4b1b0602d691
--- /dev/null
+++ b/mm/swap_tier.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_TIER_H
+#define _SWAP_TIER_H
+
+#include <linux/types.h>
+#include <linux/spinlock.h>
+
+#define MAX_TIERNAME		16
+
+/* Ensure MAX_SWAPTIER does not exceed MAX_SWAPFILES */
+#if 8 > MAX_SWAPFILES
+#define MAX_SWAPTIER		MAX_SWAPFILES
+#else
+#define MAX_SWAPTIER		8
+#endif
+
+extern spinlock_t swap_tier_lock;
+
+struct swap_tier_save_ctx {
+	char name[MAX_TIERNAME];
+	short prio;
+};
+
+#define DEFINE_SWAP_TIER_SAVE_CTX(_name) \
+	struct swap_tier_save_ctx _name[MAX_SWAPTIER] = {0}
+
+/* Initialization and application */
+void swap_tiers_init(void);
+ssize_t swap_tiers_sysfs_show(char *buf);
+
+int swap_tiers_add(const char *name, int prio);
+int swap_tiers_remove(const char *name);
+int swap_tiers_modify(const char *name, int prio);
+
+void swap_tiers_save(struct swap_tier_save_ctx ctx[]);
+void swap_tiers_restore(struct swap_tier_save_ctx ctx[]);
+bool swap_tiers_validate(void);
+#endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7b055f15d705..c27952b41d4f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -50,6 +50,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
@@ -65,7 +66,7 @@ static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
 			 enum swap_cluster_flags new_flags);
 
-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
@@ -76,7 +77,6 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-#define DEF_SWAP_PRIO  -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -89,7 +89,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
  * all active swap_info_structs
  * protected with swap_lock, and ordered by priority.
  */
-static PLIST_HEAD(swap_active_head);
+PLIST_HEAD(swap_active_head);
 
 /*
  * all available (active, not full) swap_info_structs
@@ -3977,6 +3977,7 @@ static int __init swapfile_init(void)
 		swap_migration_ad_supported = true;
 #endif	/* CONFIG_MIGRATION */
 
+	swap_tiers_init();
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.34.1




* [RFC PATCH v2 v2 2/5] mm: swap: associate swap devices with tiers
  2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
  2026-01-26  6:52 ` [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2026-01-26  6:52 ` Youngjun Park
  2026-01-26  6:52 ` [RFC PATCH v2 v2 3/5] mm: memcontrol: add interface for swap tier selection Youngjun Park
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Youngjun Park @ 2026-01-26  6:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, gunho.lee, taejoon.song, austin.kim,
	youngjun.park

This patch connects swap devices to the swap tier infrastructure,
ensuring that devices are correctly assigned to tiers based on their
priority.

A `tier_mask` is added to identify the tier membership of swap devices.
Although tier-based allocation logic is not yet implemented, this
mapping is necessary to track which tier a device belongs to. Upon
activation, the device is assigned to a tier by matching its priority
against the configured tier ranges.
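
For example (device paths, tier names and priorities below are
illustrative only), with tiers SSD:100, HDD:50 and NET:-1 configured:

  # swapon -p 120 /dev/nvme0n1p3   (prio 120 -> SSD tier)
  # swapon -p 60  /dev/sdb2        (prio 60  -> HDD tier)
  # swapon -p 0   /dev/nbd0        (prio 0   -> NET tier)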

The infrastructure allows dynamic modification of tiers, such as
splitting or merging ranges. These operations are permitted provided
that the tier assignment of already configured swap devices remains
unchanged.

This patch also adds the documentation for the swap tier feature,
covering the core concepts, sysfs interface usage, and configuration
details.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/mm/swap-tier.rst | 109 +++++++++++++++++++++++++++++++++
 include/linux/swap.h           |   1 +
 mm/swap_state.c                |   2 +-
 mm/swap_tier.c                 | 106 ++++++++++++++++++++++++++++----
 mm/swap_tier.h                 |  13 +++-
 mm/swapfile.c                  |   2 +
 6 files changed, 219 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst

diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
new file mode 100644
index 000000000000..3386161b9b18
--- /dev/null
+++ b/Documentation/mm/swap-tier.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li <chrisl@kernel.org>, Youngjun Park <youngjun.park@lge.com>
+
+==========
+Swap Tier
+==========
+
+Swap tier is a collection of user-named groups classified by priority ranges.
+It acts as a facilitation layer, allowing users to manage swap devices based
+on their speeds.
+
+Users are encouraged to assign swap device priorities according to device
+speed to fully utilize this feature. While the current implementation is
+integrated with cgroups, the concept is designed to be extensible for other
+subsystems in the future.
+
+Use case
+--------
+
+Users can perform selective swapping by choosing a swap tier assigned according
+to speed within a cgroup.
+
+For more information on cgroup v2, please refer to
+``Documentation/admin-guide/cgroup-v2.rst``.
+
+Priority Range
+--------------
+
+The specified tiers must cover the entire priority range from -1
+(DEF_SWAP_PRIO) to SHRT_MAX.
+
+Consistency
+-----------
+
+Tier consistency is guaranteed with a focus on maximizing flexibility. When a
+swap device is activated within a tier range, a reference is held from the
+start of the tier to the priority of that swap device. This ensures that the
+tier region containing the active swap device does not disappear.
+
+If a request to add a new tier with a priority higher than the current swap
+device is received, the existing tier can be split.
+
+However, specifying a tier in a cgroup does not hold a reference to the tier.
+Consequently, the corresponding tier can disappear at any time.
+
+Configuration Interface
+-----------------------
+
+The swap tiers can be configured via the following interface:
+
+/sys/kernel/mm/swap/tiers
+
+Operations can be performed using the following syntax:
+
+* Add:    ``+"<tiername>":"<start_priority>"``
+* Remove: ``-"<tiername>"``
+* Modify: ``"<tiername>":"<start_priority>"``
+
+Multiple operations can be provided in a single write, separated by spaces (" ")
+or commas (",").
+
+When configuring tiers, the specified value represents the **start priority**
+of that tier. The end priority is automatically determined by the start
+priority of the next higher tier. Consequently, adding or modifying a tier
+automatically adjusts (splits or merges) the ranges of adjacent tiers to
+ensure continuity.
+
+Examples
+--------
+
+**1. Initialization**
+
+A tier starting at -1 is mandatory to cover the entire priority range up to
+SHRT_MAX. In this example, 'HDD' starts at 50, and 'NET' covers the remaining
+lower range starting from -1.
+
+::
+
+    # echo "+HDD:50, +NET:-1" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+**2. Modification and Splitting**
+
+Here, 'HDD' is moved to start at 80, and a new tier 'SSD' is added at 100.
+Notice how the ranges are automatically recalculated:
+* 'SSD' takes the top range (100 to SHRT_MAX), splitting HDD's previous range.
+* 'HDD' is adjusted to the range between 'NET' and 'SSD' (80 to 99).
+* 'NET' automatically extends to fill the gap below 'HDD' (-1 to 79).
+
+::
+
+    # echo "HDD:80, +SSD:100" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     80          99
+    NET              1     -1          79
+
+**3. Removal**
+
+Tiers can be removed using the '-' prefix.
+
+::
+
+    # echo "-SSD,-HDD,-NET" > /sys/kernel/mm/swap/tiers
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..1e68c220a0e7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -262,6 +262,7 @@ struct swap_info_struct {
 	struct percpu_ref users;	/* indicate and keep swap device valid. */
 	unsigned long	flags;		/* SWP_USED etc: see above */
 	signed short	prio;		/* swap priority of this type */
+	int tier_mask;			/* swap tier mask */
 	struct plist_node list;		/* entry in swap_active_head */
 	signed char	type;		/* strange name for an index */
 	unsigned int	max;		/* extent of the swap_map */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f1a7d9cdc648..d46ca61d2e42 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -997,7 +997,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 			goto restore;
 	}
 
-	if (!swap_tiers_validate()) {
+	if (!swap_tiers_update()) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 87882272eec8..d90f6eccb908 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -14,7 +14,7 @@
  * @name: name of the swap_tier.
  * @prio: starting value of priority.
  * @list: linked list of tiers.
-*/
+ */
 static struct swap_tier {
 	char name[MAX_TIERNAME];
 	short prio;
@@ -34,6 +34,8 @@ static LIST_HEAD(swap_tier_inactive_list);
 	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
 	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
 
+#define MASK_TO_TIER(mask) (&swap_tiers[__ffs((mask))])
+
 #define for_each_tier(tier, idx) \
 	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
 		idx++, tier = &swap_tiers[idx])
@@ -55,6 +57,26 @@ static bool swap_tier_is_active(void)
 	return !list_empty(&swap_tier_active_list) ? true : false;
 }
 
+static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio)
+{
+	if (tier->prio <= prio && TIER_END_PRIO(tier) >= prio)
+		return true;
+
+	return false;
+}
+
+static bool swap_tier_prio_is_used(struct swap_tier *self, short prio)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier != self && tier->prio == prio)
+			return true;
+	}
+
+	return false;
+}
+
 static struct swap_tier *swap_tier_lookup(const char *name)
 {
 	struct swap_tier *tier;
@@ -67,12 +89,14 @@ static struct swap_tier *swap_tier_lookup(const char *name)
 	return NULL;
 }
 
+
 void swap_tiers_init(void)
 {
 	struct swap_tier *tier;
 	int idx;
 
 	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+	BUILD_BUG_ON(MAX_SWAPTIER > TIER_DEFAULT_IDX);
 
 	for_each_tier(tier, idx) {
 		INIT_LIST_HEAD(&tier->list);
@@ -145,17 +169,35 @@ static struct swap_tier *swap_tier_prepare(const char *name, short prio)
 	return tier;
 }
 
-static int swap_tier_check_range(short prio)
+static int swap_tier_can_split_range(struct swap_tier *orig_tier,
+	short new_prio)
 {
+	struct swap_info_struct *p;
 	struct swap_tier *tier;
 
 	lockdep_assert_held(&swap_lock);
 	lockdep_assert_held(&swap_tier_lock);
 
-	for_each_active_tier(tier) {
-		/* No overwrite */
-		if (tier->prio == prio)
-			return -EINVAL;
+	plist_for_each_entry(p, &swap_active_head, list) {
+		if (p->tier_mask == TIER_DEFAULT_MASK)
+			continue;
+
+		tier = MASK_TO_TIER(p->tier_mask);
+		if (tier->prio > new_prio)
+			continue;
+		/*
+                 * Prohibit implicit tier reassignment.
+                 * Case 1: Prevent orig_tier devices from dropping out
+                 *         of the new range.
+                 */
+		if (orig_tier == tier && (p->prio < new_prio))
+			return -EBUSY;
+                /*
+                 * Case 2: Prevent other tier devices from entering
+                 *         the new range.
+                 */
+		else if (orig_tier != tier && (p->prio >= new_prio))
+			return -EBUSY;
 	}
 
 	return 0;
@@ -173,7 +215,10 @@ int swap_tiers_add(const char *name, int prio)
 	if (swap_tier_lookup(name))
 		return -EPERM;
 
-	ret = swap_tier_check_range(prio);
+	if (swap_tier_prio_is_used(NULL, prio))
+		return -EBUSY;
+
+	ret = swap_tier_can_split_range(NULL, prio);
 	if (ret)
 		return ret;
 
@@ -183,7 +228,6 @@ int swap_tiers_add(const char *name, int prio)
 		return ret;
 	}
 
-
 	swap_tier_insert_by_prio(tier);
 	return ret;
 }
@@ -200,11 +244,18 @@ int swap_tiers_remove(const char *name)
 	if (!tier)
 		return -EINVAL;
 
+	/* Simulate adding a tier to check for conflicts */
+	ret = swap_tier_can_split_range(NULL, tier->prio);
+	if (ret)
+		return ret;
+
 	list_move(&tier->list, &swap_tier_inactive_list);
 
 	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
-	if (swap_tier_is_active() && tier->prio == DEF_SWAP_PRIO)
-		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
+	if (swap_tier_is_active() && tier->prio == DEF_SWAP_PRIO) {
+		list_last_entry(&swap_tier_active_list, struct swap_tier, list)
+			->prio = DEF_SWAP_PRIO;
+	}
 
 	return ret;
 }
@@ -225,7 +276,10 @@ int swap_tiers_modify(const char *name, int prio)
 	if (tier->prio == prio)
 		return 0;
 
-	ret = swap_tier_check_range(prio);
+	if (swap_tier_prio_is_used(tier, prio))
+		return -EBUSY;
+
+	ret = swap_tier_can_split_range(tier, prio);
 	if (ret)
 		return ret;
 
@@ -283,10 +337,27 @@ void swap_tiers_restore(struct swap_tier_save_ctx ctx[])
 	}
 }
 
-bool swap_tiers_validate(void)
+void swap_tiers_assign_dev(struct swap_info_struct *swp)
 {
 	struct swap_tier *tier;
 
+	lockdep_assert_held(&swap_lock);
+
+	for_each_active_tier(tier) {
+		if (swap_tier_prio_in_range(tier, swp->prio)) {
+			swp->tier_mask = TIER_MASK(tier);
+			return;
+		}
+	}
+
+	swp->tier_mask = TIER_DEFAULT_MASK;
+}
+
+bool swap_tiers_update(void)
+{
+	struct swap_tier *tier;
+	struct swap_info_struct *swp;
+
 	/*
 	 * Initial setting might not cover DEF_SWAP_PRIO.
 	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
@@ -300,5 +371,16 @@ bool swap_tiers_validate(void)
 			return false;
 	}
 
+	/*
+	 * If applied initially, the swap tier_mask may change
+	 * from the default value.
+	 */
+	plist_for_each_entry(swp, &swap_active_head, list) {
+		/* Tier is already configured */
+		if (swp->tier_mask != TIER_DEFAULT_MASK)
+			break;
+		swap_tiers_assign_dev(swp);
+	}
+
 	return true;
 }
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index 4b1b0602d691..de81d540e3b5 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -14,6 +14,9 @@
 #define MAX_SWAPTIER		8
 #endif
 
+/* Forward declarations */
+struct swap_info_struct;
+
 extern spinlock_t swap_tier_lock;
 
 struct swap_tier_save_ctx {
@@ -24,6 +27,10 @@ struct swap_tier_save_ctx {
 #define DEFINE_SWAP_TIER_SAVE_CTX(_name) \
 	struct swap_tier_save_ctx _name[MAX_SWAPTIER] = {0}
 
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1 << TIER_DEFAULT_IDX)
+
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
@@ -34,5 +41,9 @@ int swap_tiers_modify(const char *name, int prio);
 
 void swap_tiers_save(struct swap_tier_save_ctx ctx[]);
 void swap_tiers_restore(struct swap_tier_save_ctx ctx[]);
-bool swap_tiers_validate(void);
+bool swap_tiers_update(void);
+
+/* Tier assignment */
+void swap_tiers_assign_dev(struct swap_info_struct *swp);
+
 #endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c27952b41d4f..4f8ce021c5bd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2672,6 +2672,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 
 	/* Add back to available list */
 	add_to_avail_list(si, true);
+
+	swap_tiers_assign_dev(si);
 }
 
 static void enable_swap_info(struct swap_info_struct *si, int prio,
-- 
2.34.1




* [RFC PATCH v2 v2 3/5] mm: memcontrol: add interface for swap tier selection
  2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
  2026-01-26  6:52 ` [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure Youngjun Park
  2026-01-26  6:52 ` [RFC PATCH v2 v2 2/5] mm: swap: associate swap devices with tiers Youngjun Park
@ 2026-01-26  6:52 ` Youngjun Park
  2026-01-26  6:52 ` [RFC PATCH v2 v2 4/5] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Youngjun Park @ 2026-01-26  6:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, gunho.lee, taejoon.song, austin.kim,
	youngjun.park

This patch integrates the swap tier infrastructure with cgroup,
enabling the selection of specific swap devices per cgroup by
configuring allowed swap tiers.

The new `memory.swap.tiers` interface controls allowed swap tiers via a mask.
By default, the mask is set to include all tiers, allowing specific tiers to
be excluded or restored. Note that effective tiers are calculated separately
using a dedicated mask to respect the cgroup hierarchy. Consequently,
configured tiers may differ from effective ones, as they must be a subset
of the parent's.

Note that cgroups do not pin swap tiers. This is similar to the
`cpuset` controller, which does not prevent CPU hotplug. This
approach ensures flexibility by allowing tier configuration changes
regardless of cgroup usage.
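
A sketch of the expected interaction, assuming global tiers SSD, HDD and
NET and a cgroup at /sys/fs/cgroup/test (names and paths are examples):

  # echo "-HDD" > /sys/fs/cgroup/test/memory.swap.tiers
  # cat /sys/fs/cgroup/test/memory.swap.tiers
  SSD NET
  SSD NET
    (first line: tiers configured in this cgroup; second line: effective
     tiers after intersecting with the parent's effective set)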

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 27 +++++++++
 include/linux/memcontrol.h              |  3 +-
 mm/memcontrol.c                         | 80 +++++++++++++++++++++++++
 mm/swap_tier.c                          | 66 ++++++++++++++++++++
 mm/swap_tier.h                          | 21 +++++++
 mm/swapfile.c                           |  5 ++
 6 files changed, 201 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7f5b59d95fce..776a908ce1b9 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1848,6 +1848,33 @@ The following nested keys are defined.
 	Swap usage hard limit.  If a cgroup's swap usage reaches this
 	limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.swap.tiers
+        A read-write nested-keyed file which exists on non-root
+        cgroups. The default is to enable all tiers.
+
+        This interface allows selecting which swap tiers a cgroup can
+        use for swapping out memory.
+
+        The effective tiers are inherited from the parent. Only tiers
+        effective in the parent can be effective in the child. However,
+        the child can explicitly disable tiers allowed by the parent.
+
+        When read, the file shows two lines:
+          - The first line shows the operation string that was
+            written to this file.
+          - The second line shows the effective operation after
+            merging with parent settings.
+
+        When writing, the format is:
+          (+/-)(TIER_NAME) (+/-)(TIER_NAME) ...
+
+        Valid tier names are those configured in
+        /sys/kernel/mm/swap/tiers.
+
+        Each tier can be prefixed with:
+          +    Enable this tier
+          -    Disable this tier
+
   memory.swap.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6c82c8f73e1..542bee1b5f60 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,7 +283,8 @@ struct mem_cgroup {
 	/* per-memcg mm_struct list */
 	struct lru_gen_mm_list mm_list;
 #endif
-
+	int tier_mask;
+	int tier_effective_mask;
 #ifdef CONFIG_MEMCG_V1
 	/* Legacy consumer-oriented counters */
 	struct page_counter kmem;		/* v1 only */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 007413a53b45..c0a0a957a630 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap_tier.h"
 
 #include <linux/uaccess.h>
 
@@ -3691,6 +3692,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	lru_gen_exit_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
+	swap_tiers_memcg_sync_mask(memcg);
 	__mem_cgroup_free(memcg);
 }
 
@@ -3792,6 +3794,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	WRITE_ONCE(memcg->zswap_writeback, true);
 #endif
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
+	memcg->tier_mask = TIER_ALL_MASK;
+	swap_tiers_memcg_inherit_mask(memcg, parent);
+
 	if (parent) {
 		WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
 
@@ -5352,6 +5357,75 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int swap_tier_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	swap_tiers_mask_show(m, memcg->tier_mask);
+	swap_tiers_mask_show(m, memcg->tier_effective_mask);
+
+	return 0;
+}
+
+static ssize_t swap_tier_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	char *pos, *token;
+	int ret = 0;
+
+	pos = strstrip(buf);
+
+	spin_lock(&swap_tier_lock);
+	if (!*pos) {
+		memcg->tier_mask = TIER_ALL_MASK;
+		goto sync;
+	}
+
+	while ((token = strsep(&pos, " \t\n")) != NULL) {
+		int mask;
+
+		if (!*token)
+			continue;
+
+		if (token[0] != '-' && token[0] != '+') {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		mask = swap_tiers_mask_lookup(token+1);
+		if (!mask) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		/*
+		 * A child may set any tier bits here; the effective mask is
+		 * recalculated below and stays limited to the parent's tiers.
+		 */
+		switch (token[0]) {
+		case '-':
+			memcg->tier_mask &= ~mask;
+			break;
+		case '+':
+			memcg->tier_mask |= mask;
+			break;
+		default:
+			ret = -EINVAL;
+			break;
+		}
+
+		if (ret)
+			goto err;
+	}
+
+sync:
+	__swap_tiers_memcg_sync_mask(memcg);
+err:
+	spin_unlock(&swap_tier_lock);
+	return ret ? ret : nbytes;
+}
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5384,6 +5458,12 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+	{
+		.name = "swap.tiers",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_tier_show,
+		.write = swap_tier_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index d90f6eccb908..e860c87292e2 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -384,3 +384,69 @@ bool swap_tiers_update(void)
 
 	return true;
 }
+
+void swap_tiers_mask_show(struct seq_file *m, int mask)
+{
+	struct swap_tier *tier;
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		if (mask & TIER_MASK(tier))
+			seq_printf(m, "%s ", tier->name);
+	}
+	spin_unlock(&swap_tier_lock);
+	seq_puts(m, "\n");
+}
+
+int swap_tiers_mask_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		if (!strcmp(name, tier->name))
+			return TIER_MASK(tier);
+	}
+
+	return 0;
+}
+
+static void __swap_tier_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent)
+{
+	int effective_mask
+		= parent ? parent->tier_effective_mask : TIER_ALL_MASK;
+
+	memcg->tier_effective_mask
+		= effective_mask & memcg->tier_mask;
+}
+
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent)
+{
+	spin_lock(&swap_tier_lock);
+	__swap_tier_memcg_inherit_mask(memcg, parent);
+	spin_unlock(&swap_tier_lock);
+}
+
+void __swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *child;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (memcg == root_mem_cgroup)
+		return;
+
+	for_each_mem_cgroup_tree(child, memcg)
+		__swap_tier_memcg_inherit_mask(child, parent_mem_cgroup(child));
+}
+
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg)
+{
+	spin_lock(&swap_tier_lock);
+	memcg->tier_mask = TIER_ALL_MASK;
+	__swap_tiers_memcg_sync_mask(memcg);
+	spin_unlock(&swap_tier_lock);
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index de81d540e3b5..8652a7f993ab 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -46,4 +46,25 @@ bool swap_tiers_update(void);
 /* Tier assignment */
 void swap_tiers_assign_dev(struct swap_info_struct *swp);
 
+/* Memcg related functions */
+void swap_tiers_mask_show(struct seq_file *m, int mask);
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent);
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg);
+void __swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg);
+
+/* Mask and tier lookup */
+int swap_tiers_mask_lookup(const char *name);
+
+/**
+ * swap_tiers_mask_test - Check whether two tier masks intersect
+ * @tier_mask: the device's tier mask to check
+ * @mask: the tier mask to compare against
+ *
+ * Return: true if any tier bit is set in both masks, false otherwise
+ */
+static inline bool swap_tiers_mask_test(int tier_mask, int mask)
+{
+	return tier_mask & mask;
+}
 #endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4f8ce021c5bd..dd97e850ea2c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1348,10 +1348,15 @@ static bool swap_alloc_fast(struct folio *folio)
 static void swap_alloc_slow(struct folio *folio)
 {
 	struct swap_info_struct *si, *next;
+	int mask = folio_memcg(folio) ?
+		folio_memcg(folio)->tier_effective_mask : TIER_ALL_MASK;
 
 	spin_lock(&swap_avail_lock);
 start_over:
 	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+		if (!swap_tiers_mask_test(si->tier_mask, mask))
+			continue;
+
 		/* Rotate the device and switch to a new cluster */
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
-- 
2.34.1




* [RFC PATCH v2 v2 4/5] mm, swap: change back to use each swap device's percpu cluster
  2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (2 preceding siblings ...)
  2026-01-26  6:52 ` [RFC PATCH v2 v2 3/5] mm: memcontrol: add interface for swap tier selection Youngjun Park
@ 2026-01-26  6:52 ` Youngjun Park
  2026-02-12  7:37   ` Chris Li
  2026-01-26  6:52 ` [RFC PATCH v2 v2 5/5] mm, swap: introduce percpu swap device cache to avoid fragmentation Youngjun Park
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Youngjun Park @ 2026-01-26  6:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, gunho.lee, taejoon.song, austin.kim,
	youngjun.park

This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
allocation fast path").

With the newly introduced swap tiers, the global percpu cluster causes
two issues:
1) It causes cache oscillation for the same order across different swap
   devices (si) when two memcgs are only allowed to use different devices
   and both are swapping out.
2) It can cause priority inversion between swap devices. Imagine two
   memcgs, memcg1 and memcg2. Memcg1 can access si A and B, where A is
   the higher priority device, while memcg2 can only access si B. Memcg2
   could then populate the global percpu cluster with si B, and memcg1
   would take si B in the fast path even though si A is not exhausted.

Hence, in order to support swap tiers, revert that commit and go back to
using each swap device's percpu cluster.

Suggested-by: Kairui Song <kasong@tencent.com>
Co-developed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 include/linux/swap.h |  17 ++++--
 mm/swapfile.c        | 142 ++++++++++++++-----------------------------
 2 files changed, 57 insertions(+), 102 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e68c220a0e7..6921e22b14d3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -247,11 +247,18 @@ enum {
 #define SWAP_NR_ORDERS		1
 #endif
 
-/*
- * We keep using same cluster for rotational device so IO will be sequential.
- * The purpose is to optimize SWAP throughput on these device.
- */
+ /*
+  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
+  * its own cluster and swapout sequentially. The purpose is to optimize swapout
+  * throughput.
+  */
+struct percpu_cluster {
+	local_lock_t lock; /* Protect the percpu_cluster above */
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+};
+
 struct swap_sequential_cluster {
+	spinlock_t lock; /* Serialize usage of global cluster */
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
@@ -277,8 +284,8 @@ struct swap_info_struct {
 					/* list of cluster that are fragmented or contented */
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
+	struct percpu_cluster	__percpu *percpu_cluster; /* per cpu's swap location */
 	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
-	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dd97e850ea2c..5e3b87799440 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -118,18 +118,6 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
-struct percpu_swap_cluster {
-	struct swap_info_struct *si[SWAP_NR_ORDERS];
-	unsigned long offset[SWAP_NR_ORDERS];
-	local_lock_t lock;
-};
-
-static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
-	.si = { NULL },
-	.offset = { SWAP_ENTRY_INVALID },
-	.lock = INIT_LOCAL_LOCK(),
-};
-
 /* May return NULL on invalid type, caller must check for NULL return */
 static struct swap_info_struct *swap_type_to_info(int type)
 {
@@ -477,7 +465,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	 * Swap allocator uses percpu clusters and holds the local lock.
 	 */
 	lockdep_assert_held(&ci->lock);
-	lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+	lockdep_assert_held(this_cpu_ptr(&si->percpu_cluster->lock));
 
 	/* The cluster must be free and was just isolated from the free list. */
 	VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci));
@@ -495,8 +483,8 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	 */
 	spin_unlock(&ci->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
-		spin_unlock(&si->global_cluster_lock);
-	local_unlock(&percpu_swap_cluster.lock);
+		spin_unlock(&si->global_cluster->lock);
+	local_unlock(&si->percpu_cluster->lock);
 
 	table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL);
 
@@ -508,9 +496,9 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	 * could happen with ignoring the percpu cluster is fragmentation,
 	 * which is acceptable since this fallback and race is rare.
 	 */
-	local_lock(&percpu_swap_cluster.lock);
+	local_lock(&si->percpu_cluster->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
-		spin_lock(&si->global_cluster_lock);
+		spin_lock(&si->global_cluster->lock);
 	spin_lock(&ci->lock);
 
 	/* Nothing except this helper should touch a dangling empty cluster. */
@@ -622,7 +610,7 @@ static bool swap_do_scheduled_discard(struct swap_info_struct *si)
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
 		/*
 		 * Delete the cluster from list to prepare for discard, but keep
-		 * the CLUSTER_FLAG_DISCARD flag, percpu_swap_cluster could be
+		 * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster
 		 * pointing to it, or ran into by relocate_cluster.
 		 */
 		list_del(&ci->list);
@@ -953,12 +941,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	swap_cluster_unlock(ci);
-	if (si->flags & SWP_SOLIDSTATE) {
-		this_cpu_write(percpu_swap_cluster.offset[order], next);
-		this_cpu_write(percpu_swap_cluster.si[order], si);
-	} else {
+	if (si->flags & SWP_SOLIDSTATE)
+		this_cpu_write(si->percpu_cluster->next[order], next);
+	else
 		si->global_cluster->next[order] = next;
-	}
+
 	return found;
 }
 
@@ -1052,13 +1039,17 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 	if (order && !(si->flags & SWP_BLKDEV))
 		return 0;
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+		/* Fast path using per CPU cluster */
+		local_lock(&si->percpu_cluster->lock);
+		offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	} else {
 		/* Serialize HDD SWAP allocation for each device. */
-		spin_lock(&si->global_cluster_lock);
+		spin_lock(&si->global_cluster->lock);
 		offset = si->global_cluster->next[order];
-		if (offset == SWAP_ENTRY_INVALID)
-			goto new_cluster;
+	}
 
+	if (offset != SWAP_ENTRY_INVALID) {
 		ci = swap_cluster_lock(si, offset);
 		/* Cluster could have been used by another order */
 		if (cluster_is_usable(ci, order)) {
@@ -1072,7 +1063,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 			goto done;
 	}
 
-new_cluster:
 	/*
 	 * If the device need discard, prefer new cluster over nonfull
 	 * to spread out the writes.
@@ -1129,8 +1119,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 			goto done;
 	}
 done:
-	if (!(si->flags & SWP_SOLIDSTATE))
-		spin_unlock(&si->global_cluster_lock);
+	if (si->flags & SWP_SOLIDSTATE)
+		local_unlock(&si->percpu_cluster->lock);
+	else
+		spin_unlock(&si->global_cluster->lock);
 
 	return found;
 }
@@ -1311,41 +1303,8 @@ static bool get_swap_device_info(struct swap_info_struct *si)
 	return true;
 }
 
-/*
- * Fast path try to get swap entries with specified order from current
- * CPU's swap entry pool (a cluster).
- */
-static bool swap_alloc_fast(struct folio *folio)
-{
-	unsigned int order = folio_order(folio);
-	struct swap_cluster_info *ci;
-	struct swap_info_struct *si;
-	unsigned int offset;
-
-	/*
-	 * Once allocated, swap_info_struct will never be completely freed,
-	 * so checking it's liveness by get_swap_device_info is enough.
-	 */
-	si = this_cpu_read(percpu_swap_cluster.si[order]);
-	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
-	if (!si || !offset || !get_swap_device_info(si))
-		return false;
-
-	ci = swap_cluster_lock(si, offset);
-	if (cluster_is_usable(ci, order)) {
-		if (cluster_is_empty(ci))
-			offset = cluster_offset(si, ci);
-		alloc_swap_scan_cluster(si, ci, folio, offset);
-	} else {
-		swap_cluster_unlock(ci);
-	}
-
-	put_swap_device(si);
-	return folio_test_swapcache(folio);
-}
-
 /* Rotate the device and switch to a new cluster */
-static void swap_alloc_slow(struct folio *folio)
+static void swap_alloc_entry(struct folio *folio)
 {
 	struct swap_info_struct *si, *next;
 	int mask = folio_memcg(folio) ?
@@ -1363,6 +1322,7 @@ static void swap_alloc_slow(struct folio *folio)
 		if (get_swap_device_info(si)) {
 			cluster_alloc_swap_entry(si, folio);
 			put_swap_device(si);
+
 			if (folio_test_swapcache(folio))
 				return;
 			if (folio_test_large(folio))
@@ -1522,11 +1482,7 @@ int folio_alloc_swap(struct folio *folio)
 	}
 
 again:
-	local_lock(&percpu_swap_cluster.lock);
-	if (!swap_alloc_fast(folio))
-		swap_alloc_slow(folio);
-	local_unlock(&percpu_swap_cluster.lock);
-
+	swap_alloc_entry(folio);
 	if (!order && unlikely(!folio_test_swapcache(folio))) {
 		if (swap_sync_discard())
 			goto again;
@@ -1945,9 +1901,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 			 * Grab the local lock to be compliant
 			 * with swap table allocation.
 			 */
-			local_lock(&percpu_swap_cluster.lock);
 			offset = cluster_alloc_swap_entry(si, NULL);
-			local_unlock(&percpu_swap_cluster.lock);
 			if (offset)
 				entry = swp_entry(si->type, offset);
 		}
@@ -2751,28 +2705,6 @@ static void free_cluster_info(struct swap_cluster_info *cluster_info,
 	kvfree(cluster_info);
 }
 
-/*
- * Called after swap device's reference count is dead, so
- * neither scan nor allocation will use it.
- */
-static void flush_percpu_swap_cluster(struct swap_info_struct *si)
-{
-	int cpu, i;
-	struct swap_info_struct **pcp_si;
-
-	for_each_possible_cpu(cpu) {
-		pcp_si = per_cpu_ptr(percpu_swap_cluster.si, cpu);
-		/*
-		 * Invalidate the percpu swap cluster cache, si->users
-		 * is dead, so no new user will point to it, just flush
-		 * any existing user.
-		 */
-		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cmpxchg(&pcp_si[i], si, NULL);
-	}
-}
-
-
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
@@ -2856,7 +2788,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	flush_work(&p->discard_work);
 	flush_work(&p->reclaim_work);
-	flush_percpu_swap_cluster(p);
 
 	destroy_swap_extents(p);
 	if (p->flags & SWP_CONTINUED)
@@ -2885,6 +2816,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
+	free_percpu(p->percpu_cluster);
+	p->percpu_cluster = NULL;
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -3268,7 +3201,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
-	int err = -ENOMEM;
+	int cpu, err = -ENOMEM;
 	unsigned long i;
 
 	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
@@ -3278,14 +3211,27 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		if (!si->percpu_cluster)
+			goto err;
+
+		for_each_possible_cpu(cpu) {
+			struct percpu_cluster *cluster;
+
+			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_ENTRY_INVALID;
+			local_lock_init(&cluster->lock);
+		}
+	} else {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
 				     GFP_KERNEL);
 		if (!si->global_cluster)
 			goto err;
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
 			si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
-		spin_lock_init(&si->global_cluster_lock);
+		spin_lock_init(&si->global_cluster->lock);
 	}
 
 	/*
@@ -3566,6 +3512,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
+	free_percpu(si->percpu_cluster);
+	si->percpu_cluster = NULL;
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-- 
2.34.1



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 v2 5/5] mm, swap: introduce percpu swap device cache to avoid fragmentation
  2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (3 preceding siblings ...)
  2026-01-26  6:52 ` [RFC PATCH v2 v2 4/5] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
@ 2026-01-26  6:52 ` Youngjun Park
  2026-02-12  6:12 ` [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Youngjun Park @ 2026-01-26  6:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, gunho.lee, taejoon.song, austin.kim,
	youngjun.park

When using per-device percpu clusters (instead of the global one),
naive allocation logic triggers swap device rotation on every
allocation. This behavior leads to severe fragmentation and performance
regression.

To address this, this patch introduces a per-cpu cache for the swap
device. The allocation logic is updated to prioritize the per-cpu
cluster within the cached swap device, effectively restoring the
traditional fastpath and slowpath flow. This approach minimizes side
effects on the existing fastpath.

With this change, swap device rotation occurs only when the current
cached device is unable to satisfy the allocation, rather than on
every attempt.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 78 +++++++++++++++++++++++++++++++++++++-------
 2 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6921e22b14d3..ac634a21683a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,7 +253,6 @@ enum {
   * throughput.
   */
 struct percpu_cluster {
-	local_lock_t lock; /* Protect the percpu_cluster above */
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5e3b87799440..0dcd451afee5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -106,6 +106,16 @@ PLIST_HEAD(swap_active_head);
 static PLIST_HEAD(swap_avail_head);
 static DEFINE_SPINLOCK(swap_avail_lock);
 
+struct percpu_swap_device {
+	struct swap_info_struct *si[SWAP_NR_ORDERS];
+	local_lock_t lock;
+};
+
+static DEFINE_PER_CPU(struct percpu_swap_device, percpu_swap_device) = {
+	.si = { NULL },
+	.lock = INIT_LOCAL_LOCK(),
+};
+
 struct swap_info_struct *swap_info[MAX_SWAPFILES];
 
 static struct kmem_cache *swap_table_cachep;
@@ -465,7 +475,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	 * Swap allocator uses percpu clusters and holds the local lock.
 	 */
 	lockdep_assert_held(&ci->lock);
-	lockdep_assert_held(this_cpu_ptr(&si->percpu_cluster->lock));
+	lockdep_assert_held(this_cpu_ptr(&percpu_swap_device.lock));
 
 	/* The cluster must be free and was just isolated from the free list. */
 	VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci));
@@ -484,7 +494,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	spin_unlock(&ci->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster->lock);
-	local_unlock(&si->percpu_cluster->lock);
+	local_unlock(&percpu_swap_device.lock);
 
 	table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL);
 
@@ -496,7 +506,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	 * could happen with ignoring the percpu cluster is fragmentation,
 	 * which is acceptable since this fallback and race is rare.
 	 */
-	local_lock(&si->percpu_cluster->lock);
+	local_lock(&percpu_swap_device.lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_lock(&si->global_cluster->lock);
 	spin_lock(&ci->lock);
@@ -941,9 +951,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	swap_cluster_unlock(ci);
-	if (si->flags & SWP_SOLIDSTATE)
+	if (si->flags & SWP_SOLIDSTATE) {
 		this_cpu_write(si->percpu_cluster->next[order], next);
-	else
+		this_cpu_write(percpu_swap_device.si[order], si);
+	} else
 		si->global_cluster->next[order] = next;
 
 	return found;
@@ -1041,7 +1052,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 
 	if (si->flags & SWP_SOLIDSTATE) {
 		/* Fast path using per CPU cluster */
-		local_lock(&si->percpu_cluster->lock);
 		offset = __this_cpu_read(si->percpu_cluster->next[order]);
 	} else {
 		/* Serialize HDD SWAP allocation for each device. */
@@ -1119,9 +1129,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 			goto done;
 	}
 done:
-	if (si->flags & SWP_SOLIDSTATE)
-		local_unlock(&si->percpu_cluster->lock);
-	else
+	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster->lock);
 
 	return found;
@@ -1303,8 +1311,27 @@ static bool get_swap_device_info(struct swap_info_struct *si)
 	return true;
 }
 
+static bool swap_alloc_fast(struct folio *folio)
+{
+	unsigned int order = folio_order(folio);
+	struct swap_info_struct *si;
+
+	/*
+	 * Once allocated, swap_info_struct will never be completely freed,
+	 * so checking it's liveness by get_swap_device_info is enough.
+	 */
+	si = this_cpu_read(percpu_swap_device.si[order]);
+	if (!si || !get_swap_device_info(si))
+		return false;
+
+	cluster_alloc_swap_entry(si, folio);
+	put_swap_device(si);
+
+	return folio_test_swapcache(folio);
+}
+
 /* Rotate the device and switch to a new cluster */
-static void swap_alloc_entry(struct folio *folio)
+static void swap_alloc_slow(struct folio *folio)
 {
 	struct swap_info_struct *si, *next;
 	int mask = folio_memcg(folio) ?
@@ -1482,7 +1509,11 @@ int folio_alloc_swap(struct folio *folio)
 	}
 
 again:
-	swap_alloc_entry(folio);
+	local_lock(&percpu_swap_device.lock);
+	if (!swap_alloc_fast(folio))
+		swap_alloc_slow(folio);
+	local_unlock(&percpu_swap_device.lock);
+
 	if (!order && unlikely(!folio_test_swapcache(folio))) {
 		if (swap_sync_discard())
 			goto again;
@@ -1901,7 +1932,9 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 			 * Grab the local lock to be compliant
 			 * with swap table allocation.
 			 */
+			local_lock(&percpu_swap_device.lock);
 			offset = cluster_alloc_swap_entry(si, NULL);
+			local_unlock(&percpu_swap_device.lock);
 			if (offset)
 				entry = swp_entry(si->type, offset);
 		}
@@ -2705,6 +2738,27 @@ static void free_cluster_info(struct swap_cluster_info *cluster_info,
 	kvfree(cluster_info);
 }
 
+/*
+ * Called after swap device's reference count is dead, so
+ * neither scan nor allocation will use it.
+ */
+static void flush_percpu_swap_device(struct swap_info_struct *si)
+{
+	int cpu, i;
+	struct swap_info_struct **pcp_si;
+
+	for_each_possible_cpu(cpu) {
+		pcp_si = per_cpu_ptr(percpu_swap_device.si, cpu);
+		/*
+		 * Invalidate the percpu swap device cache, si->users
+		 * is dead, so no new user will point to it, just flush
+		 * any existing user.
+		 */
+		for (i = 0; i < SWAP_NR_ORDERS; i++)
+			cmpxchg(&pcp_si[i], si, NULL);
+	}
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
@@ -2788,6 +2842,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	flush_work(&p->discard_work);
 	flush_work(&p->reclaim_work);
+	flush_percpu_swap_device(p);
 
 	destroy_swap_extents(p);
 	if (p->flags & SWP_CONTINUED)
@@ -3222,7 +3277,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
 			for (i = 0; i < SWAP_NR_ORDERS; i++)
 				cluster->next[i] = SWAP_ENTRY_INVALID;
-			local_lock_init(&cluster->lock);
 		}
 	} else {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
-- 
2.34.1



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (4 preceding siblings ...)
  2026-01-26  6:52 ` [RFC PATCH v2 v2 5/5] mm, swap: introduce percpu swap device cache to avoid fragmentation Youngjun Park
@ 2026-02-12  6:12 ` Chris Li
  2026-02-12  9:22   ` Chris Li
  2026-02-13  1:59   ` YoungJun Park
  2026-02-12 17:57 ` Nhat Pham
  2026-02-12 18:33 ` Shakeel Butt
  7 siblings, 2 replies; 23+ messages in thread
From: Chris Li @ 2026-02-12  6:12 UTC (permalink / raw)
  To: Youngjun Park
  Cc: Andrew Morton, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

Hi Youngjun,

On Sun, Jan 25, 2026 at 10:53 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This is the second version of the RFC for the "Swap Tiers" concept.
> Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
>
> This version incorporates feedback received during LPC 2025 and addresses
> comments from the previous review. We have also included experimental
> results based on usage scenarios intended for our internal platforms.

Thanks for the patch series.

Sorry for the late reply. I have been wanting to reply to it but got
super busy at work.

Some high level feedback for the series. Now that you have demonstrated
the whole series, let's focus on making small, mergeable baby steps,
just like the swap table work was split into different phases. Make
each step minimal, and have each step show some value. Do the MVP; we
can always add more features as a follow-up step.

I suggest the first step is getting the tier bits defined. Add only,
no delete. Get that reviewed and merged, then the next step is to use
those tiers.

Chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 v2 4/5] mm, swap: change back to use each swap device's percpu cluster
  2026-01-26  6:52 ` [RFC PATCH v2 v2 4/5] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
@ 2026-02-12  7:37   ` Chris Li
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Li @ 2026-02-12  7:37 UTC (permalink / raw)
  To: Youngjun Park
  Cc: Andrew Morton, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Sun, Jan 25, 2026 at 11:08 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
> allocation fast path").
>
> Because in the newly introduced swap tiers, the global percpu cluster
> will cause two issues:
> 1) it will cause caching oscillation in the same order of different si
>    if two different memcg can only be allowed to access different si and
>    both of them are swapping out.
> 2) It can cause priority inversion on swap devices. Imagine a case where
>    there are two memcg, say memcg1 and memcg2. Memcg1 can access si A, B
>    and A is higher priority device. While memcg2 can only access si B.
>    Then memcg 2 could write the global percpu cluster with si B, then
>    memcg1 take si B in fast path even though si A is not exhausted.

One idea: instead of using a percpu cluster per swap device, you can
make the global percpu cluster per tier. Because the maximum number of
tiers is smaller than the maximum number of swap devices, that is
likely a win.
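
Very roughly, something along these lines (an untested sketch only;
MAX_SWAPTIER is the constant from patch 1, and the struct/variable
names here are made up for illustration):

	/* Per-CPU allocation cache indexed by tier instead of by swap device. */
	struct percpu_tier_cluster {
		struct swap_info_struct *si[SWAP_NR_ORDERS];
		unsigned int offset[SWAP_NR_ORDERS];
	};

	struct percpu_swap_tiers {
		struct percpu_tier_cluster tier[MAX_SWAPTIER];
		local_lock_t lock;
	};

	static DEFINE_PER_CPU(struct percpu_swap_tiers, percpu_swap_tiers) = {
		.lock = INIT_LOCAL_LOCK(),
	};

The fast path would then look up the slot for the tier it is allowed to
allocate from, instead of caching a single si per order globally.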

Chris

> Hence in order to support swap tier, revert commit to use
> each swap device's percpu cluster.
>
> Suggested-by: Kairui Song <kasong@tencent.com>
> Co-developed-by: Baoquan He <bhe@redhat.com>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  include/linux/swap.h |  17 ++++--
>  mm/swapfile.c        | 142 ++++++++++++++-----------------------------
>  2 files changed, 57 insertions(+), 102 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 1e68c220a0e7..6921e22b14d3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -247,11 +247,18 @@ enum {
>  #define SWAP_NR_ORDERS         1
>  #endif
>
> -/*
> - * We keep using same cluster for rotational device so IO will be sequential.
> - * The purpose is to optimize SWAP throughput on these device.
> - */
> + /*
> +  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
> +  * its own cluster and swapout sequentially. The purpose is to optimize swapout
> +  * throughput.
> +  */
> +struct percpu_cluster {
> +       local_lock_t lock; /* Protect the percpu_cluster above */
> +       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> +};
> +
>  struct swap_sequential_cluster {
> +       spinlock_t lock; /* Serialize usage of global cluster */
>         unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
>
> @@ -277,8 +284,8 @@ struct swap_info_struct {
>                                         /* list of cluster that are fragmented or contented */
>         unsigned int pages;             /* total of usable pages of swap */
>         atomic_long_t inuse_pages;      /* number of those currently in use */
> +       struct percpu_cluster   __percpu *percpu_cluster; /* per cpu's swap location */
>         struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
> -       spinlock_t global_cluster_lock; /* Serialize usage of global cluster */
>         struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>         struct block_device *bdev;      /* swap device or bdev of swap file */
>         struct file *swap_file;         /* seldom referenced */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index dd97e850ea2c..5e3b87799440 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -118,18 +118,6 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
>
>  atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>
> -struct percpu_swap_cluster {
> -       struct swap_info_struct *si[SWAP_NR_ORDERS];
> -       unsigned long offset[SWAP_NR_ORDERS];
> -       local_lock_t lock;
> -};
> -
> -static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
> -       .si = { NULL },
> -       .offset = { SWAP_ENTRY_INVALID },
> -       .lock = INIT_LOCAL_LOCK(),
> -};
> -
>  /* May return NULL on invalid type, caller must check for NULL return */
>  static struct swap_info_struct *swap_type_to_info(int type)
>  {
> @@ -477,7 +465,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>          * Swap allocator uses percpu clusters and holds the local lock.
>          */
>         lockdep_assert_held(&ci->lock);
> -       lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
> +       lockdep_assert_held(this_cpu_ptr(&si->percpu_cluster->lock));
>
>         /* The cluster must be free and was just isolated from the free list. */
>         VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci));
> @@ -495,8 +483,8 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>          */
>         spin_unlock(&ci->lock);
>         if (!(si->flags & SWP_SOLIDSTATE))
> -               spin_unlock(&si->global_cluster_lock);
> -       local_unlock(&percpu_swap_cluster.lock);
> +               spin_unlock(&si->global_cluster->lock);
> +       local_unlock(&si->percpu_cluster->lock);
>
>         table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL);
>
> @@ -508,9 +496,9 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>          * could happen with ignoring the percpu cluster is fragmentation,
>          * which is acceptable since this fallback and race is rare.
>          */
> -       local_lock(&percpu_swap_cluster.lock);
> +       local_lock(&si->percpu_cluster->lock);
>         if (!(si->flags & SWP_SOLIDSTATE))
> -               spin_lock(&si->global_cluster_lock);
> +               spin_lock(&si->global_cluster->lock);
>         spin_lock(&ci->lock);
>
>         /* Nothing except this helper should touch a dangling empty cluster. */
> @@ -622,7 +610,7 @@ static bool swap_do_scheduled_discard(struct swap_info_struct *si)
>                 ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
>                 /*
>                  * Delete the cluster from list to prepare for discard, but keep
> -                * the CLUSTER_FLAG_DISCARD flag, percpu_swap_cluster could be
> +                * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster
>                  * pointing to it, or ran into by relocate_cluster.
>                  */
>                 list_del(&ci->list);
> @@ -953,12 +941,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  out:
>         relocate_cluster(si, ci);
>         swap_cluster_unlock(ci);
> -       if (si->flags & SWP_SOLIDSTATE) {
> -               this_cpu_write(percpu_swap_cluster.offset[order], next);
> -               this_cpu_write(percpu_swap_cluster.si[order], si);
> -       } else {
> +       if (si->flags & SWP_SOLIDSTATE)
> +               this_cpu_write(si->percpu_cluster->next[order], next);
> +       else
>                 si->global_cluster->next[order] = next;
> -       }
> +
>         return found;
>  }
>
> @@ -1052,13 +1039,17 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
>         if (order && !(si->flags & SWP_BLKDEV))
>                 return 0;
>
> -       if (!(si->flags & SWP_SOLIDSTATE)) {
> +       if (si->flags & SWP_SOLIDSTATE) {
> +               /* Fast path using per CPU cluster */
> +               local_lock(&si->percpu_cluster->lock);
> +               offset = __this_cpu_read(si->percpu_cluster->next[order]);
> +       } else {
>                 /* Serialize HDD SWAP allocation for each device. */
> -               spin_lock(&si->global_cluster_lock);
> +               spin_lock(&si->global_cluster->lock);
>                 offset = si->global_cluster->next[order];
> -               if (offset == SWAP_ENTRY_INVALID)
> -                       goto new_cluster;
> +       }
>
> +       if (offset != SWAP_ENTRY_INVALID) {
>                 ci = swap_cluster_lock(si, offset);
>                 /* Cluster could have been used by another order */
>                 if (cluster_is_usable(ci, order)) {
> @@ -1072,7 +1063,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
>                         goto done;
>         }
>
> -new_cluster:
>         /*
>          * If the device need discard, prefer new cluster over nonfull
>          * to spread out the writes.
> @@ -1129,8 +1119,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
>                         goto done;
>         }
>  done:
> -       if (!(si->flags & SWP_SOLIDSTATE))
> -               spin_unlock(&si->global_cluster_lock);
> +       if (si->flags & SWP_SOLIDSTATE)
> +               local_unlock(&si->percpu_cluster->lock);
> +       else
> +               spin_unlock(&si->global_cluster->lock);
>
>         return found;
>  }
> @@ -1311,41 +1303,8 @@ static bool get_swap_device_info(struct swap_info_struct *si)
>         return true;
>  }
>
> -/*
> - * Fast path try to get swap entries with specified order from current
> - * CPU's swap entry pool (a cluster).
> - */
> -static bool swap_alloc_fast(struct folio *folio)
> -{
> -       unsigned int order = folio_order(folio);
> -       struct swap_cluster_info *ci;
> -       struct swap_info_struct *si;
> -       unsigned int offset;
> -
> -       /*
> -        * Once allocated, swap_info_struct will never be completely freed,
> -        * so checking it's liveness by get_swap_device_info is enough.
> -        */
> -       si = this_cpu_read(percpu_swap_cluster.si[order]);
> -       offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> -       if (!si || !offset || !get_swap_device_info(si))
> -               return false;
> -
> -       ci = swap_cluster_lock(si, offset);
> -       if (cluster_is_usable(ci, order)) {
> -               if (cluster_is_empty(ci))
> -                       offset = cluster_offset(si, ci);
> -               alloc_swap_scan_cluster(si, ci, folio, offset);
> -       } else {
> -               swap_cluster_unlock(ci);
> -       }
> -
> -       put_swap_device(si);
> -       return folio_test_swapcache(folio);
> -}
> -
>  /* Rotate the device and switch to a new cluster */
> -static void swap_alloc_slow(struct folio *folio)
> +static void swap_alloc_entry(struct folio *folio)
>  {
>         struct swap_info_struct *si, *next;
>         int mask = folio_memcg(folio) ?
> @@ -1363,6 +1322,7 @@ static void swap_alloc_slow(struct folio *folio)
>                 if (get_swap_device_info(si)) {
>                         cluster_alloc_swap_entry(si, folio);
>                         put_swap_device(si);
> +
>                         if (folio_test_swapcache(folio))
>                                 return;
>                         if (folio_test_large(folio))
> @@ -1522,11 +1482,7 @@ int folio_alloc_swap(struct folio *folio)
>         }
>
>  again:
> -       local_lock(&percpu_swap_cluster.lock);
> -       if (!swap_alloc_fast(folio))
> -               swap_alloc_slow(folio);
> -       local_unlock(&percpu_swap_cluster.lock);
> -
> +       swap_alloc_entry(folio);
>         if (!order && unlikely(!folio_test_swapcache(folio))) {
>                 if (swap_sync_discard())
>                         goto again;
> @@ -1945,9 +1901,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
>                          * Grab the local lock to be compliant
>                          * with swap table allocation.
>                          */
> -                       local_lock(&percpu_swap_cluster.lock);
>                         offset = cluster_alloc_swap_entry(si, NULL);
> -                       local_unlock(&percpu_swap_cluster.lock);
>                         if (offset)
>                                 entry = swp_entry(si->type, offset);
>                 }
> @@ -2751,28 +2705,6 @@ static void free_cluster_info(struct swap_cluster_info *cluster_info,
>         kvfree(cluster_info);
>  }
>
> -/*
> - * Called after swap device's reference count is dead, so
> - * neither scan nor allocation will use it.
> - */
> -static void flush_percpu_swap_cluster(struct swap_info_struct *si)
> -{
> -       int cpu, i;
> -       struct swap_info_struct **pcp_si;
> -
> -       for_each_possible_cpu(cpu) {
> -               pcp_si = per_cpu_ptr(percpu_swap_cluster.si, cpu);
> -               /*
> -                * Invalidate the percpu swap cluster cache, si->users
> -                * is dead, so no new user will point to it, just flush
> -                * any existing user.
> -                */
> -               for (i = 0; i < SWAP_NR_ORDERS; i++)
> -                       cmpxchg(&pcp_si[i], si, NULL);
> -       }
> -}
> -
> -
>  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  {
>         struct swap_info_struct *p = NULL;
> @@ -2856,7 +2788,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>
>         flush_work(&p->discard_work);
>         flush_work(&p->reclaim_work);
> -       flush_percpu_swap_cluster(p);
>
>         destroy_swap_extents(p);
>         if (p->flags & SWP_CONTINUED)
> @@ -2885,6 +2816,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         arch_swap_invalidate_area(p->type);
>         zswap_swapoff(p->type);
>         mutex_unlock(&swapon_mutex);
> +       free_percpu(p->percpu_cluster);
> +       p->percpu_cluster = NULL;
>         kfree(p->global_cluster);
>         p->global_cluster = NULL;
>         vfree(swap_map);
> @@ -3268,7 +3201,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>  {
>         unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>         struct swap_cluster_info *cluster_info;
> -       int err = -ENOMEM;
> +       int cpu, err = -ENOMEM;
>         unsigned long i;
>
>         cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
> @@ -3278,14 +3211,27 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>         for (i = 0; i < nr_clusters; i++)
>                 spin_lock_init(&cluster_info[i].lock);
>
> -       if (!(si->flags & SWP_SOLIDSTATE)) {
> +       if (si->flags & SWP_SOLIDSTATE) {
> +               si->percpu_cluster = alloc_percpu(struct percpu_cluster);
> +               if (!si->percpu_cluster)
> +                       goto err;
> +
> +               for_each_possible_cpu(cpu) {
> +                       struct percpu_cluster *cluster;
> +
> +                       cluster = per_cpu_ptr(si->percpu_cluster, cpu);
> +                       for (i = 0; i < SWAP_NR_ORDERS; i++)
> +                               cluster->next[i] = SWAP_ENTRY_INVALID;
> +                       local_lock_init(&cluster->lock);
> +               }
> +       } else {
>                 si->global_cluster = kmalloc(sizeof(*si->global_cluster),
>                                      GFP_KERNEL);
>                 if (!si->global_cluster)
>                         goto err;
>                 for (i = 0; i < SWAP_NR_ORDERS; i++)
>                         si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
> -               spin_lock_init(&si->global_cluster_lock);
> +               spin_lock_init(&si->global_cluster->lock);
>         }
>
>         /*
> @@ -3566,6 +3512,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  bad_swap_unlock_inode:
>         inode_unlock(inode);
>  bad_swap:
> +       free_percpu(si->percpu_cluster);
> +       si->percpu_cluster = NULL;
>         kfree(si->global_cluster);
>         si->global_cluster = NULL;
>         inode = NULL;
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure
  2026-01-26  6:52 ` [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2026-02-12  9:07   ` Chris Li
  2026-02-13  2:18     ` YoungJun Park
  2026-02-13 14:33     ` YoungJun Park
  0 siblings, 2 replies; 23+ messages in thread
From: Chris Li @ 2026-02-12  9:07 UTC (permalink / raw)
  To: Youngjun Park
  Cc: Andrew Morton, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

Hi Youngjun,

On Sun, Jan 25, 2026 at 10:53 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This patch introduces the "Swap tier" concept, which serves as an
> abstraction layer for managing swap devices based on their performance
> characteristics (e.g., NVMe, HDD, Network swap).
>
> Swap tiers are user-named groups representing priority ranges.
> These tiers collectively cover the entire priority
> space from -1 (`DEF_SWAP_PRIO`) to `SHRT_MAX`.
>
> To configure tiers, a new sysfs interface is exposed at
> `/sys/kernel/mm/swap/tiers`. The input parser evaluates commands from
> left to right and supports batch input, allowing users to add, remove or
> modify multiple tiers in a single write operation.
>
> Tier management enforces continuous priority ranges anchored by start
> priorities. Operations trigger range splitting or merging, but overwriting
> start priorities is forbidden. Merging expands lower tiers upwards to
> preserve configured start priorities, except when removing `DEF_SWAP_PRIO`,
> which merges downwards.
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  MAINTAINERS     |   2 +
>  mm/Makefile     |   2 +-
>  mm/swap.h       |   4 +
>  mm/swap_state.c |  70 +++++++++++
>  mm/swap_tier.c  | 304 ++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/swap_tier.h  |  38 ++++++
>  mm/swapfile.c   |   7 +-
>  7 files changed, 423 insertions(+), 4 deletions(-)
>  create mode 100644 mm/swap_tier.c
>  create mode 100644 mm/swap_tier.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 18d1ebf053db..501bf46adfb4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16743,6 +16743,8 @@ F:      mm/swap.c
>  F:     mm/swap.h
>  F:     mm/swap_table.h
>  F:     mm/swap_state.c
> +F:     mm/swap_tier.c
> +F:     mm/swap_tier.h
>  F:     mm/swapfile.c
>
>  MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
> diff --git a/mm/Makefile b/mm/Makefile
> index 53ca5d4b1929..3b3de2de7285 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -75,7 +75,7 @@ ifdef CONFIG_MMU
>         obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
>  endif
>
> -obj-$(CONFIG_SWAP)     += page_io.o swap_state.o swapfile.o
> +obj-$(CONFIG_SWAP)     += page_io.o swap_state.o swapfile.o swap_tier.o
>  obj-$(CONFIG_ZSWAP)    += zswap.o
>  obj-$(CONFIG_HAS_DMA)  += dmapool.o
>  obj-$(CONFIG_HUGETLBFS)        += hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
> diff --git a/mm/swap.h b/mm/swap.h
> index bfafa637c458..55f230cbe4e7 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -16,6 +16,10 @@ extern int page_cluster;
>  #define swap_entry_order(order)        0
>  #endif
>
> +#define DEF_SWAP_PRIO  -1
> +
> +extern spinlock_t swap_lock;
> +extern struct plist_head swap_active_head;
>  extern struct swap_info_struct *swap_info[];
>
>  /*
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 6d0eef7470be..f1a7d9cdc648 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -25,6 +25,7 @@
>  #include "internal.h"
>  #include "swap_table.h"
>  #include "swap.h"
> +#include "swap_tier.h"
>
>  /*
>   * swapper_space is a fiction, retained to simplify the path through
> @@ -947,8 +948,77 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
>  }
>  static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
>
> +static ssize_t tiers_show(struct kobject *kobj,
> +                                    struct kobj_attribute *attr, char *buf)
> +{
> +       return swap_tiers_sysfs_show(buf);
> +}
> +
> +static ssize_t tiers_store(struct kobject *kobj,
> +                       struct kobj_attribute *attr,
> +                       const char *buf, size_t count)
> +{
> +       char *p, *token, *name, *tmp;
> +       int ret = 0;
> +       short prio;
> +       DEFINE_SWAP_TIER_SAVE_CTX(ctx);
> +
> +       tmp = kstrdup(buf, GFP_KERNEL);
> +       if (!tmp)
> +               return -ENOMEM;
> +
> +       spin_lock(&swap_lock);
> +       spin_lock(&swap_tier_lock);
> +
> +       p = tmp;
> +       swap_tiers_save(ctx);
> +
> +       while (!ret && (token = strsep(&p, ", \t\n")) != NULL) {
> +               if (!*token)
> +                       continue;
> +
> +               if (token[0] == '-') {
> +                       ret = swap_tiers_remove(token + 1);
> +               } else {
> +
> +                       name = strsep(&token, ":");
> +                       if (!token || kstrtos16(token, 10, &prio)) {
> +                               ret = -EINVAL;
> +                               goto out;
> +                       }
> +
> +                       if (name[0] == '+')
> +                               ret = swap_tiers_add(name + 1, prio);
> +                       else
> +                               ret = swap_tiers_modify(name, prio);
> +               }
> +
> +               if (ret)
> +                       goto restore;
> +       }

This function could use some simplification to make the indentation flatter.

> +
> +       if (!swap_tiers_validate()) {
> +               ret = -EINVAL;
> +               goto restore;
> +       }
> +
> +out:
> +       spin_unlock(&swap_tier_lock);
> +       spin_unlock(&swap_lock);
> +
> +       kfree(tmp);
> +       return ret ? ret : count;
> +
> +restore:
> +       swap_tiers_restore(ctx);
> +       goto out;
> +}
> +
> +static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
> +
>  static struct attribute *swap_attrs[] = {
>         &vma_ra_enabled_attr.attr,
> +       &tier_attr.attr,
>         NULL,
>  };
>
> diff --git a/mm/swap_tier.c b/mm/swap_tier.c
> new file mode 100644
> index 000000000000..87882272eec8
> --- /dev/null
> +++ b/mm/swap_tier.c
> @@ -0,0 +1,304 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/swap.h>
> +#include <linux/memcontrol.h>
> +#include "memcontrol-v1.h"
> +#include <linux/sysfs.h>
> +#include <linux/plist.h>
> +
> +#include "swap.h"
> +#include "swap_tier.h"
> +
> +/*
> + * struct swap_tier - structure representing a swap tier.
> + *
> + * @name: name of the swap_tier.
> + * @prio: starting value of priority.
> + * @list: linked list of tiers.
> +*/
> +static struct swap_tier {
> +       char name[MAX_TIERNAME];
> +       short prio;
> +       struct list_head list;
> +} swap_tiers[MAX_SWAPTIER];

We can have a CONFIG option for MAX_SWAPTIER. I think the default
should be a small number like 4.
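
For example, something like this in mm/Kconfig (illustrative only, the
symbol name is made up):

	config SWAP_TIER_MAX
		int "Maximum number of swap tiers"
		depends on SWAP
		range 2 8
		default 4

and then MAX_SWAPTIER in swap_tier.h maps to CONFIG_SWAP_TIER_MAX,
still capped by MAX_SWAPFILES as you do here.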

> +
> +DEFINE_SPINLOCK(swap_tier_lock);
> +/* active swap priority list, sorted in descending order */
> +static LIST_HEAD(swap_tier_active_list);
> +/* unused swap_tier object */
> +static LIST_HEAD(swap_tier_inactive_list);
> +
> +#define TIER_IDX(tier) ((tier) - swap_tiers)
> +#define TIER_MASK(tier)        (1 << TIER_IDX(tier))
> +#define TIER_INVALID_PRIO (DEF_SWAP_PRIO - 1)
> +#define TIER_END_PRIO(tier) \
> +       (!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
> +       list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
> +
> +#define for_each_tier(tier, idx) \
> +       for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
> +               idx++, tier = &swap_tiers[idx])
> +
> +#define for_each_active_tier(tier) \
> +       list_for_each_entry(tier, &swap_tier_active_list, list)
> +
> +#define for_each_inactive_tier(tier) \
> +       list_for_each_entry(tier, &swap_tier_inactive_list, list)
> +
> +/*
> + * Naming Convention:
> + *   swap_tiers_*() - Public/exported functions
> + *   swap_tier_*()  - Private/internal functions
> + */
> +
> +static bool swap_tier_is_active(void)
> +{
> +       return !list_empty(&swap_tier_active_list) ? true : false;
> +}
> +
> +static struct swap_tier *swap_tier_lookup(const char *name)
> +{
> +       struct swap_tier *tier;
> +
> +       for_each_active_tier(tier) {
> +               if (!strcmp(tier->name, name))
> +                       return tier;
> +       }
> +
> +       return NULL;
> +}
> +
> +void swap_tiers_init(void)
> +{
> +       struct swap_tier *tier;
> +       int idx;
> +
> +       BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
> +
> +       for_each_tier(tier, idx) {
> +               INIT_LIST_HEAD(&tier->list);
> +               list_add_tail(&tier->list, &swap_tier_inactive_list);
> +       }
> +}
> +
> +ssize_t swap_tiers_sysfs_show(char *buf)
> +{
> +       struct swap_tier *tier;
> +       ssize_t len = 0;
> +
> +       len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
> +                        "Name", "Idx", "PrioStart", "PrioEnd");
> +
> +       spin_lock(&swap_tier_lock);
> +       for_each_active_tier(tier) {
> +               len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
> +                                    tier->name,
> +                                    TIER_IDX(tier),
> +                                    tier->prio,
> +                                    TIER_END_PRIO(tier));
> +               if (len >= PAGE_SIZE)
> +                       break;
> +       }
> +       spin_unlock(&swap_tier_lock);
> +
> +       return len;
> +}
> +
> +static void swap_tier_insert_by_prio(struct swap_tier *new)
> +{
> +       struct swap_tier *tier;
> +
> +       for_each_active_tier(tier) {
> +               if (tier->prio > new->prio)
> +                       continue;
> +
> +               list_add_tail(&new->list, &tier->list);
> +               return;
> +       }
> +       /* First addition, or becomes the first tier */
> +       list_add_tail(&new->list, &swap_tier_active_list);
> +}
> +
> +static void __swap_tier_prepare(struct swap_tier *tier, const char *name,
> +       short prio)
> +{
> +       list_del_init(&tier->list);
> +       strscpy(tier->name, name, MAX_TIERNAME);
> +       tier->prio = prio;
> +}
> +
> +static struct swap_tier *swap_tier_prepare(const char *name, short prio)
> +{
> +       struct swap_tier *tier;
> +
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       if (prio < DEF_SWAP_PRIO)
> +               return NULL;
> +
> +       if (list_empty(&swap_tier_inactive_list))
> +               return ERR_PTR(-EPERM);
> +
> +       tier = list_first_entry(&swap_tier_inactive_list,
> +               struct swap_tier, list);
> +
> +       __swap_tier_prepare(tier, name, prio);
> +       return tier;
> +}
> +
> +static int swap_tier_check_range(short prio)
> +{
> +       struct swap_tier *tier;
> +
> +       lockdep_assert_held(&swap_lock);
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       for_each_active_tier(tier) {
> +               /* No overwrite */
> +               if (tier->prio == prio)
> +                       return -EINVAL;
> +       }
> +
> +       return 0;
> +}
> +
> +int swap_tiers_add(const char *name, int prio)

When we add, modify, or remove a tier, the simple case is that there is
no swap file under any tier. But if the modification causes some swap
files to jump to a different tier, that might be problematic.

> +{
> +       int ret;
> +       struct swap_tier *tier;
> +
> +       lockdep_assert_held(&swap_lock);
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       /* Duplicate check */
> +       if (swap_tier_lookup(name))
> +               return -EPERM;
> +
> +       ret = swap_tier_check_range(prio);
> +       if (ret)
> +               return ret;
> +
> +       tier = swap_tier_prepare(name, prio);
> +       if (IS_ERR(tier)) {
> +               ret = PTR_ERR(tier);
> +               return ret;
> +       }
> +
> +
> +       swap_tier_insert_by_prio(tier);
> +       return ret;
> +}
> +
> +int swap_tiers_remove(const char *name)
> +{
> +       int ret = 0;
> +       struct swap_tier *tier;
> +
> +       lockdep_assert_held(&swap_lock);
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       tier = swap_tier_lookup(name);
> +       if (!tier)
> +               return -EINVAL;
> +
> +       list_move(&tier->list, &swap_tier_inactive_list);
> +
> +       /* Removing DEF_SWAP_PRIO merges into the higher tier. */
> +       if (swap_tier_is_active() && tier->prio == DEF_SWAP_PRIO)
> +               list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
> +
> +       return ret;
> +}
> +
> +int swap_tiers_modify(const char *name, int prio)
> +{
> +       int ret;
> +       struct swap_tier *tier;
> +
> +       lockdep_assert_held(&swap_lock);
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       tier = swap_tier_lookup(name);
> +       if (!tier)
> +               return -EINVAL;
> +
> +       /* No need to modify */
> +       if (tier->prio == prio)
> +               return 0;
> +
> +       ret = swap_tier_check_range(prio);
> +       if (ret)
> +               return ret;
> +
> +       list_del_init(&tier->list);
> +       tier->prio = prio;
> +       swap_tier_insert_by_prio(tier);
> +
> +       return ret;
> +}
> +
> +/*
> + * XXX: Reverting individual operations becomes complex as the number of
> + * operations grows. Instead, we save the original state beforehand and
> + * fully restore it if any operation fails.
> + */
> +void swap_tiers_save(struct swap_tier_save_ctx ctx[])

I really hope we don't have to do the save-and-restore thing. Is there
another design that would simplify this?

> +{
> +       struct swap_tier *tier;
> +       int idx;
> +
> +       lockdep_assert_held(&swap_lock);
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       for_each_active_tier(tier) {
> +               idx = TIER_IDX(tier);
> +               strcpy(ctx[idx].name, tier->name);
> +               ctx[idx].prio = tier->prio;
> +       }
> +
> +       for_each_inactive_tier(tier) {
> +               idx = TIER_IDX(tier);
> +               /* Indicator of inactive */
> +               ctx[idx].prio = TIER_INVALID_PRIO;
> +       }
> +}
> +
> +void swap_tiers_restore(struct swap_tier_save_ctx ctx[])
> +{
> +       struct swap_tier *tier;
> +       int idx;
> +
> +       lockdep_assert_held(&swap_lock);
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       /* Invalidate active list */
> +       list_splice_tail_init(&swap_tier_active_list,
> +                       &swap_tier_inactive_list);
> +
> +       for_each_tier(tier, idx) {
> +               if (ctx[idx].prio != TIER_INVALID_PRIO) {
> +                       /* Preserve idx(mask) */
> +                       __swap_tier_prepare(tier, ctx[idx].name, ctx[idx].prio);
> +                       swap_tier_insert_by_prio(tier);
> +               }
> +       }
> +}
> +
> +bool swap_tiers_validate(void)
> +{
> +       struct swap_tier *tier;
> +
> +       /*
> +        * Initial setting might not cover DEF_SWAP_PRIO.
> +        * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
> +        * Also, modify operation can change only one remaining priority.
> +        */
> +       if (swap_tier_is_active()) {
> +               tier = list_last_entry(&swap_tier_active_list,
> +                       struct swap_tier, list);
> +
> +               if (tier->prio != DEF_SWAP_PRIO)
> +                       return false;
> +       }
> +
> +       return true;
> +}
> diff --git a/mm/swap_tier.h b/mm/swap_tier.h
> new file mode 100644
> index 000000000000..4b1b0602d691
> --- /dev/null
> +++ b/mm/swap_tier.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SWAP_TIER_H
> +#define _SWAP_TIER_H
> +
> +#include <linux/types.h>
> +#include <linux/spinlock.h>
> +
> +#define MAX_TIERNAME           16
> +
> +/* Ensure MAX_SWAPTIER does not exceed MAX_SWAPFILES */
> +#if 8 > MAX_SWAPFILES
> +#define MAX_SWAPTIER           MAX_SWAPFILES
> +#else
> +#define MAX_SWAPTIER           8
> +#endif
> +
> +extern spinlock_t swap_tier_lock;
> +
> +struct swap_tier_save_ctx {
> +       char name[MAX_TIERNAME];
> +       short prio;
> +};
> +
> +#define DEFINE_SWAP_TIER_SAVE_CTX(_name) \
> +       struct swap_tier_save_ctx _name[MAX_SWAPTIER] = {0}
> +
> +/* Initialization and application */
> +void swap_tiers_init(void);
> +ssize_t swap_tiers_sysfs_show(char *buf);
> +
> +int swap_tiers_add(const char *name, int prio);
> +int swap_tiers_remove(const char *name);
> +int swap_tiers_modify(const char *name, int prio);
> +
> +void swap_tiers_save(struct swap_tier_save_ctx ctx[]);
> +void swap_tiers_restore(struct swap_tier_save_ctx ctx[]);
> +bool swap_tiers_validate(void);
> +#endif /* _SWAP_TIER_H */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 7b055f15d705..c27952b41d4f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -50,6 +50,7 @@
>  #include "internal.h"
>  #include "swap_table.h"
>  #include "swap.h"
> +#include "swap_tier.h"
>
>  static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
>                                  unsigned char);
> @@ -65,7 +66,7 @@ static void move_cluster(struct swap_info_struct *si,
>                          struct swap_cluster_info *ci, struct list_head *list,
>                          enum swap_cluster_flags new_flags);
>
> -static DEFINE_SPINLOCK(swap_lock);
> +DEFINE_SPINLOCK(swap_lock);
>  static unsigned int nr_swapfiles;
>  atomic_long_t nr_swap_pages;
>  /*
> @@ -76,7 +77,6 @@ atomic_long_t nr_swap_pages;
>  EXPORT_SYMBOL_GPL(nr_swap_pages);
>  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
>  long total_swap_pages;
> -#define DEF_SWAP_PRIO  -1
>  unsigned long swapfile_maximum_size;
>  #ifdef CONFIG_MIGRATION
>  bool swap_migration_ad_supported;
> @@ -89,7 +89,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
>   * all active swap_info_structs
>   * protected with swap_lock, and ordered by priority.
>   */
> -static PLIST_HEAD(swap_active_head);
> +PLIST_HEAD(swap_active_head);

One idea is to make each tier have its own swap_active_head, so swap
entry releases on different tiers don't need to compete on the same
swap_active_head.

That would require that swap files don't jump to other tiers.
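
Roughly (untested sketch, extending the struct swap_tier from this
patch; the active_head member is made up for illustration):

	static struct swap_tier {
		char name[MAX_TIERNAME];
		short prio;
		struct list_head list;
		struct plist_head active_head;	/* active devices in this tier */
	} swap_tiers[MAX_SWAPTIER];

so each tier keeps its own plist of active swap devices rather than
everything contending on one global swap_active_head.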

Chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-12  6:12 ` [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
@ 2026-02-12  9:22   ` Chris Li
  2026-02-13  2:26     ` YoungJun Park
  2026-02-13  1:59   ` YoungJun Park
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Li @ 2026-02-12  9:22 UTC (permalink / raw)
  To: Youngjun Park
  Cc: Andrew Morton, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Wed, Feb 11, 2026 at 10:12 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Youngjun,
>
> On Sun, Jan 25, 2026 at 10:53 PM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > This is the second version of the RFC for the "Swap Tiers" concept.
> > Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
> >
> > This version incorporates feedback received during LPC 2025 and addresses
> > comments from the previous review. We have also included experimental
> > results based on usage scenarios intended for our internal platforms.
>
> Thanks for the patches series.
>
> Sorry for the late reply. I have been wanting to reply to it but get
> super busy at work.
>
> Some high level feedback for the series. Now that you demonstrated the
> whole series, let's focus on making small mergiable baby steps. Just
> like the swap table has different phases. Make each step minimal, each
> step shows some value. Do the MVP, we can always add more features as
> a follow up step.
>
> I suggest the first step is getting the tiers bits defined. Add only,
> no delete.  Get that reviewed and merged, then the next step is to use
> those tiers.

I just took a quick look at the series, and I take that suggestion
back. This series is actually not too long. Adding the tier names alone
does not add any real value; I actually need to look at the whole
series rather than just the tier names.

Chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (5 preceding siblings ...)
  2026-02-12  6:12 ` [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
@ 2026-02-12 17:57 ` Nhat Pham
  2026-02-12 17:58   ` Nhat Pham
  2026-02-13  2:43   ` YoungJun Park
  2026-02-12 18:33 ` Shakeel Butt
  7 siblings, 2 replies; 23+ messages in thread
From: Nhat Pham @ 2026-02-12 17:57 UTC (permalink / raw)
  To: Youngjun Park
  Cc: Andrew Morton, linux-mm, Chris Li, Kairui Song, Kemeng Shi,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Sun, Jan 25, 2026 at 10:53 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This is the second version of the RFC for the "Swap Tiers" concept.
> Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
>
> This version incorporates feedback received during LPC 2025 and addresses
> comments from the previous review. We have also included experimental
> results based on usage scenarios intended for our internal platforms.
>
> Motivation & Concept recap
> ==========================
> Current Linux swap allocation is global, limiting the ability to assign
> faster devices to specific cgroups. Our initial attempt at per-cgroup
> priorities proved over-engineered and caused LRU inversion.
>
> Following Chris Li's suggestion, we pivoted to "Swap Tiers." A tier is
> simply a user-named group of swap devices sharing the same priority range.
> This abstraction facilitates swap device selection based on speed, allowing
> users to configure specific tiers for cgroups.
>
> For more details, please refer to the LPC 2025 presentation
> https://lpc.events/event/19/contributions/2141/attachments/1857/3998/LPC2025Finalss.pdf
> or v1 patch.
>
> Changes in v2
> =============
> 1. Respect cgroup hierarchy principle (LPC 2025 feedback)
> - The logic now strictly follows standard cgroup hierarchy principles.
>
> Previous: Children could select any tier using "+" regardless of the
> parent's configuration. "+" tier is referenced. (could not be silently disappeared)
>
> Current: The explicit selection ("+") concept is removed. By
> default, all tiers are selected. Users now use "-" to exclude specific
> tiers. Excluded tier could disappeared silently.
> A child cgroup is always a subset of its parent. Even if a child
> re-enables a tier with "+" that was excluded by the parent, the effective
> tier list is limited to the parent's allowed subset.

This comment seems a bit clunky to me. The "+" is removed, as noted
above, but then why are we saying "even if a child re-enables a tier
with "+"" here? Am I missing something?

>
> Example:
> Global Tiers: SSD, HDD, NET
> Parent: SSD, NET (HDD excluded)
> Child: HDD, NET (SSD excluded)
> -> Effective Child Tier: NET (Intersection of Parent and Child)

But otherwise, I assume you mean to restrict child's allowed swap
tiers to be a subset of children and its ancestors? That seems more
straightforward to me than the last system :)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-12 17:57 ` Nhat Pham
@ 2026-02-12 17:58   ` Nhat Pham
  2026-02-13  2:43   ` YoungJun Park
  1 sibling, 0 replies; 23+ messages in thread
From: Nhat Pham @ 2026-02-12 17:58 UTC (permalink / raw)
  To: Youngjun Park
  Cc: Andrew Morton, linux-mm, Chris Li, Kairui Song, Kemeng Shi,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Thu, Feb 12, 2026 at 9:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
>
> But otherwise, I assume you mean to restrict child's allowed swap
> tiers to be a subset of children and its ancestors? That seems more

s/children/parent


> straightforward to me than the last system :)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (6 preceding siblings ...)
  2026-02-12 17:57 ` Nhat Pham
@ 2026-02-12 18:33 ` Shakeel Butt
  2026-02-13  3:58   ` YoungJun Park
  7 siblings, 1 reply; 23+ messages in thread
From: Shakeel Butt @ 2026-02-12 18:33 UTC (permalink / raw)
  To: Youngjun Park
  Cc: Andrew Morton, linux-mm, Chris Li, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, gunho.lee, taejoon.song, austin.kim

Hi Youngjun,

On Mon, Jan 26, 2026 at 03:52:37PM +0900, Youngjun Park wrote:
> This is the second version of the RFC for the "Swap Tiers" concept.
> Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
> 
> This version incorporates feedback received during LPC 2025 and addresses
> comments from the previous review. We have also included experimental
> results based on usage scenarios intended for our internal platforms.
> 
> Motivation & Concept recap
> ==========================
> Current Linux swap allocation is global, limiting the ability to assign
> faster devices to specific cgroups. Our initial attempt at per-cgroup
> priorities proved over-engineered and caused LRU inversion.
> 
> Following Chris Li's suggestion, we pivoted to "Swap Tiers." A tier is
> simply a user-named group of swap devices sharing the same priority range.
> This abstraction facilitates swap device selection based on speed, allowing
> users to configure specific tiers for cgroups.
> 
> For more details, please refer to the LPC 2025 presentation
> https://lpc.events/event/19/contributions/2141/attachments/1857/3998/LPC2025Finalss.pdf
> or v1 patch.
> 

One piece of the LPC feedback you missed is to not add a memcg
interface for this functionality and to explore a BPF approach instead.

We are normally very conservative about adding new interfaces to
cgroup. However, I am not even convinced that a memcg interface is the
right way to expose this functionality. Swap is currently global, and
the idea of limiting or assigning specific swap devices to specific
cgroups makes sense, but that is a decision for the job orchestrator or
node controller. Allowing workloads to pick and choose swap devices
does not make sense to me.

Shakeel



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-12  6:12 ` [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
  2026-02-12  9:22   ` Chris Li
@ 2026-02-13  1:59   ` YoungJun Park
  1 sibling, 0 replies; 23+ messages in thread
From: YoungJun Park @ 2026-02-13  1:59 UTC (permalink / raw)
  To: Chris Li
  Cc: Andrew Morton, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Wed, Feb 11, 2026 at 10:12:04PM -0800, Chris Li wrote:
> Hi Youngjun,
> 
> On Sun, Jan 25, 2026 at 10:53 PM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > This is the second version of the RFC for the "Swap Tiers" concept.
> > Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
> >
> > This version incorporates feedback received during LPC 2025 and addresses
> > comments from the previous review. We have also included experimental
> > results based on usage scenarios intended for our internal platforms.
> 
> Thanks for the patches series.
> 
> Sorry for the late reply. I have been wanting to reply to it but get
> super busy at work.
> 
> Some high level feedback for the series. Now that you demonstrated the
> whole series, let's focus on making small mergiable baby steps. Just
> like the swap table has different phases. Make each step minimal, each
> step shows some value. Do the MVP, we can always add more features as
> a follow up step.
> 
> I suggest the first step is getting the tiers bits defined. Add only,
> no delete.  Get that reviewed and merged, then the next step is to use
> those tiers.
> 
> Chris

Hi Chris,

Thank you for the direction.

I agree that breaking the series into smaller, mergeable steps is the
right approach. However, since introducing the definitions alone might
lack an immediate use, I propose a slightly modified roadmap to ensure
Step 1 demonstrates some value.

Here is the plan I have in mind.

1. Swap Tier Definition & Addition
   - Introduce the concept, grouping logic, and the 'add' interface.
   - Value: Enables basic exception handling within the swap device
     itself using tiers.

2. Advanced Control (Delete/Modify)
   - Implement logic to remove or update tiers.
   - Value: Enhances the usability and management of the tiers
     established in Step 1.

3. External Integration (memcg, bpf etc ... )
   - Apply swap tiers for broader swap control.
   - Value: Connects swap tiers to other subsystems like memcg.

Does this roadmap look reasonable to you? I will proceed with preparing
the real patch series based on this structure.

Best regards,
Youngjun


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure
  2026-02-12  9:07   ` Chris Li
@ 2026-02-13  2:18     ` YoungJun Park
  2026-02-13 14:33     ` YoungJun Park
  1 sibling, 0 replies; 23+ messages in thread
From: YoungJun Park @ 2026-02-13  2:18 UTC (permalink / raw)
  To: Chris Li
  Cc: Andrew Morton, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Thu, Feb 12, 2026 at 01:07:45AM -0800, Chris Li wrote:
> > +
> > +       spin_lock(&swap_lock);
> > +       spin_lock(&swap_tier_lock);
> > +
> > +       p = tmp;
> > +       swap_tiers_save(ctx);
> > +
> > +       while (!ret && (token = strsep(&p, ", \t\n")) != NULL) {
> > +               if (!*token)
> > +                       continue;
> > +
> > +               if (token[0] == '-') {
> > +                       ret = swap_tiers_remove(token + 1);
> > +               } else {
> > +
> > +                       name = strsep(&token, ":");
> > +                       if (!token || kstrtos16(token, 10, &prio)) {
> > +                               ret = -EINVAL;
> > +                               goto out;
> > +                       }
> > +
> > +                       if (name[0] == '+')
> > +                               ret = swap_tiers_add(name + 1, prio);
> > +                       else
> > +                               ret = swap_tiers_modify(name, prio);
> > +               }
> > +
> > +               if (ret)
> > +                       goto restore;
> > +       }
> 
> This function can use some simplification to make the indentation flater.

Agreed. I will refactor this to flatten the indentation.
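
As a minimal sketch of the direction I have in mind (the helper name is
illustrative, not final): factor the per-token work into a helper so the
main loop stays one level deep.

static int swap_tiers_apply_token(char *token)
{
	char *name;
	s16 prio;

	if (token[0] == '-')
		return swap_tiers_remove(token + 1);

	name = strsep(&token, ":");
	if (!token || kstrtos16(token, 10, &prio))
		return -EINVAL;

	if (name[0] == '+')
		return swap_tiers_add(name + 1, prio);

	return swap_tiers_modify(name, prio);
}

The store handler loop then only calls swap_tiers_apply_token() for each
non-empty token and jumps to the restore path on error.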


> > +/*
> > + * struct swap_tier - structure representing a swap tier.
> > + *
> > + * @name: name of the swap_tier.
> > + * @prio: starting value of priority.
> > + * @list: linked list of tiers.
> > +*/
> > +static struct swap_tier {
> > +       char name[MAX_TIERNAME];
> > +       short prio;
> > +       struct list_head list;
> > +} swap_tiers[MAX_SWAPTIER];
> 
> We can have a CONFIG option for the MAX_SWAPTIER. I think the default
> should be a small number like 4.

Sounds good. I will add a CONFIG option for it and ensure the value doesn't
exceed MAX_SWAPFILES.
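
Roughly something like the following. This is only a sketch; the option
name, range, and default here are illustrative, not final:

config SWAP_TIER_MAX
	int "Maximum number of swap tiers"
	depends on SWAP
	range 1 16
	default 4
	help
	  Upper bound on the number of user-defined swap tiers.
	  Kept well below MAX_SWAPFILES.

with MAX_SWAPTIER then simply becoming:

#define MAX_SWAPTIER	CONFIG_SWAP_TIER_MAX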


> > +
> > +/*
> > + * XXX: Reverting individual operations becomes complex as the number of
> > + * operations grows. Instead, we save the original state beforehand and
> > + * fully restore it if any operation fails.
> > + */
> > +void swap_tiers_save(struct swap_tier_save_ctx ctx[])
> 
> I really hope we don't have to do the save and restore thing. Is there
> another design we can simplify this?

I have given this a lot of thought.

Since the current interface allows mixing add (+), remove (-), and modify
operations, we must either restore from a saved state or reverse the
successful individual operations upon failure.

I implemented both approaches and concluded that reversing individual
operations is error-prone. Also, it could be slow if there are many
operations.

Another approach could be a "global clone tier" strategy: apply all
operations to a clone of the tier table and commit it only when everything
succeeds (feasible because tier operations are globally serialized).

Therefore, I would like to propose restricting the interface to handle a
single operation at a time. What do you think?
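
For reference, the single-operation variant would only be a small change in
the store handler. A rough sketch, reusing the swap_tiers_apply_token()
helper sketched above (locking and cleanup elided, names illustrative):

	token = strsep(&p, ", \t\n");
	if (!token || !*token)
		return -EINVAL;

	/*
	 * Reject anything beyond the first operation, so a failure can
	 * never leave earlier, already-applied operations to undo.
	 */
	while ((extra = strsep(&p, ", \t\n")) != NULL) {
		if (*extra)
			return -EINVAL;
	}

	ret = swap_tiers_apply_token(token);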

> > @@ -76,7 +77,6 @@ atomic_long_t nr_swap_pages;
> >  EXPORT_SYMBOL_GPL(nr_swap_pages);
> >  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> >  long total_swap_pages;
> > -#define DEF_SWAP_PRIO  -1
> >  unsigned long swapfile_maximum_size;
> >  #ifdef CONFIG_MIGRATION
> >  bool swap_migration_ad_supported;
> > @@ -89,7 +89,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
> >   * all active swap_info_structs
> >   * protected with swap_lock, and ordered by priority.
> >   */
> > -static PLIST_HEAD(swap_active_head);
> > +PLIST_HEAD(swap_active_head);
> 
> One idea is to make each tier have swap_active_head. So different swap
> entry releases on different tiers don't need to be competing on the
> same swap_active_head.
> 
> That will require the swapfile don't jump to another tiers.
>
I agree. With the tier structure, we can limit contention to objects within
the same tier.

I also think swap_avail_list could be optimized in a similar way in the
future.
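
As a rough sketch of that direction (the extra field is illustrative, not
the actual patch code):

static struct swap_tier {
	char			name[MAX_TIERNAME];
	short			prio;
	struct list_head	list;
	struct plist_head	active_head;	/* devices in this tier */
} swap_tiers[MAX_SWAPTIER];

A swap device's release path would then only take its own tier's
active_head, and the global walk becomes a short walk over the tiers.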

Youngjun


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-12  9:22   ` Chris Li
@ 2026-02-13  2:26     ` YoungJun Park
  0 siblings, 0 replies; 23+ messages in thread
From: YoungJun Park @ 2026-02-13  2:26 UTC (permalink / raw)
  To: Chris Li
  Cc: Andrew Morton, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Thu, Feb 12, 2026 at 01:22:04AM -0800, Chris Li wrote:
> On Wed, Feb 11, 2026 at 10:12 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Youngjun,
> >
> > On Sun, Jan 25, 2026 at 10:53 PM Youngjun Park <youngjun.park@lge.com> wrote:
> > >
> > > This is the second version of the RFC for the "Swap Tiers" concept.
> > > Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
> > >
> > > This version incorporates feedback received during LPC 2025 and addresses
> > > comments from the previous review. We have also included experimental
> > > results based on usage scenarios intended for our internal platforms.
> >
> > Thanks for the patches series.
> >
> > Sorry for the late reply. I have been wanting to reply to it but get
> > super busy at work.
> >
> > Some high level feedback for the series. Now that you demonstrated the
> > whole series, let's focus on making small mergiable baby steps. Just
> > like the swap table has different phases. Make each step minimal, each
> > step shows some value. Do the MVP, we can always add more features as
> > a follow up step.
> >
> > I suggest the first step is getting the tiers bits defined. Add only,
> > no delete.  Get that reviewed and merged, then the next step is to use
> > those tiers.
> 
> Just take a quick look at the series. I take that suggestion back.
> This series is actually not too long. Adding the tiers name alone does
> not add any real value. I actually need to look at the whole series
> rather than just the tier name alone.
> 
> Chris

Oops, I replied to your previous email before seeing this one.

Stripping out the remove/modify parts is also feasible. Do you agree with
that direction?

Youngjun


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-12 17:57 ` Nhat Pham
  2026-02-12 17:58   ` Nhat Pham
@ 2026-02-13  2:43   ` YoungJun Park
  1 sibling, 0 replies; 23+ messages in thread
From: YoungJun Park @ 2026-02-13  2:43 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, linux-mm, Chris Li, Kairui Song, Kemeng Shi,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Thu, Feb 12, 2026 at 09:57:40AM -0800, Nhat Pham wrote:
> On Sun, Jan 25, 2026 at 10:53 PM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > This is the second version of the RFC for the "Swap Tiers" concept.
> > Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
> >
> > This version incorporates feedback received during LPC 2025 and addresses
> > comments from the previous review. We have also included experimental
> > results based on usage scenarios intended for our internal platforms.
> >
> > Motivation & Concept recap
> > ==========================
> > Current Linux swap allocation is global, limiting the ability to assign
> > faster devices to specific cgroups. Our initial attempt at per-cgroup
> > priorities proved over-engineered and caused LRU inversion.
> >
> > Following Chris Li's suggestion, we pivoted to "Swap Tiers." A tier is
> > simply a user-named group of swap devices sharing the same priority range.
> > This abstraction facilitates swap device selection based on speed, allowing
> > users to configure specific tiers for cgroups.
> >
> > For more details, please refer to the LPC 2025 presentation
> > https://lpc.events/event/19/contributions/2141/attachments/1857/3998/LPC2025Finalss.pdf
> > or v1 patch.
> >
> > Changes in v2
> > =============
> > 1. Respect cgroup hierarchy principle (LPC 2025 feedback)
> > - The logic now strictly follows standard cgroup hierarchy principles.
> >
> > Previous: Children could select any tier using "+" regardless of the
> > parent's configuration. "+" tier is referenced. (could not be silently disappeared)
> >
> > Current: The explicit selection ("+") concept is removed. By
> > default, all tiers are selected. Users now use "-" to exclude specific
> > tiers. Excluded tier could disappeared silently.
> > A child cgroup is always a subset of its parent. Even if a child
> > re-enables a tier with "+" that was excluded by the parent, the effective
> > tier list is limited to the parent's allowed subset.
> 
> This comment seems a bit clunky to me. The "+" is removed, as noted
> above, but then why are we saying "even if a child re-enables a tier
> with "+"" here? Am I missing something?

To clarify: previously, the default state used all tiers, and using "+"
switched to an exclusive mode where only the specified tier was used.

I am changing this to a subtraction-based model. By default, all tiers
are selected, and users use "-" to exclude specific ones.
(So perhaps "changed" rather than "removed" is the more accurate wording?)

In this context, I intended "+" to be used to restore a tier that was
previously excluded by "-".

> >
> > Example:
> > Global Tiers: SSD, HDD, NET
> > Parent: SSD, NET (HDD excluded)
> > Child: HDD, NET (SSD excluded)
> > -> Effective Child Tier: NET (Intersection of Parent and Child)
> 
> But otherwise, I assume you mean to restrict child's allowed swap
> tiers to be a subset of children and its ancestors? 

Exactly.

> That seems more
> straightforward to me than the last system :)

Yes, that's right :)
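
For concreteness, the effective set is just an intersection walking up the
hierarchy. A minimal sketch, assuming a per-memcg swap_tier_mask bitmap
(the field name is illustrative, not the actual patch code):

static unsigned long memcg_effective_swap_tiers(struct mem_cgroup *memcg)
{
	unsigned long mask = memcg->swap_tier_mask;	/* assumed field */

	for (memcg = parent_mem_cgroup(memcg); memcg;
	     memcg = parent_mem_cgroup(memcg))
		mask &= memcg->swap_tier_mask;

	return mask;
}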

Thanks 
Youngjun Park.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-12 18:33 ` Shakeel Butt
@ 2026-02-13  3:58   ` YoungJun Park
  2026-02-21  3:47     ` Shakeel Butt
  0 siblings, 1 reply; 23+ messages in thread
From: YoungJun Park @ 2026-02-13  3:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, linux-mm, Chris Li, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, gunho.lee, taejoon.song, austin.kim

On Thu, Feb 12, 2026 at 10:33:22AM -0800, Shakeel Butt wrote:
> Hi Youngjun,
> 
> On Mon, Jan 26, 2026 at 03:52:37PM +0900, Youngjun Park wrote:
> > This is the second version of the RFC for the "Swap Tiers" concept.
> > Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
> > 
> > This version incorporates feedback received during LPC 2025 and addresses
> > comments from the previous review. We have also included experimental
> > results based on usage scenarios intended for our internal platforms.
> > 
> > Motivation & Concept recap
> > ==========================
> > Current Linux swap allocation is global, limiting the ability to assign
> > faster devices to specific cgroups. Our initial attempt at per-cgroup
> > priorities proved over-engineered and caused LRU inversion.
> > 
> > Following Chris Li's suggestion, we pivoted to "Swap Tiers." A tier is
> > simply a user-named group of swap devices sharing the same priority range.
> > This abstraction facilitates swap device selection based on speed, allowing
> > users to configure specific tiers for cgroups.
> > 
> > For more details, please refer to the LPC 2025 presentation
> > https://lpc.events/event/19/contributions/2141/attachments/1857/3998/LPC2025Finalss.pdf
> > or v1 patch.
> > 
> 
> One of the LPC feedback you missed is to not add memcg interface for
> this functionality and explore BPF way instead.
> 
> We are normally very conservative to add new interfaces to cgroup.
> However I am not even convinced that memcg interface is the right way to
> expose this functionality. Swap is currently global and the idea to
> limit or assign specific swap devices to specific cgroups makes sense
> but that is the decision for the job orchestator or node controller.
> Allowing workloads to pick and choose swap devices do not make sense to
> me.

Apologies for overlooking the feedback regarding the BPF approach. Thank you
for the suggestion.

I agree that using BPF would provide greater flexibility, allowing control not
just at the memcg level, but also per-process or for complex workloads
(e.g., driven by a job orchestrator or node controller).

However, I am concerned that this level of freedom might introduce logical
contradictions, particularly regarding cgroup hierarchy semantics.

For example, BPF might allow a topology that violates hierarchical constraints
(a concern that was also touched upon during LPC):

  - Group A (Parent): Assigned to SSD1
  - Group B (Child of A): Assigned to SSD2

If Group A has a `memory.swap.max` limit, and Group B swaps out to SSD2, it
creates a consistency issue. Group B consumes Group A's swap quota, but it is
utilizing a device (SSD2) that is distinct from the Parent's assignment. This
could lead to situations where the Parent's limit is exhausted by usage on a
device it effectively doesn't "own" or shouldn't be using.

One might suggest restricting BPF to strictly adhere to these hierarchical
constraints. However, doing so would effectively eliminate the primary
advantage of using BPF—its flexibility. If we are to enforce standard cgroup
semantics anyway, a native interface seems more appropriate than a constrained
BPF hook.

Beyond this specific example, I suspect that delegating this logic to BPF
might introduce other unforeseen edge cases regarding hierarchy enforcement.
In my view, the BPF approach seems more like a "next step."

Since you acknowledged that the idea of assigning swap devices to cgroups
"makes sense," I believe implementing this within the standard, strictly
constrained "cgroup land" is preferable. 

A strict cgroup interface ensures
that hierarchy and accounting rules are consistently enforced, avoiding the
potential conflicts that the unrestricted freedom of BPF might create.

Ultimately, I hope this swap tier mechanism can serve as a foundation to be
leveraged by other subsystems, such as BPF and DAMON. I view this proposal as
the necessary first step toward that future.

Youngjun Park


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure
  2026-02-12  9:07   ` Chris Li
  2026-02-13  2:18     ` YoungJun Park
@ 2026-02-13 14:33     ` YoungJun Park
  1 sibling, 0 replies; 23+ messages in thread
From: YoungJun Park @ 2026-02-13 14:33 UTC (permalink / raw)
  To: Chris Li
  Cc: Andrew Morton, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, gunho.lee,
	taejoon.song, austin.kim

On Thu, Feb 12, 2026 at 01:07:45AM -0800, Chris Li wrote:
> > +}
> > +
> > +int swap_tiers_add(const char *name, int prio)
> 
> When we add, modify, remove a tier. The simple case is there is no
> swap file under any tiers.
> But if the modification causes some swap files to jump to different
> tiers. That might be problematic.

I missed one comment. 

The tier of an existing swapfile is immutable once assigned at swapon time.
I removed the tier reference counting; instead, each operation validates the
tier ranges at operation time to guarantee this invariant.

- add:    Does not change existing swapfiles' tiers. A new tier may
          split a priority range, but existing assignments stay.
- remove: Rejected with -EBUSY if any swapfile is attached.
- modify: Rejected if the change would cause any swapfile to
          move to a different tier.

So swapfiles never jump between tiers at runtime.
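
A minimal sketch of the remove-time check (swap_tier_find() and
swap_tier_has_devices() are illustrative helper names, not the exact
patch code):

static int swap_tiers_remove(const char *name)
{
	struct swap_tier *tier = swap_tier_find(name);

	if (!tier)
		return -ENOENT;

	/* Reject removal while any swap device still maps into this tier. */
	if (swap_tier_has_devices(tier))
		return -EBUSY;

	list_del_init(&tier->list);
	return 0;
}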

Youngjun Park


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-13  3:58   ` YoungJun Park
@ 2026-02-21  3:47     ` Shakeel Butt
  2026-02-21  6:07       ` Chris Li
  2026-02-21 14:30       ` YoungJun Park
  0 siblings, 2 replies; 23+ messages in thread
From: Shakeel Butt @ 2026-02-21  3:47 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Andrew Morton, linux-mm, Chris Li, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, gunho.lee, taejoon.song, austin.kim

Please don't send a new version of the series before concluding the discussion
on the previous one.

On Fri, Feb 13, 2026 at 12:58:40PM +0900, YoungJun Park wrote:
> > 
> > One of the LPC feedback you missed is to not add memcg interface for
> > this functionality and explore BPF way instead.
> > 
> > We are normally very conservative to add new interfaces to cgroup.
> > However I am not even convinced that memcg interface is the right way to
> > expose this functionality. Swap is currently global and the idea to
> > limit or assign specific swap devices to specific cgroups makes sense
> > but that is the decision for the job orchestator or node controller.
> > Allowing workloads to pick and choose swap devices do not make sense to
> > me.
> 
> Apologies for overlooking the feedback regarding the BPF approach. Thank you
> for the suggestion.

No need for apologies. These things take time and multiple iterations.

> 
> I agree that using BPF would provide greater flexibility, allowing control not
> just at the memcg level, but also per-process or for complex workloads.
> (As like orchestrator and node controller)

Yes, it provides the flexibility, but that is not the main reason I am pushing
for it. The reason is that I want you to first try the BPF approach without
introducing any stable interfaces: show how swap tiers will be used and
configured in a production environment, and then we can talk about whether a
stable interface is needed. I am still not convinced that swap tiers need to
be controlled hierarchically and that non-root cgroups should be able to
control them.

> 
> However, I am concerned that this level of freedom might introduce logical
> contradictions, particularly regarding cgroup hierarchy semantics.
> 
> For example, BPF might allow a topology that violates hierarchical constraints
> (a concern that was also touched upon during LPC)

Yes, BPF provides more power, but it is controlled by the admin, and admins
can shoot themselves in the foot in multiple ways.

> 
>   - Group A (Parent): Assigned to SSD1
>   - Group B (Child of A): Assigned to SSD2
> 
> If Group A has a `memory.swap.max` limit, and Group B swaps out to SSD2, it
> creates a consistency issue. Group B consumes Group A's swap quota, but it is
> utilizing a device (SSD2) that is distinct from the Parent's assignment. This
> could lead to situations where the Parent's limit is exhausted by usage on a
> device it effectively doesn't "own" or shouldn't be using.
> 
> One might suggest restricting BPF to strictly adhere to these hierarchical
> constraints.

No need to constrain anything.

Taking a step back, can you describe your use-case a bit more and share
requirements?

You have multiple swap devices with different properties and you want to
assign those swap devices to different workloads. Now a couple of questions:

1. If more than one device is assigned to a workload, do you want to have
   some kind of ordering between them for the workload, or do you want the
   option of a round-robin kind of policy?

2. What's the reason to use 'tiers' in the name? Is it similar to memory tiers
   and you want promotion/demotion among the tiers?

3. If a workload has multiple swap devices assigned, can you describe the
   scenario where such workloads need to partition/divide the given devices
   among their sub-workloads?

Let's start with these questions. Please note that I want us to not just look
at the current use-case but brainstorm more future use-cases, and then come up
with a solution that is more future-proof.

thanks,
Shakeel


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-21  3:47     ` Shakeel Butt
@ 2026-02-21  6:07       ` Chris Li
  2026-02-21 17:44         ` Shakeel Butt
  2026-02-21 14:30       ` YoungJun Park
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Li @ 2026-02-21  6:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: YoungJun Park, Andrew Morton, linux-mm, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, gunho.lee, taejoon.song, austin.kim

On Fri, Feb 20, 2026 at 7:47 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Please don't send a new version of the series before concluding the discussion
> on the previous one.

In this case I think it is fine. You haven't responded to YoungJun's
last response in over a week. He might have assumed that the discussion had
concluded.
Consider it one of the iterations. It is hard enough to contribute
to the kernel. Relax.
Plus, much of the discussion on the mailing list always has differing
opinions, so it is hard to determine what has truly been concluded.
Different people might have different interpretations of the same text.

>
> On Fri, Feb 13, 2026 at 12:58:40PM +0900, YoungJun Park wrote:
> > >
> > > One of the LPC feedback you missed is to not add memcg interface for
> > > this functionality and explore BPF way instead.
> > >
> > > We are normally very conservative to add new interfaces to cgroup.
> > > However I am not even convinced that memcg interface is the right way to
> > > expose this functionality. Swap is currently global and the idea to
> > > limit or assign specific swap devices to specific cgroups makes sense
> > > but that is the decision for the job orchestator or node controller.
> > > Allowing workloads to pick and choose swap devices do not make sense to
> > > me.
> >
> > Apologies for overlooking the feedback regarding the BPF approach. Thank you
> > for the suggestion.
>
> No need for apologies. These things take time and multiple iterations.
>
> >
> > I agree that using BPF would provide greater flexibility, allowing control not
> > just at the memcg level, but also per-process or for complex workloads.
> > (As like orchestrator and node controller)
>
> Yes it provides the flexibility but that is not the main reason I am pushing for
> it. The reason I want you to first try the BPF approach without introducing any
> stable interfaces. Show how swap tiers will be used and configured in production

Is that your biggest concern? Many different ways exist to solve that
problem. For example, we can put a config option protecting it and mark it as
experimental. This will unblock development and allow experimentation. We
can have more people try it out and give feedback.

> environment and then we can talk if a stable interface is needed. I am still not
> convinced that swap tiers need to be controlled hierarchically and the non-root
> should be able to control it.

Yes, my company uses different swap devices at different cgroup
levels. I did ask my coworker to confirm that usage. Control at the non-root
level is a real need.

>
> >
> > However, I am concerned that this level of freedom might introduce logical
> > contradictions, particularly regarding cgroup hierarchy semantics.
> >
> > For example, BPF might allow a topology that violates hierarchical constraints
> > (a concern that was also touched upon during LPC)
>
> Yes BPF provides more power but it is controlled by admin and admin can shoot
> their foot in multiple ways.

I think this swap device control is a very basic need. All your
objections to swap control in the cgroup can equally apply to
zswap.writeback. Unlike zswap.writeback, which only controls behavior on
the zswap side, this is a more generic control that covers swap devices
other than zswap as well. BTW, I raised the concern that zswap.writeback
was not generic enough, as swap control was limited, back when zswap was
proposed. We did hold back zswap.writeback. The consensus was that the
interface can be improved in later iterations. So here we are.

>
> >
> >   - Group A (Parent): Assigned to SSD1
> >   - Group B (Child of A): Assigned to SSD2
> >
> > If Group A has a `memory.swap.max` limit, and Group B swaps out to SSD2, it
> > creates a consistency issue. Group B consumes Group A's swap quota, but it is
> > utilizing a device (SSD2) that is distinct from the Parent's assignment. This
> > could lead to situations where the Parent's limit is exhausted by usage on a
> > device it effectively doesn't "own" or shouldn't be using.
> >
> > One might suggest restricting BPF to strictly adhere to these hierarchical
> > constraints.
>
> No need to constraint anything.
>
> Taking a step back, can you describe your use-case a bit more and share
> requirements?

There is a very long thread on the linux-mm mailing list. I'm too lazy to dig it up.

I can share our usage requirements to refresh your memory. We
internally use a cgroup swapfile control interface that has not been
upstreamed. With this, we can remove the need for that internal
interface and go upstream instead.
>
> You have multiple swap devices of different properties and you want to assign
> those swap devices to different workloads. Now couple of questions:
>
> 1. If more than one device is assign to a workload, do you want to have
>    some kind of ordering between them for the worklod or do you want option to
>    have round robin kind of policy?

It depends on the number of devices in the tiers. Different tiers
maintain an order; within the same tier, it is round robin.

>
> 2. What's the reason to use 'tiers' in the name? Is it similar to memory tiers
>    and you want promotion/demotion among the tiers?

I proposed the tier name. Guilty. Yes, it was inspired by memory tiers.
It is just different classes of swap speed. I am not fixed on the name. We
could also call it swap.device_speed_classes. You can suggest
alternatives.

Promotion / demotion is possible in the future. The current state,
without promotion or demotion, already provides value. Our current
deployment uses only one class of swap device at a time. However, I do
know that other companies use more than one class of swap device.

>
> 3. If a workload has multiple swap devices assigned, can you describe the
>    scenario where such workloads need to partition/divide given devices to their
>    sub-workloads?

In our deployment, we always use more than one swap device to reduce
swap device lock contention.
The job config can describe the swap speed it can tolerate. Some jobs
can tolerate slower speeds, while others cannot.

> Let's start with these questions. Please note that I want us to not just look at
> the current use-case but brainstorm more future use-cases and then come up with
> the solution which is more future proof.

Take zswap.writeback as an example. We had a solution that worked for
the requirements at that time. Incremental improvement is fine as well.
Usually, incremental progress is better. At least currently, there is a
real need to allow different cgroups to select different swap speeds.
There is a risk in being too future-proof: we might design things that
people in the future don't use the way we envisioned. I see that happen
too often as well.

So starting from the current need is a solid starting point. It's just
a different design philosophy. Each to their own.

That is the only use case I know of. YoungJun, feel free to add your
usage as well.

Chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-21  3:47     ` Shakeel Butt
  2026-02-21  6:07       ` Chris Li
@ 2026-02-21 14:30       ` YoungJun Park
  1 sibling, 0 replies; 23+ messages in thread
From: YoungJun Park @ 2026-02-21 14:30 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, linux-mm, Chris Li, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, gunho.lee, taejoon.song, austin.kim

On Fri, Feb 20, 2026 at 07:47:22PM -0800, Shakeel Butt wrote:
> Please don't send a new version of the series before concluding the discussion
> on the previous one.

Understood. Let's continue the discussion. :D

Chris has already provided a thorough response, but I would like to
add my perspective as well.

> Yes it provides the flexibility but that is not the main reason I am pushing for
> it. The reason I want you to first try the BPF approach without introducing any
> stable interfaces. Show how swap tiers will be used and configured in production
> environment and then we can talk if a stable interface is needed.

I understand your concern about committing to a stable interface too
early. As Chris suggested, we could reduce this concern by guarding
the interface behind a build-time config option or marking it as
experimental, which I will also touch on further below.

On that note, if BPF were to become the primary control mechanism,
I am not sure a memcg interface would still be needed at all, since
BPF already provides a high degree of freedom. However, that level
of freedom is also what concerns me -- BPF-driven swap device
assignments could subtly conflict with memcg hierarchy semantics in
ways that are hard to predict or debug. A more constrained memcg-based
approach might actually be safer in that regard.

> I am still not convinced that swap tiers need to be controlled
> hierarchically and the non-root should be able to control it.

I think this concern is closely tied to your question #3 below about
concrete use cases for partitioning devices across sub-workloads.
I hope my answer there helps clarify this.

> Yes BPF provides more power but it is controlled by admin and admin can shoot
> their foot in multiple ways.

As I mentioned above, I think guarding the feature behind a build-time
config or runtime constraints could keep the usage well-defined and
predictable, while still being useful.

> Taking a step back, can you describe your use-case a bit more and share
> requirements?

Our use case is simple for now.
We have two swap devices with different performance
characteristics and want to assign different swap devices to different
workloads (cgroups).

For some background, when I initially proposed this, I suggested allowing
per-cgroup swap device priorities so that it could also accommodate the
broader scenarios you mentioned. However, since even our own use case
does not require reversing swap priorities within a cgroup, we pivoted
to the "swap tier" mechanism that Chris proposed.

> 1. If more than one device is assign to a workload, do you want to have
>    some kind of ordering between them for the worklod or do you want option to
>    have round robin kind of policy?

Both. If devices are in the same tier with the same priority, round robin.
If they are in the same tier with different priorities, or in different
tiers, ordering applies. The current tier structure should be able to
satisfy either preference.

> 2. What's the reason to use 'tiers' in the name? Is it similar to memory tiers
>    and you want promotion/demotion among the tiers?

This was originally Chris's idea. I think he explained the rationale
well in his reply.

> 3. If a workload has multiple swap devices assigned, can you describe the
>    scenario where such workloads need to partition/divide given devices to their
>    sub-workloads?

One possible scenario is reducing lock contention by partitioning swap
devices between parent and child cgroups.

> Let's start with these questions. Please note that I want us to not just look at
> the current use-case but brainstorm more future use-cases and then come up with
> the solution which is more future proof.

We have clear production use cases from both our side and Chris's, and I
also presented a deployment example in the cover letter.

I think it is hard to design concretely for future use cases at this
point. When those needs become clearer, BPF, with its flexibility,
would be a better fit. I see BPF as a natural extension path
rather than a starting point.

For now, guarding the memcg and tier interfaces behind a CONFIG option would
let us move forward without committing to a stable interface, and
we can always pivot to BPF later if needed.

Thanks,
YoungJun Park


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-02-21  6:07       ` Chris Li
@ 2026-02-21 17:44         ` Shakeel Butt
  0 siblings, 0 replies; 23+ messages in thread
From: Shakeel Butt @ 2026-02-21 17:44 UTC (permalink / raw)
  To: Chris Li
  Cc: YoungJun Park, Andrew Morton, linux-mm, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, gunho.lee, taejoon.song, austin.kim

On Fri, Feb 20, 2026 at 10:07:44PM -0800, Chris Li wrote:
> >
[...]
> > >
> > > I agree that using BPF would provide greater flexibility, allowing control not
> > > just at the memcg level, but also per-process or for complex workloads.
> > > (As like orchestrator and node controller)
> >
> > Yes it provides the flexibility but that is not the main reason I am pushing for
> > it. The reason I want you to first try the BPF approach without introducing any
> > stable interfaces. Show how swap tiers will be used and configured in production
> 
> Is that your biggest concern?

No, that is secondary, because I am not seeing the real use-case for
controlling/partitioning swap devices among sub-workloads. Until that is
figured out, adding a stable API is not a good idea.

> Many different ways exist to solve that
> problem. e.g. We can put a config option protecting it and mark it as
> experimental. This will unblock the development allow experiment. We
> can have more people to try it out and give feedback.
> 
> > environment and then we can talk if a stable interface is needed. I am still not
> > convinced that swap tiers need to be controlled hierarchically and the non-root
> > should be able to control it.
> 
> Yes, my company uses a different swap device at different cgroup
> level. I did ask my coworker to confirm that usage. Control at the non
> root level is a real need.

I am assuming you meant Google, and particularly the Prodkernel team, not
Android or ChromeOS. Google's prodkernel used to have per-cgroup
swapfiles exposed through memory.swapfiles (if I remember correctly,
Suleiman implemented this along with ghost swapfiles). Later this was
deprecated (by Yu Zhao) and global (ghost) swapfiles were used instead.
The memory.swapfiles interface, instead of supporting real swapfiles,
started offering select options among default, ghost/zswap and real
(something like that). However, that interface was only used to disable
or enable zswap for a workload, and never for hierarchically
controlling the swap devices (Google's prodkernel only has zswap). Has
something changed?

> 
> >
> > >
> > > However, I am concerned that this level of freedom might introduce logical
> > > contradictions, particularly regarding cgroup hierarchy semantics.
> > >
> > > For example, BPF might allow a topology that violates hierarchical constraints
> > > (a concern that was also touched upon during LPC)
> >
> > Yes BPF provides more power but it is controlled by admin and admin can shoot
> > their foot in multiple ways.
> 
> I think this swap device control is a very basic need.

Please explain that very basic need.

> All your
> objections to swapping control in the group can equally apply to
> zswap.writeback. Unlike zswap.writeback, which only control from the
> zswap behavior. This is a more generic version control swap device
> other than zswap as well. BTW, I raised that concern about
> zswap.writeback was not generic enough as swap control was limited
> when zswap was proposed. We did hold back zswap.writeback. The
> consensers is interface can be improved as later iterations. So here
> we are.

This just motivates me to push back even harder on adding a new interface
without a clear use-case.

> 
> >
> > >
> > >   - Group A (Parent): Assigned to SSD1
> > >   - Group B (Child of A): Assigned to SSD2
> > >
> > > If Group A has a `memory.swap.max` limit, and Group B swaps out to SSD2, it
> > > creates a consistency issue. Group B consumes Group A's swap quota, but it is
> > > utilizing a device (SSD2) that is distinct from the Parent's assignment. This
> > > could lead to situations where the Parent's limit is exhausted by usage on a
> > > device it effectively doesn't "own" or shouldn't be using.
> > >
> > > One might suggest restricting BPF to strictly adhere to these hierarchical
> > > constraints.
> >
> > No need to constraint anything.
> >
> > Taking a step back, can you describe your use-case a bit more and share
> > requirements?
> 
> There is a very long thread on the linux-mm maillist. I'm too lazy to dig it up.
> 
> I can share our usage requirement to refresh your memory. We
> internally use a cgroup swapfile control interface that has not been
> upstreamed. With this we can remove the need of that internal
> interface and go upstream instead.

I already asked above, but let me say it again: what's the actual real-world
use-case for controlling/allowing/disallowing swap devices hierarchically?

> >
> > You have multiple swap devices of different properties and you want to assign
> > those swap devices to different workloads. Now couple of questions:
> >
> > 1. If more than one device is assign to a workload, do you want to have
> >    some kind of ordering between them for the worklod or do you want option to
> >    have round robin kind of policy?
> 
> It depends on the number of devices in the tiers. Different tiers
> maintain an order. Within the same tier round robin.
> 
> >
> > 2. What's the reason to use 'tiers' in the name? Is it similar to memory tiers
> >    and you want promotion/demotion among the tiers?
> 
> I propose the tier name. Guilty. Yes, in was inpired by memory tiers.
> It just different class of swap speeds. I am not fixed on the name. We
> can also call it swap.device_speed_classes. You can suggest
> alternatives.
> 
> Promotion / demotion is possible in the future. The current state,
> without promotion or demotion, already provides value. Our current
> deployment uses only one class of swap device at a time. However I do
> know other companies use  more than one class of swap device.
> 
> >
> > 3. If a workload has multiple swap devices assigned, can you describe the
> >    scenario where such workloads need to partition/divide given devices to their
> >    sub-workloads?
> 
> In our deployment, we always use more than one swap device to reduce
> swap device lock contention.

Having more than one swap device to reduce lock contention is unrelated
to hierarchically controlling swap devices among sub-workloads.



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2026-02-21 17:44 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-26  6:52 [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
2026-01-26  6:52 ` [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure Youngjun Park
2026-02-12  9:07   ` Chris Li
2026-02-13  2:18     ` YoungJun Park
2026-02-13 14:33     ` YoungJun Park
2026-01-26  6:52 ` [RFC PATCH v2 v2 2/5] mm: swap: associate swap devices with tiers Youngjun Park
2026-01-26  6:52 ` [RFC PATCH v2 v2 3/5] mm: memcontrol: add interface for swap tier selection Youngjun Park
2026-01-26  6:52 ` [RFC PATCH v2 v2 4/5] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
2026-02-12  7:37   ` Chris Li
2026-01-26  6:52 ` [RFC PATCH v2 v2 5/5] mm, swap: introduce percpu swap device cache to avoid fragmentation Youngjun Park
2026-02-12  6:12 ` [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
2026-02-12  9:22   ` Chris Li
2026-02-13  2:26     ` YoungJun Park
2026-02-13  1:59   ` YoungJun Park
2026-02-12 17:57 ` Nhat Pham
2026-02-12 17:58   ` Nhat Pham
2026-02-13  2:43   ` YoungJun Park
2026-02-12 18:33 ` Shakeel Butt
2026-02-13  3:58   ` YoungJun Park
2026-02-21  3:47     ` Shakeel Butt
2026-02-21  6:07       ` Chris Li
2026-02-21 17:44         ` Shakeel Butt
2026-02-21 14:30       ` YoungJun Park
