* [PATCH v4 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
@ 2026-02-17 0:09 Youngjun Park
2026-02-17 0:09 ` [PATCH v4 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Youngjun Park @ 2026-02-17 0:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Chris Li, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Michal Koutný,
gunho.lee, taejoon.song, austin.kim, youngjun.park
This is the fourth version of the "Swap Tiers" concept.
Following Chris Li's suggestion to focus on small, mergeable
steps, this series covers the core tier infrastructure and
memcg-based tier assignment as a minimal usable feature set.
Further extensions are deferred to subsequent series.
Previous versions:
RFC v3: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/
RFC v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
RFC v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
Overview (Recap)
================
Swap Tiers enable grouping swap devices into named tiers based on
performance characteristics (e.g., NVMe, HDD, Network). This allows
faster devices to be dedicated to latency-sensitive workloads while
slower devices serve background tasks. The concept was suggested by
Chris Li.
Changes in v4
=============
- Simplified control flow to flatten indentation (Chris Li)
- Added CONFIG option for MAX_SWAPTIER with a small default of 4
(Chris Li)
- Added memory.swap.tiers.effective read interface, following cpuset
convention of splitting into configuration and effective files
(Michal Koutný)
- Refined the cgroup documentation (Michal Koutný)
- Reworked save/restore logic into a clearer "snapshot and rollback"
model for improved readability and simpler control flow (Chris Li)
- Removed tier priority modification operation to reduce complexity;
may be revisited in a future series
- Added tier name validation: only alphanumeric characters and
underscores are allowed
- Fixed several edge case bugs
- Dropped the swap allocation logic change from this series:
integrating the percpu global cluster swap cache into the swap
device will be handled as part of Kairui Song's ongoing work.
- Rebased onto latest mm-new
Deferred and future work:
- Per-tier swap_active_head to reduce contention across tiers when
releasing swap entries on different tiers (Chris Li). This is an
improvement to swap_avail_head / swap_active_head that must be done
eventually, but it is not critical for the initial infrastructure.
- Round-robin rotation cleanup (Kairui) will be proposed after
this series lands, as swap tiers can naturally abstract away
round-robin behavior: round-robin is unnecessary when no
equal-priority devices exist, so it could possibly be disabled, or
the round-robin priority could be made selectable.
- BPF interfaces (Shakeel Butt) and users beyond memcg (including
per-VMA, DAMON, etc.) are potential future extensions once the base
infrastructure is established and real-world use cases are identified.
Changes in RFC v3
=================
- Fixed swap_alloc_fast() tier eligibility check
- Fixed tier_mask restoration on error paths
- Fixed priority -1 tier deletion bug
- Fixed !CONFIG_MEMCG build failures
- Improved commit messages
- Fixed improper error handling
- Fixed coding style violations
- Fixed tier deletion propagation to cgroups
Changes in RFC v2
=================
- Strict cgroup hierarchy compliance (LPC 2025 feedback)
- Percpu swap device cache to preserve fastpath performance
(Kairui Song, Baoquan He)
- Simplified tier structure (Chris Li)
- Removed explicit "+" selection; default is all tiers, use "-"
to exclude (Chris Li)
- Removed CONFIG_SWAP_TIER; now base kernel feature (Chris Li)
- Effective tier calculation moved to configuration time
(swap.tiers write)
- Mixed operation support for "+" and "-" in
/sys/kernel/mm/swap/tiers (Chris Li)
- Commit reorganization for clarity (Chris Li)
- Added tier priority modification support
- Added documentation for swap tiers concept and usage (Chris Li)
Real-world Results
==================
App preloading on our internal platform using NBD as a separate tier.
Without a separate swap tier:
- Cannot selectively avoid the default flash swap, so flash wear and
the resulting lifespan issues cannot be reduced.
- Cannot selectively assign NBD to specific apps that need it.
Result (cold launch vs. preloaded):
- Streaming App A: 13.17s → 4.18s (68% faster)
- Streaming App B: 5.60s → 1.12s (80% faster)
- E-commerce App C: 10.25s → 2.00s (80% faster)
Performance validation against baseline (no tiers configured) shows
negligible overhead (<1%) in kernel build and vm-scalability
benchmarks. Detailed results are in the RFC v2 cover letter.
Youngjun Park (4):
mm: swap: introduce swap tier infrastructure
mm: swap: associate swap devices with tiers
mm: memcontrol: add interfaces for swap tier selection
mm: swap: filter swap allocation by memcg tier mask
Documentation/admin-guide/cgroup-v2.rst | 27 ++
Documentation/mm/swap-tier.rst | 159 +++++++++
MAINTAINERS | 3 +
include/linux/memcontrol.h | 3 +-
include/linux/swap.h | 1 +
mm/Kconfig | 12 +
mm/Makefile | 2 +-
mm/memcontrol.c | 95 +++++
mm/swap.h | 4 +
mm/swap_state.c | 75 ++++
mm/swap_tier.c | 451 ++++++++++++++++++++++++
mm/swap_tier.h | 74 ++++
mm/swapfile.c | 22 +-
13 files changed, 922 insertions(+), 6 deletions(-)
create mode 100644 Documentation/mm/swap-tier.rst
create mode 100644 mm/swap_tier.c
create mode 100644 mm/swap_tier.h
base-commit: 776250964cbaa49ebe6b8bb2870765cc89cece59
--
2.34.1
* [PATCH v4 1/4] mm: swap: introduce swap tier infrastructure
2026-02-17 0:09 [PATCH v4 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
@ 2026-02-17 0:09 ` Youngjun Park
2026-02-17 15:27 ` kernel test robot
2026-02-17 0:09 ` [PATCH v4 2/4] mm: swap: associate swap devices with tiers Youngjun Park
` (2 subsequent siblings)
3 siblings, 1 reply; 7+ messages in thread
From: Youngjun Park @ 2026-02-17 0:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Chris Li, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Michal Koutný,
gunho.lee, taejoon.song, austin.kim, youngjun.park
This patch introduces the "Swap tier" concept, which serves as an
abstraction layer for managing swap devices based on their performance
characteristics (e.g., NVMe, HDD, Network swap).
Swap tiers are user-named groups representing priority ranges.
Tier names must consist of alphanumeric characters and underscores.
These tiers collectively cover the entire priority space from -1
(`DEF_SWAP_PRIO`) to `SHRT_MAX`.
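As a rough illustration of how start priorities alone define the full set of ranges, here is a minimal userspace sketch. It is not code from this patch: `struct tier_model`, `tier_end_prio()` and `tiers_config_is_valid()` are hypothetical names, and the tiers are assumed to be kept in an array sorted by descending start priority.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical userspace model of a tier: only the start priority is stored. */
struct tier_model {
	const char *name;
	short start;
};

/*
 * End priority of tiers[i], with tiers sorted by descending start
 * priority: the top tier ends at SHRT_MAX, every other tier ends one
 * below the start of the tier above it.
 */
short tier_end_prio(const struct tier_model *tiers, size_t n, size_t i)
{
	assert(i < n);
	return i == 0 ? SHRT_MAX : (short)(tiers[i - 1].start - 1);
}

/*
 * An empty configuration is accepted; a non-empty one must have its
 * lowest tier start at -1 so that the whole priority space is covered.
 */
int tiers_config_is_valid(const struct tier_model *tiers, size_t n)
{
	return n == 0 || tiers[n - 1].start == -1;
}
```

With tiers starting at 100, 50 and -1, the derived ranges are 100..SHRT_MAX, 50..99 and -1..49, so no end priority ever needs to be stored.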
To configure tiers, a new sysfs interface is exposed at
/sys/kernel/mm/swap/tiers. The input parser evaluates commands from
left to right and supports batch input, allowing users to add or remove
multiple tiers in a single write operation.
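The batch syntax can be modelled in userspace roughly as follows. This is only a sketch under assumptions, not the kernel parser: `struct tier_op`, `parse_tier_token()` and `parse_tier_batch()` are hypothetical names, it uses standard C string functions rather than `strsep()`, and it rejects over-long names as a simplification of its own.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

enum tier_op_kind { TIER_OP_ADD, TIER_OP_REMOVE };

struct tier_op {
	enum tier_op_kind kind;
	char name[16];
	short prio;	/* only meaningful for TIER_OP_ADD */
};

/* Parse one token, "+name:prio" or "-name". Returns 0 or -EINVAL. */
int parse_tier_token(const char *tok, size_t len, struct tier_op *op)
{
	const char *colon;
	char num[8], *end;
	size_t name_len, num_len;
	long val;

	if (len < 2 || (tok[0] != '+' && tok[0] != '-'))
		return -EINVAL;

	if (tok[0] == '-') {
		name_len = len - 1;
		if (name_len >= sizeof(op->name))
			return -EINVAL;
		op->kind = TIER_OP_REMOVE;
		memcpy(op->name, tok + 1, name_len);
		op->name[name_len] = '\0';
		op->prio = 0;
		return 0;
	}

	colon = memchr(tok + 1, ':', len - 1);
	if (!colon)
		return -EINVAL;
	name_len = (size_t)(colon - (tok + 1));
	num_len = len - name_len - 2;	/* minus '+' and ':' */
	if (!name_len || name_len >= sizeof(op->name) ||
	    !num_len || num_len >= sizeof(num))
		return -EINVAL;

	memcpy(num, colon + 1, num_len);
	num[num_len] = '\0';
	val = strtol(num, &end, 10);
	if (*end != '\0' || val < SHRT_MIN || val > SHRT_MAX)
		return -EINVAL;

	op->kind = TIER_OP_ADD;
	memcpy(op->name, tok + 1, name_len);
	op->name[name_len] = '\0';
	op->prio = (short)val;
	return 0;
}

/* Split buf on commas and whitespace, left to right. Returns op count or -errno. */
int parse_tier_batch(const char *buf, struct tier_op *ops, size_t max_ops)
{
	static const char delims[] = ", \t\n";
	size_t n = 0;

	for (;;) {
		size_t len;
		int ret;

		buf += strspn(buf, delims);	/* skip separators */
		if (!*buf)
			break;
		len = strcspn(buf, delims);
		if (n == max_ops)
			return -ENOSPC;
		ret = parse_tier_token(buf, len, &ops[n]);
		if (ret)
			return ret;
		n++;
		buf += len;
	}
	return (int)n;
}
```

Collecting all operations before applying any of them is a choice of this sketch; the patch instead applies each token as it is parsed and relies on a snapshot to undo a failed batch.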
Tier management enforces continuous priority ranges anchored by start
priorities. Operations trigger range splitting or merging, but overwriting
start priorities is forbidden. Merging expands lower tiers upwards to
preserve configured start priorities, except when removing `DEF_SWAP_PRIO`,
which merges downwards.
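The merge rule on removal can be sketched on a plain array, assuming the tiers are sorted by descending start priority. `struct tier_model`, `MODEL_DEF_PRIO` and `tier_model_remove()` are hypothetical names used only for this illustration; the patch itself operates on a linked list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_DEF_PRIO (-1)	/* stands in for DEF_SWAP_PRIO */

struct tier_model {
	char name[16];
	short start;
};

/*
 * Remove tiers[idx] from an array sorted by descending start priority
 * and return the new element count.
 *
 * Because end priorities are derived from the tier above, removing a
 * tier implicitly extends the tier below it upwards. Only the lowest
 * tier needs special handling: the tier above inherits the -1 start.
 */
size_t tier_model_remove(struct tier_model *tiers, size_t n, size_t idx)
{
	assert(idx < n);

	if (idx > 0 && idx == n - 1 && tiers[idx].start == MODEL_DEF_PRIO)
		tiers[idx - 1].start = MODEL_DEF_PRIO;

	memmove(&tiers[idx], &tiers[idx + 1],
		(n - idx - 1) * sizeof(*tiers));
	return n - 1;
}
```

Starting from SSD:100, HDD:50, NET:-1, removing SSD leaves HDD covering 50..SHRT_MAX, and then removing NET moves the start of HDD down to -1.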
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
MAINTAINERS | 2 +
mm/Kconfig | 12 ++
mm/Makefile | 2 +-
mm/swap.h | 4 +
mm/swap_state.c | 74 +++++++++++++
mm/swap_tier.c | 285 ++++++++++++++++++++++++++++++++++++++++++++++++
mm/swap_tier.h | 20 ++++
mm/swapfile.c | 7 +-
8 files changed, 402 insertions(+), 4 deletions(-)
create mode 100644 mm/swap_tier.c
create mode 100644 mm/swap_tier.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 18d1ebf053db..501bf46adfb4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16743,6 +16743,8 @@ F: mm/swap.c
F: mm/swap.h
F: mm/swap_table.h
F: mm/swap_state.c
+F: mm/swap_tier.c
+F: mm/swap_tier.h
F: mm/swapfile.c
MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
diff --git a/mm/Kconfig b/mm/Kconfig
index 0b5720186c71..0f76befc4a7e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,18 @@ menuconfig SWAP
used to provide more virtual memory than the actual RAM present
in your computer. If unsure say Y.
+config NR_SWAP_TIERS
+ int "Number of swap device tiers"
+ depends on SWAP
+ default 4
+ range 1 32
+ help
+ Sets the number of swap device tiers. Swap devices are
+ grouped into tiers based on their priority, allowing the
+ system to prefer faster devices over slower ones.
+
+ If unsure, say 4.
+
config ZSWAP
bool "Compressed cache for swap pages"
depends on SWAP
diff --git a/mm/Makefile b/mm/Makefile
index 53ca5d4b1929..3b3de2de7285 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o
endif
-obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o swap_tier.o
obj-$(CONFIG_ZSWAP) += zswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap.h b/mm/swap.h
index bfafa637c458..55f230cbe4e7 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -16,6 +16,10 @@ extern int page_cluster;
#define swap_entry_order(order) 0
#endif
+#define DEF_SWAP_PRIO -1
+
+extern spinlock_t swap_lock;
+extern struct plist_head swap_active_head;
extern struct swap_info_struct *swap_info[];
/*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6d0eef7470be..8129d714a44a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
#include "internal.h"
#include "swap_table.h"
#include "swap.h"
+#include "swap_tier.h"
/*
* swapper_space is a fiction, retained to simplify the path through
@@ -947,8 +948,81 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
}
static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
+static ssize_t tiers_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return swap_tiers_sysfs_show(buf);
+}
+
+static ssize_t tiers_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ char *p, *token, *name, *tmp;
+ int ret = 0;
+ short prio;
+
+ tmp = kstrdup(buf, GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+
+ spin_lock(&swap_lock);
+ spin_lock(&swap_tier_lock);
+ swap_tiers_snapshot();
+
+ p = tmp;
+ while ((token = strsep(&p, ", \t\n")) != NULL) {
+ if (!*token)
+ continue;
+
+ switch (token[0]) {
+ case '+':
+ name = token + 1;
+ token = strchr(name, ':');
+ if (!token) {
+ ret = -EINVAL;
+ goto restore;
+ }
+ *token++ = '\0';
+ if (kstrtos16(token, 10, &prio)) {
+ ret = -EINVAL;
+ goto restore;
+ }
+ ret = swap_tiers_add(name, prio);
+ if (ret)
+ goto restore;
+ break;
+ case '-':
+ ret = swap_tiers_remove(token + 1);
+ if (ret)
+ goto restore;
+ break;
+ default:
+ ret = -EINVAL;
+ goto restore;
+ }
+ }
+
+ if (!swap_tiers_validate()) {
+ ret = -EINVAL;
+ goto restore;
+ }
+ goto out;
+
+restore:
+ swap_tiers_snapshot_restore();
+out:
+ spin_unlock(&swap_tier_lock);
+ spin_unlock(&swap_lock);
+ kfree(tmp);
+ return ret ? ret : count;
+}
+
+static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
+
static struct attribute *swap_attrs[] = {
&vma_ra_enabled_attr.attr,
+ &tier_attr.attr,
NULL,
};
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
new file mode 100644
index 000000000000..62b60fa8d3b7
--- /dev/null
+++ b/mm/swap_tier.c
@@ -0,0 +1,285 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/swap.h>
+#include <linux/memcontrol.h>
+#include "memcontrol-v1.h"
+#include <linux/sysfs.h>
+#include <linux/plist.h>
+
+#include "swap.h"
+#include "swap_tier.h"
+
+#define MAX_SWAPTIER CONFIG_NR_SWAP_TIERS
+#define MAX_TIERNAME 16
+
+/*
+ * struct swap_tier - structure representing a swap tier.
+ *
+ * @name: name of the swap_tier.
+ * @prio: starting value of priority.
+ * @list: linked list of tiers.
+ */
+static struct swap_tier {
+ char name[MAX_TIERNAME];
+ short prio;
+ struct list_head list;
+} swap_tiers[MAX_SWAPTIER];
+
+DEFINE_SPINLOCK(swap_tier_lock);
+/* active swap priority list, sorted in descending order */
+static LIST_HEAD(swap_tier_active_list);
+/* unused swap_tier object */
+static LIST_HEAD(swap_tier_inactive_list);
+
+#define TIER_IDX(tier) ((tier) - swap_tiers)
+#define TIER_MASK(tier) (1 << TIER_IDX(tier))
+#define TIER_INACTIVE_PRIO (DEF_SWAP_PRIO - 1)
+#define TIER_IS_ACTIVE(tier) ((tier)->prio != TIER_INACTIVE_PRIO)
+#define TIER_END_PRIO(tier) \
+ (!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
+ list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
+
+#define for_each_tier(tier, idx) \
+ for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
+ idx++, tier = &swap_tiers[idx])
+
+#define for_each_active_tier(tier) \
+ list_for_each_entry(tier, &swap_tier_active_list, list)
+
+#define for_each_inactive_tier(tier) \
+ list_for_each_entry(tier, &swap_tier_inactive_list, list)
+
+/*
+ * Naming Convention:
+ * swap_tiers_*() - Public/exported functions
+ * swap_tier_*() - Private/internal functions
+ */
+
+static bool swap_tier_is_active(void)
+{
+ return !list_empty(&swap_tier_active_list) ? true : false;
+}
+
+static struct swap_tier *swap_tier_lookup(const char *name)
+{
+ struct swap_tier *tier;
+
+ for_each_active_tier(tier) {
+ if (!strcmp(tier->name, name))
+ return tier;
+ }
+
+ return NULL;
+}
+
+/* Insert new tier into the active list sorted by priority. */
+static void swap_tier_activate(struct swap_tier *new)
+{
+ struct swap_tier *tier;
+
+ for_each_active_tier(tier) {
+ if (tier->prio <= new->prio)
+ break;
+ }
+
+ list_add_tail(&new->list, &tier->list);
+}
+
+static void swap_tier_inactivate(struct swap_tier *tier)
+{
+ list_move(&tier->list, &swap_tier_inactive_list);
+ tier->prio = TIER_INACTIVE_PRIO;
+}
+
+void swap_tiers_init(void)
+{
+ struct swap_tier *tier;
+ int idx;
+
+ BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+
+ for_each_tier(tier, idx) {
+ INIT_LIST_HEAD(&tier->list);
+ swap_tier_inactivate(tier);
+ }
+}
+
+ssize_t swap_tiers_sysfs_show(char *buf)
+{
+ struct swap_tier *tier;
+ ssize_t len = 0;
+
+ len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
+ "Name", "Idx", "PrioStart", "PrioEnd");
+
+ spin_lock(&swap_tier_lock);
+ for_each_active_tier(tier) {
+ len += sysfs_emit_at(buf, len, "%-16s %-5td %-11d %-11d\n",
+ tier->name,
+ TIER_IDX(tier),
+ tier->prio,
+ TIER_END_PRIO(tier));
+ }
+ spin_unlock(&swap_tier_lock);
+
+ return len;
+}
+
+static struct swap_tier *swap_tier_prepare(const char *name, short prio)
+{
+ struct swap_tier *tier;
+
+ lockdep_assert_held(&swap_tier_lock);
+
+ if (prio < DEF_SWAP_PRIO)
+ return ERR_PTR(-EINVAL);
+
+ if (list_empty(&swap_tier_inactive_list))
+ return ERR_PTR(-ENOSPC);
+
+ tier = list_first_entry(&swap_tier_inactive_list,
+ struct swap_tier, list);
+
+ list_del_init(&tier->list);
+ strscpy(tier->name, name, MAX_TIERNAME);
+ tier->prio = prio;
+
+ return tier;
+}
+
+static int swap_tier_check_range(short prio)
+{
+ struct swap_tier *tier;
+
+ lockdep_assert_held(&swap_lock);
+ lockdep_assert_held(&swap_tier_lock);
+
+ for_each_active_tier(tier) {
+ /* No overwrite */
+ if (tier->prio == prio)
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static bool swap_tier_validate_name(const char *name)
+{
+ if (!name || !*name)
+ return false;
+
+ while (*name) {
+ if (!isalnum(*name) && *name != '_')
+ return false;
+ name++;
+ }
+ return true;
+}
+
+int swap_tiers_add(const char *name, int prio)
+{
+ int ret;
+ struct swap_tier *tier;
+
+ lockdep_assert_held(&swap_lock);
+ lockdep_assert_held(&swap_tier_lock);
+
+ /* Duplicate check */
+ if (swap_tier_lookup(name))
+ return -EEXIST;
+
+ if (!swap_tier_validate_name(name))
+ return -EINVAL;
+
+ ret = swap_tier_check_range(prio);
+ if (ret)
+ return ret;
+
+ tier = swap_tier_prepare(name, prio);
+ if (IS_ERR(tier)) {
+ ret = PTR_ERR(tier);
+ return ret;
+ }
+
+ swap_tier_activate(tier);
+
+ return ret;
+}
+
+int swap_tiers_remove(const char *name)
+{
+ int ret = 0;
+ struct swap_tier *tier;
+
+ lockdep_assert_held(&swap_lock);
+ lockdep_assert_held(&swap_tier_lock);
+
+ tier = swap_tier_lookup(name);
+ if (!tier)
+ return -EINVAL;
+
+ /* Removing DEF_SWAP_PRIO merges into the higher tier. */
+ if (!list_is_singular(&swap_tier_active_list)
+ && tier->prio == DEF_SWAP_PRIO)
+ list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
+
+ swap_tier_inactivate(tier);
+
+ return ret;
+}
+
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+/*
+ * XXX: When multiple operations (adds and removes) are submitted in a
+ * single write, reverting each individually on failure is complex and
+ * error-prone. Instead, snapshot the entire state beforehand and
+ * restore it wholesale if any operation fails.
+ */
+void swap_tiers_snapshot(void)
+{
+ BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
+
+ lockdep_assert_held(&swap_lock);
+ lockdep_assert_held(&swap_tier_lock);
+
+ memcpy(swap_tiers_snap, swap_tiers, sizeof(swap_tiers));
+}
+
+void swap_tiers_snapshot_restore(void)
+{
+ struct swap_tier *tier;
+ int idx;
+
+ lockdep_assert_held(&swap_lock);
+ lockdep_assert_held(&swap_tier_lock);
+
+ memcpy(swap_tiers, swap_tiers_snap, sizeof(swap_tiers));
+
+ INIT_LIST_HEAD(&swap_tier_active_list);
+ INIT_LIST_HEAD(&swap_tier_inactive_list);
+
+ for_each_tier(tier, idx) {
+ if (TIER_IS_ACTIVE(tier))
+ swap_tier_activate(tier);
+ else
+ list_add(&tier->list, &swap_tier_inactive_list);
+ }
+}
+
+bool swap_tiers_validate(void)
+{
+ struct swap_tier *tier;
+
+ /*
+ * Initial setting might not cover DEF_SWAP_PRIO.
+ * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
+ */
+ if (swap_tier_is_active()) {
+ tier = list_last_entry(&swap_tier_active_list,
+ struct swap_tier, list);
+
+ if (tier->prio != DEF_SWAP_PRIO)
+ return false;
+ }
+
+ return true;
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
new file mode 100644
index 000000000000..a1395ec02c24
--- /dev/null
+++ b/mm/swap_tier.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_TIER_H
+#define _SWAP_TIER_H
+
+#include <linux/types.h>
+#include <linux/spinlock.h>
+
+extern spinlock_t swap_tier_lock;
+
+/* Initialization and application */
+void swap_tiers_init(void);
+ssize_t swap_tiers_sysfs_show(char *buf);
+
+int swap_tiers_add(const char *name, int prio);
+int swap_tiers_remove(const char *name);
+
+void swap_tiers_snapshot(void);
+void swap_tiers_snapshot_restore(void);
+bool swap_tiers_validate(void);
+#endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c6863ff7152c..1f93df281ede 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -50,6 +50,7 @@
#include "internal.h"
#include "swap_table.h"
#include "swap.h"
+#include "swap_tier.h"
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
@@ -65,7 +66,7 @@ static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
enum swap_cluster_flags new_flags);
-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
atomic_long_t nr_swap_pages;
/*
@@ -76,7 +77,6 @@ atomic_long_t nr_swap_pages;
EXPORT_SYMBOL_GPL(nr_swap_pages);
/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
long total_swap_pages;
-#define DEF_SWAP_PRIO -1
unsigned long swapfile_maximum_size;
#ifdef CONFIG_MIGRATION
bool swap_migration_ad_supported;
@@ -89,7 +89,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
* all active swap_info_structs
* protected with swap_lock, and ordered by priority.
*/
-static PLIST_HEAD(swap_active_head);
+PLIST_HEAD(swap_active_head);
/*
* all available (active, not full) swap_info_structs
@@ -3977,6 +3977,7 @@ static int __init swapfile_init(void)
swap_migration_ad_supported = true;
#endif /* CONFIG_MIGRATION */
+ swap_tiers_init();
return 0;
}
subsys_initcall(swapfile_init);
--
2.34.1
* [PATCH v4 2/4] mm: swap: associate swap devices with tiers
2026-02-17 0:09 [PATCH v4 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
2026-02-17 0:09 ` [PATCH v4 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2026-02-17 0:09 ` Youngjun Park
2026-02-17 0:09 ` [PATCH v4 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
2026-02-17 0:09 ` [PATCH v4 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
3 siblings, 0 replies; 7+ messages in thread
From: Youngjun Park @ 2026-02-17 0:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Chris Li, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Michal Koutný,
gunho.lee, taejoon.song, austin.kim, youngjun.park
This patch connects swap devices to the swap tier infrastructure,
ensuring that devices are correctly assigned to tiers based on their
priority.
A `tier_mask` is added to identify the tier membership of swap devices.
Although tier-based allocation logic is not yet implemented, this
mapping is necessary to track which tier a device belongs to. Upon
activation, the device is assigned to a tier by matching its priority
against the configured tier ranges.
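The matching step can be pictured with the following userspace sketch. It is an illustration under assumptions rather than the patch code: `struct tier_model`, `MODEL_DEFAULT_MASK` and `tier_mask_for_prio()` are hypothetical names, tiers are assumed sorted by descending start priority, and each tier owns one bit chosen by its slot index.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bit 31 is reserved in this model for "no tier configured". */
#define MODEL_DEFAULT_MASK (1u << 31)

struct tier_model {
	unsigned int idx;	/* slot index, selects the mask bit */
	short start;		/* start priority of the tier */
};

/*
 * Return the mask bit of the tier whose range contains prio, or the
 * default mask when no tier matches (for example, no tiers exist yet).
 * The end of each range is derived from the start of the tier above.
 */
unsigned int tier_mask_for_prio(const struct tier_model *tiers, size_t n,
				short prio)
{
	short end = SHRT_MAX;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(tiers[i].idx < 31);
		if (tiers[i].start <= prio && prio <= end)
			return 1u << tiers[i].idx;
		end = (short)(tiers[i].start - 1);
	}
	return MODEL_DEFAULT_MASK;
}
```

A device activated at priority 60 with tiers starting at 100, 50 and -1 therefore lands in the middle tier, while a device activated before any tier exists keeps the default mask until tiers are configured.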
The infrastructure allows dynamic modification of tiers, such as
splitting or merging ranges. These operations are permitted provided
that the tier assignment of already configured swap devices remains
unchanged.
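The constraint above amounts to a simple check per active device, sketched here in userspace. `struct dev_model` and `tier_can_split_at()` are hypothetical names for this illustration, and each device is assumed to carry the range of the tier it is currently assigned to.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dev_model {
	short prio;		/* swapon priority of the device */
	short tier_start;	/* range of the tier it is assigned to */
	short tier_end;
};

/*
 * Check whether a new tier boundary may be placed at new_prio.
 * A boundary inside a device's tier is refused when the device sits at
 * or above it, because the device would move to a different tier.
 * Returns 0 if allowed, -EBUSY otherwise.
 */
int tier_can_split_at(const struct dev_model *devs, size_t n, short new_prio)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(devs[i].tier_start <= devs[i].prio &&
		       devs[i].prio <= devs[i].tier_end);

		/* This device's tier is not the one being split. */
		if (new_prio < devs[i].tier_start ||
		    new_prio > devs[i].tier_end)
			continue;

		if (devs[i].prio >= new_prio)
			return -EBUSY;
	}
	return 0;
}
```

For a device at priority 60 in a tier spanning 50..SHRT_MAX, a boundary at 100 is accepted, while boundaries at 60 or 50 are refused. The latter case is also how a removal of that tier is rejected in this model.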
This patch also adds the documentation for the swap tier feature,
covering the core concepts, sysfs interface usage, and configuration
details.
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
Documentation/mm/swap-tier.rst | 159 +++++++++++++++++++++++++++++++++
MAINTAINERS | 1 +
include/linux/swap.h | 1 +
mm/swap_state.c | 2 +-
mm/swap_tier.c | 101 ++++++++++++++++++---
mm/swap_tier.h | 12 ++-
mm/swapfile.c | 2 +
7 files changed, 264 insertions(+), 14 deletions(-)
create mode 100644 Documentation/mm/swap-tier.rst
diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
new file mode 100644
index 000000000000..7b29b0e4e414
--- /dev/null
+++ b/Documentation/mm/swap-tier.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li <chrisl@kernel.org>, Youngjun Park <youngjun.park@lge.com>
+
+==========
+Swap Tier
+==========
+
+Swap tier is a collection of user-named groups classified by priority ranges.
+It acts as a facilitation layer, allowing users to manage swap devices based
+on their speeds.
+
+Users are encouraged to assign swap device priorities according to device
+speed to fully utilize this feature. While the current implementation is
+integrated with cgroups, the concept is designed to be extensible for other
+subsystems in the future.
+
+Use case
+--------
+
+Users can perform selective swapping by choosing a swap tier assigned according
+to speed within a cgroup.
+
+For more information on cgroup v2, please refer to
+``Documentation/admin-guide/cgroup-v2.rst``.
+
+Priority Range
+--------------
+
+The specified tiers must cover the entire priority range from -1
+(DEF_SWAP_PRIO) to SHRT_MAX.
+
+Consistency
+-----------
+
+Tier consistency is guaranteed with a focus on maximizing flexibility. When a
+swap device is activated within a tier range, the tier covering that device's
+priority is guaranteed not to disappear or change while the device remains
+active. Adding a new tier may split the range of an existing tier, but the
+active device's tier assignment remains unchanged.
+
+However, specifying a tier in a cgroup does not guarantee the tier's existence.
+Consequently, the corresponding tier can disappear at any time.
+
+Configuration Interface
+-----------------------
+
+The swap tiers can be configured via the following interface:
+
+/sys/kernel/mm/swap/tiers
+
+Operations can be performed using the following syntax:
+
+* Add: ``+<tiername>:<start_priority>``
+* Remove: ``-<tiername>``
+
+Tier names must consist of alphanumeric characters and underscores. Multiple
+operations can be provided in a single write, separated by commas (",") or
+whitespace (spaces, tabs, newlines).
+
+When configuring tiers, the specified value represents the **start priority**
+of that tier. The end priority is automatically determined by the start
+priority of the next higher tier. Consequently, adding a tier
+automatically adjusts the ranges of adjacent tiers to ensure continuity.
+
+Examples
+--------
+
+**1. Initialization**
+
+A tier starting at -1 is mandatory to cover the entire priority range up to
+SHRT_MAX. In this example, 'HDD' starts at 50, and 'NET' covers the remaining
+lower range starting from -1.
+
+::
+
+ # echo "+HDD:50, +NET:-1" > /sys/kernel/mm/swap/tiers
+ # cat /sys/kernel/mm/swap/tiers
+ Name Idx PrioStart PrioEnd
+ HDD 0 50 32767
+ NET 1 -1 49
+
+**2. Adding a New Tier (split)**
+
+A new tier 'SSD' is added at priority 100, splitting the existing 'HDD' tier.
+The ranges are automatically recalculated:
+
+* 'SSD' takes the top range (100 to SHRT_MAX).
+* 'HDD' is adjusted to the range between 'NET' and 'SSD' (50 to 99).
+* 'NET' remains unchanged (-1 to 49).
+
+::
+
+ # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+ # cat /sys/kernel/mm/swap/tiers
+ Name Idx PrioStart PrioEnd
+ SSD 2 100 32767
+ HDD 0 50 99
+ NET 1 -1 49
+
+**3. Removal (merge)**
+
+Tiers can be removed using the '-' prefix.
+::
+
+ # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+When a tier is removed, its priority range is merged into the adjacent
+tier. The merge direction is always upward (the tier below expands),
+except when the lowest tier is removed — in that case the tier above
+shifts its starting priority down to -1 to maintain full range coverage.
+
+::
+
+ Initial state:
+ Name Idx PrioStart PrioEnd
+ SSD 2 100 32767
+ HDD 1 50 99
+ NET 0 -1 49
+
+ # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+ Name Idx PrioStart PrioEnd
+ HDD 1 50 32767 <- merged with SSD's range
+ NET 0 -1 49
+
+ # echo "-NET" > /sys/kernel/mm/swap/tiers
+
+ Name Idx PrioStart PrioEnd
+ HDD 1 -1 32767 <- shifted down to -1
+
+**4. Interaction with Active Swap Devices**
+
+If a swap device is active (swapon), the tier covering that device's
+priority cannot be removed. Splitting the active tier's range is only
+allowed above the device's priority.
+
+Assume a swap device is active at priority 60 (inside 'HDD' tier).
+
+::
+
+ # swapon -p 60 /dev/zram0
+
+ Name Idx PrioStart PrioEnd
+ HDD 0 50 32767
+ NET 1 -1 49
+
+ # echo "-HDD" > /sys/kernel/mm/swap/tiers
+ -bash: echo: write error: Device or resource busy
+
+ # echo "+SSD:60" > /sys/kernel/mm/swap/tiers
+ -bash: echo: write error: Device or resource busy
+
+ # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+
+ Name Idx PrioStart PrioEnd
+ SSD 2 100 32767
+ HDD 0 50 99 <- device (prio 60) stays here
+ NET 1 -1 49
diff --git a/MAINTAINERS b/MAINTAINERS
index 501bf46adfb4..fa05a39d9ad1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16735,6 +16735,7 @@ R: Barry Song <baohua@kernel.org>
L: linux-mm@kvack.org
S: Maintained
F: Documentation/mm/swap-table.rst
+F: Documentation/mm/swap-tier.rst
F: include/linux/swap.h
F: include/linux/swapfile.h
F: include/linux/swapops.h
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..1e68c220a0e7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -262,6 +262,7 @@ struct swap_info_struct {
struct percpu_ref users; /* indicate and keep swap device valid. */
unsigned long flags; /* SWP_USED etc: see above */
signed short prio; /* swap priority of this type */
+ int tier_mask; /* swap tier mask */
struct plist_node list; /* entry in swap_active_head */
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8129d714a44a..513d74dc1709 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -1003,7 +1003,7 @@ static ssize_t tiers_store(struct kobject *kobj,
}
}
- if (!swap_tiers_validate()) {
+ if (!swap_tiers_update()) {
ret = -EINVAL;
goto restore;
}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 62b60fa8d3b7..91aac55d3a8b 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -38,6 +38,8 @@ static LIST_HEAD(swap_tier_inactive_list);
(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
+#define MASK_TO_TIER(mask) (&swap_tiers[__ffs((mask))])
+
#define for_each_tier(tier, idx) \
for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
idx++, tier = &swap_tiers[idx])
@@ -59,6 +61,26 @@ static bool swap_tier_is_active(void)
return !list_empty(&swap_tier_active_list) ? true : false;
}
+static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio)
+{
+ if (tier->prio <= prio && TIER_END_PRIO(tier) >= prio)
+ return true;
+
+ return false;
+}
+
+static bool swap_tier_prio_is_used(short prio)
+{
+ struct swap_tier *tier;
+
+ for_each_active_tier(tier) {
+ if (tier->prio == prio)
+ return true;
+ }
+
+ return false;
+}
+
static struct swap_tier *swap_tier_lookup(const char *name)
{
struct swap_tier *tier;
@@ -96,6 +118,7 @@ void swap_tiers_init(void)
int idx;
BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+ BUILD_BUG_ON(MAX_SWAPTIER > TIER_DEFAULT_IDX);
for_each_tier(tier, idx) {
INIT_LIST_HEAD(&tier->list);
@@ -146,17 +169,29 @@ static struct swap_tier *swap_tier_prepare(const char *name, short prio)
return tier;
}
-static int swap_tier_check_range(short prio)
+static int swap_tier_can_split_range(short new_prio)
{
+ struct swap_info_struct *p;
struct swap_tier *tier;
lockdep_assert_held(&swap_lock);
lockdep_assert_held(&swap_tier_lock);
- for_each_active_tier(tier) {
- /* No overwrite */
- if (tier->prio == prio)
- return -EINVAL;
+ plist_for_each_entry(p, &swap_active_head, list) {
+ if (p->tier_mask == TIER_DEFAULT_MASK)
+ continue;
+
+ tier = MASK_TO_TIER(p->tier_mask);
+ if (!swap_tier_prio_in_range(tier, new_prio))
+ continue;
+
+ /*
+ * Device sits in a tier that spans new_prio;
+ * splitting here would reassign it to a
+ * different tier.
+ */
+ if (p->prio >= new_prio)
+ return -EBUSY;
}
return 0;
@@ -190,7 +225,11 @@ int swap_tiers_add(const char *name, int prio)
if (!swap_tier_validate_name(name))
return -EINVAL;
- ret = swap_tier_check_range(prio);
+ /* No overwrite */
+ if (swap_tier_prio_is_used(prio))
+ return -EBUSY;
+
+ ret = swap_tier_can_split_range(prio);
if (ret)
return ret;
@@ -217,6 +256,11 @@ int swap_tiers_remove(const char *name)
if (!tier)
return -EINVAL;
+ /* Simulate adding a tier to check for conflicts */
+ ret = swap_tier_can_split_range(tier->prio);
+ if (ret)
+ return ret;
+
/* Removing DEF_SWAP_PRIO merges into the higher tier. */
if (!list_is_singular(&swap_tier_active_list)
&& tier->prio == DEF_SWAP_PRIO)
@@ -227,13 +271,15 @@ int swap_tiers_remove(const char *name)
return ret;
}
-static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
/*
- * XXX: When multiple operations (adds and removes) are submitted in a
- * single write, reverting each individually on failure is complex and
- * error-prone. Instead, snapshot the entire state beforehand and
- * restore it wholesale if any operation fails.
+ * XXX: Static global snapshot buffer for batch operations. It is small
+ * and used once per write, so a static global is acceptable.
+ * When multiple adds/removes are submitted in a single write,
+ * reverting each individually on failure is error-prone. Instead,
+ * snapshot beforehand and restore wholesale if any operation fails.
*/
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+
void swap_tiers_snapshot(void)
{
BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
@@ -265,9 +311,29 @@ void swap_tiers_snapshot_restore(void)
}
}
-bool swap_tiers_validate(void)
+void swap_tiers_assign_dev(struct swap_info_struct *swp)
+{
+ struct swap_tier *tier;
+
+ lockdep_assert_held(&swap_lock);
+
+ for_each_active_tier(tier) {
+ if (swap_tier_prio_in_range(tier, swp->prio)) {
+ swp->tier_mask = TIER_MASK(tier);
+ return;
+ }
+ }
+
+ swp->tier_mask = TIER_DEFAULT_MASK;
+}
+
+bool swap_tiers_update(void)
{
struct swap_tier *tier;
+ struct swap_info_struct *swp;
+
+ lockdep_assert_held(&swap_lock);
+ lockdep_assert_held(&swap_tier_lock);
/*
* Initial setting might not cover DEF_SWAP_PRIO.
@@ -281,5 +347,16 @@ bool swap_tiers_validate(void)
return false;
}
+ /*
+ * On the first tier configuration, devices that still carry
+ * TIER_DEFAULT_MASK are assigned to the tier covering their priority.
+ */
+ plist_for_each_entry(swp, &swap_active_head, list) {
+ /* Tier is already configured */
+ if (swp->tier_mask != TIER_DEFAULT_MASK)
+ break;
+ swap_tiers_assign_dev(swp);
+ }
+
return true;
}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index a1395ec02c24..6f281e95ed81 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -5,8 +5,15 @@
#include <linux/types.h>
#include <linux/spinlock.h>
+/* Forward declarations */
+struct swap_info_struct;
+
extern spinlock_t swap_tier_lock;
+#define TIER_ALL_MASK (~0)
+#define TIER_DEFAULT_IDX (31)
+#define TIER_DEFAULT_MASK (1 << TIER_DEFAULT_IDX)
+
/* Initialization and application */
void swap_tiers_init(void);
ssize_t swap_tiers_sysfs_show(char *buf);
@@ -16,5 +23,8 @@ int swap_tiers_remove(const char *name);
void swap_tiers_snapshot(void);
void swap_tiers_snapshot_restore(void);
-bool swap_tiers_validate(void);
+bool swap_tiers_update(void);
+
+/* Tier assignment */
+void swap_tiers_assign_dev(struct swap_info_struct *swp);
#endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1f93df281ede..2f956b6a5edc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2672,6 +2672,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
/* Add back to available list */
add_to_avail_list(si, true);
+
+ swap_tiers_assign_dev(si);
}
static void enable_swap_info(struct swap_info_struct *si, int prio,
--
2.34.1
* [PATCH v4 3/4] mm: memcontrol: add interfaces for swap tier selection
2026-02-17 0:09 [PATCH v4 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
2026-02-17 0:09 ` [PATCH v4 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
2026-02-17 0:09 ` [PATCH v4 2/4] mm: swap: associate swap devices with tiers Youngjun Park
@ 2026-02-17 0:09 ` Youngjun Park
2026-02-17 12:18 ` kernel test robot
2026-02-17 0:09 ` [PATCH v4 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
3 siblings, 1 reply; 7+ messages in thread
From: Youngjun Park @ 2026-02-17 0:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Chris Li, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Michal Koutný,
gunho.lee, taejoon.song, austin.kim, youngjun.park
Integrate swap tier infrastructure with cgroup to allow selecting specific
swap devices per cgroup.
Introduce `memory.swap.tiers` for configuring allowed tiers, and
`memory.swap.tiers.effective` for exposing the effective tiers.
The effective tiers are the intersection of the configured tiers and
the parent's effective tiers.
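The intersection rule can be sketched in userspace C as follows. This is
only an illustration under assumptions: `struct cg`, `cg_inherit_mask()` and
`cg_sync_masks()` are hypothetical stand-ins, not the kernel code, and the
subtree walk is modelled as an array in which every parent precedes its
children.

```c
#include <assert.h>
#include <stddef.h>

#define TIER_ALL_MASK (~0)

/* Hypothetical stand-in for struct mem_cgroup; only the masks matter. */
struct cg {
	struct cg *parent;
	int tier_mask;           /* configured through memory.swap.tiers */
	int tier_effective_mask; /* shown by memory.swap.tiers.effective */
};

/* effective = parent's effective mask (all tiers at the top) & own config */
static void cg_inherit_mask(struct cg *cg)
{
	int parent_eff = cg->parent ? cg->parent->tier_effective_mask
				    : TIER_ALL_MASK;

	cg->tier_effective_mask = parent_eff & cg->tier_mask;
}

/*
 * Recompute a subtree. The array must be ordered so that every parent
 * comes before its children, as a pre-order walk would visit them.
 */
static void cg_sync_masks(struct cg **preorder, size_t n)
{
	assert(preorder != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		cg_inherit_mask(preorder[i]);
}
```

With a parent that disables one tier and a child left at the default, the
child's effective mask loses the same bit, which matches the rule that a
child cannot enable a tier disabled in its parent.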
Note that cgroups do not pin swap tiers, similar to `cpuset` and CPU
hotplug, allowing configuration changes regardless of usage.
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
Documentation/admin-guide/cgroup-v2.rst | 27 +++++++
include/linux/memcontrol.h | 3 +-
mm/memcontrol.c | 95 +++++++++++++++++++++++++
mm/swap_state.c | 5 +-
mm/swap_tier.c | 93 +++++++++++++++++++++++-
mm/swap_tier.h | 56 +++++++++++++--
6 files changed, 268 insertions(+), 11 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7f5b59d95fce..fbe96ef3517c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1848,6 +1848,33 @@ The following nested keys are defined.
Swap usage hard limit. If a cgroup's swap usage reaches this
limit, anonymous memory of the cgroup will not be swapped out.
+ memory.swap.tiers
+ A read-write file which exists on non-root cgroups.
+ Format is similar to cgroup.subtree_control.
+
+ Controls which swap tiers this cgroup is allowed to swap
+ out to. All tiers are enabled by default.
+
+ (-|+)TIER [(-|+)TIER ...]
+
+ "-" disables a tier, "+" re-enables it.
+ Entries are whitespace-delimited.
+
+ Changes here are combined with parent restrictions to
+ compute memory.swap.tiers.effective.
+
+ If a tier is removed from /sys/kernel/mm/swap/tiers,
+ any prior disable for that tier is invalidated.
+
+ memory.swap.tiers.effective
+ A read-only file which exists on non-root cgroups.
+
+ Shows the tiers this cgroup can actually swap out to.
+ This is the intersection of the parent's effective tiers
+ and this cgroup's own memory.swap.tiers configuration.
+ A child cannot enable a tier that is disabled in its
+ parent.
+
memory.swap.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6c82c8f73e1..542bee1b5f60 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,7 +283,8 @@ struct mem_cgroup {
/* per-memcg mm_struct list */
struct lru_gen_mm_list mm_list;
#endif
-
+ int tier_mask;
+ int tier_effective_mask;
#ifdef CONFIG_MEMCG_V1
/* Legacy consumer-oriented counters */
struct page_counter kmem; /* v1 only */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 007413a53b45..fa6e2b2355fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
#include <net/ip.h>
#include "slab.h"
#include "memcontrol-v1.h"
+#include "swap_tier.h"
#include <linux/uaccess.h>
@@ -3792,6 +3793,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
WRITE_ONCE(memcg->zswap_writeback, true);
#endif
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
+ memcg->tier_mask = TIER_ALL_MASK;
+ swap_tiers_memcg_inherit_mask(memcg, parent);
+
if (parent) {
WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
@@ -5352,6 +5356,86 @@ static int swap_events_show(struct seq_file *m, void *v)
return 0;
}
+static int swap_tier_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ swap_tiers_mask_show(m, memcg->tier_mask);
+ return 0;
+}
+
+static ssize_t swap_tier_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ char *pos, *token;
+ int ret = 0;
+ int original_mask;
+
+ pos = strstrip(buf);
+
+ spin_lock(&swap_tier_lock);
+ if (!*pos) {
+ memcg->tier_mask = TIER_ALL_MASK;
+ goto sync;
+ }
+
+ original_mask = memcg->tier_mask;
+
+ while ((token = strsep(&pos, " \t\n")) != NULL) {
+ int mask;
+
+ if (!*token)
+ continue;
+
+ if (token[0] != '-' && token[0] != '+') {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ mask = swap_tiers_mask_lookup(token+1);
+ if (!mask) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ /*
+ * Only this cgroup's configured mask changes here. The effective
+ * mask is recomputed against the parent afterwards, so a child
+ * cannot enable a tier that its parent has disabled.
+ */
+ switch (token[0]) {
+ case '-':
+ memcg->tier_mask &= ~mask;
+ break;
+ case '+':
+ memcg->tier_mask |= mask;
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ if (ret)
+ goto err;
+ }
+
+sync:
+ swap_tiers_memcg_sync_mask(memcg);
+err:
+ if (ret)
+ memcg->tier_mask = original_mask;
+ spin_unlock(&swap_tier_lock);
+ return ret ? ret : nbytes;
+}
+
+static int swap_tier_effective_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ swap_tiers_mask_show(m, memcg->tier_effective_mask);
+ return 0;
+}
+
static struct cftype swap_files[] = {
{
.name = "swap.current",
@@ -5384,6 +5468,17 @@ static struct cftype swap_files[] = {
.file_offset = offsetof(struct mem_cgroup, swap_events_file),
.seq_show = swap_events_show,
},
+ {
+ .name = "swap.tiers",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = swap_tier_show,
+ .write = swap_tier_write,
+ },
+ {
+ .name = "swap.tiers.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = swap_tier_effective_show,
+ },
{ } /* terminate */
};
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 513d74dc1709..b61ac73d4963 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -961,6 +961,7 @@ static ssize_t tiers_store(struct kobject *kobj,
char *p, *token, *name, *tmp;
int ret = 0;
short prio;
+ int mask = 0;
tmp = kstrdup(buf, GFP_KERNEL);
if (!tmp)
@@ -993,7 +994,7 @@ static ssize_t tiers_store(struct kobject *kobj,
goto restore;
break;
case '-':
- ret = swap_tiers_remove(token + 1);
+ ret = swap_tiers_remove(token + 1, &mask);
if (ret)
goto restore;
break;
@@ -1003,7 +1004,7 @@ static ssize_t tiers_store(struct kobject *kobj,
}
}
- if (!swap_tiers_update()) {
+ if (!swap_tiers_update(mask)) {
ret = -EINVAL;
goto restore;
}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 91aac55d3a8b..64365569b970 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -244,7 +244,7 @@ int swap_tiers_add(const char *name, int prio)
return ret;
}
-int swap_tiers_remove(const char *name)
+int swap_tiers_remove(const char *name, int *mask)
{
int ret = 0;
struct swap_tier *tier;
@@ -267,6 +267,7 @@ int swap_tiers_remove(const char *name)
list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
swap_tier_inactivate(tier);
+ *mask |= TIER_MASK(tier);
return ret;
}
@@ -327,7 +328,24 @@ void swap_tiers_assign_dev(struct swap_info_struct *swp)
swp->tier_mask = TIER_DEFAULT_MASK;
}
-bool swap_tiers_update(void)
+/*
+ * When a tier is removed, set its bit in every memcg's tier_mask and
+ * tier_effective_mask. This prevents stale tier indices from being
+ * silently filtered out if the same index is reused later.
+ */
+static void swap_tier_memcg_propagate(int mask)
+{
+ struct mem_cgroup *child;
+
+ rcu_read_lock();
+ for_each_mem_cgroup_tree(child, root_mem_cgroup) {
+ child->tier_mask |= mask;
+ child->tier_effective_mask |= mask;
+ }
+ rcu_read_unlock();
+}
+
+bool swap_tiers_update(int mask)
{
struct swap_tier *tier;
struct swap_info_struct *swp;
@@ -357,6 +375,77 @@ bool swap_tiers_update(void)
break;
swap_tiers_assign_dev(swp);
}
+ /*
+ * XXX: Bits of unused tier indices default to ON in every memcg.
+ * Re-set the bits of the tiers removed by this write so that a tier
+ * added later at the same index does not inherit a stale disable.
+ * (Memcgs could hold tier references instead, but it is better to
+ * keep cgroup configuration independent.)
+ */
+ if (mask)
+ swap_tier_memcg_propagate(mask);
return true;
}
+
+void swap_tiers_mask_show(struct seq_file *m, int mask)
+{
+ struct swap_tier *tier;
+
+ spin_lock(&swap_tier_lock);
+ for_each_active_tier(tier) {
+ if (mask & TIER_MASK(tier))
+ seq_printf(m, "%s ", tier->name);
+ }
+ spin_unlock(&swap_tier_lock);
+ seq_puts(m, "\n");
+}
+
+int swap_tiers_mask_lookup(const char *name)
+{
+ struct swap_tier *tier;
+
+ lockdep_assert_held(&swap_tier_lock);
+
+ for_each_active_tier(tier) {
+ if (!strcmp(name, tier->name))
+ return TIER_MASK(tier);
+ }
+
+ return 0;
+}
+
+static void __swap_tier_memcg_inherit_mask(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int effective_mask
+ = parent ? parent->tier_effective_mask : TIER_ALL_MASK;
+
+ memcg->tier_effective_mask
+ = effective_mask & memcg->tier_mask;
+}
+
+/* Computes the initial effective mask from the parent's effective mask. */
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ spin_lock(&swap_tier_lock);
+ rcu_read_lock();
+ __swap_tier_memcg_inherit_mask(memcg, parent);
+ rcu_read_unlock();
+ spin_unlock(&swap_tier_lock);
+}
+
+/*
+ * Called when a memcg's tier_mask is modified. Walks the subtree
+ * and recomputes each descendant's effective mask against its parent.
+ */
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *child;
+
+ lockdep_assert_held(&swap_tier_lock);
+
+ rcu_read_lock();
+ for_each_mem_cgroup_tree(child, memcg)
+ __swap_tier_memcg_inherit_mask(child, parent_mem_cgroup(child));
+ rcu_read_unlock();
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index 6f281e95ed81..329c6a4f375f 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -10,21 +10,65 @@ struct swap_info_struct;
extern spinlock_t swap_tier_lock;
-#define TIER_ALL_MASK (~0)
-#define TIER_DEFAULT_IDX (31)
-#define TIER_DEFAULT_MASK (1 << TIER_DEFAULT_IDX)
-
/* Initialization and application */
void swap_tiers_init(void);
ssize_t swap_tiers_sysfs_show(char *buf);
int swap_tiers_add(const char *name, int prio);
-int swap_tiers_remove(const char *name);
+int swap_tiers_remove(const char *name, int *mask);
void swap_tiers_snapshot(void);
void swap_tiers_snapshot_restore(void);
-bool swap_tiers_update(void);
+bool swap_tiers_update(int mask);
/* Tier assignment */
void swap_tiers_assign_dev(struct swap_info_struct *swp);
+
+#ifdef CONFIG_SWAP
+/* Memcg related functions */
+void swap_tiers_mask_show(struct seq_file *m, int mask);
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent);
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg);
+#else
+static inline void swap_tiers_mask_show(struct seq_file *m, int mask) {}
+static inline void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent) {}
+static inline void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) {}
+static inline void __swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) {}
+#endif
+
+/* Mask and tier lookup */
+int swap_tiers_mask_lookup(const char *name);
+
+/**
+ * swap_tiers_mask_test - Check whether two tier masks overlap
+ * @tier_mask: The tier mask to check
+ * @mask: The mask to compare against
+ *
+ * Return: true if the masks share at least one set bit, false otherwise
+ */
+static inline bool swap_tiers_mask_test(int tier_mask, int mask)
+{
+ return tier_mask & mask;
+}
+
+#define TIER_ALL_MASK (~0)
+#define TIER_DEFAULT_IDX (31)
+#define TIER_DEFAULT_MASK (1 << TIER_DEFAULT_IDX)
+
+#ifdef CONFIG_MEMCG
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+ struct mem_cgroup *memcg = folio_memcg(folio);
+
+ return memcg ? memcg->tier_effective_mask : TIER_ALL_MASK;
+}
+#else
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+ return TIER_ALL_MASK;
+}
+#endif
+
#endif /* _SWAP_TIER_H */
--
2.34.1
* [PATCH v4 4/4] mm: swap: filter swap allocation by memcg tier mask
2026-02-17 0:09 [PATCH v4 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
` (2 preceding siblings ...)
2026-02-17 0:09 ` [PATCH v4 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
@ 2026-02-17 0:09 ` Youngjun Park
3 siblings, 0 replies; 7+ messages in thread
From: Youngjun Park @ 2026-02-17 0:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Chris Li, linux-mm, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Michal Koutný,
gunho.lee, taejoon.song, austin.kim, youngjun.park
Apply memcg tier effective mask during swap slot allocation to
enforce per-cgroup swap tier restrictions.
In the fast path, check the percpu cached swap_info's tier_mask
against the folio's effective mask. If it does not match, fall
through to the slow path. In the slow path, skip swap devices
whose tier_mask is not covered by the folio's effective mask.
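As a rough illustration of the slow-path filtering, the following userspace
sketch walks devices in priority order and skips any whose tier bit is not
in the effective mask. `struct swap_dev`, `tier_mask_test()` and
`pick_dev()` are hypothetical names for this example only; locking, plist
rotation and cluster handling are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for swap_info_struct. */
struct swap_dev {
	int prio;
	int tier_mask;           /* one tier bit per device */
	unsigned int free_slots;
};

/* True when the device's tier bit is present in the allowed mask. */
static bool tier_mask_test(int tier_mask, int mask)
{
	return (tier_mask & mask) != 0;
}

/*
 * devs[] is assumed to be sorted highest priority first. Returns the
 * first eligible device with a free slot, or NULL if none qualifies.
 */
static struct swap_dev *pick_dev(struct swap_dev *devs, size_t n,
				 int effective_mask)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!tier_mask_test(devs[i].tier_mask, effective_mask))
			continue;	/* tier not allowed for this cgroup */
		if (devs[i].free_slots == 0)
			continue;	/* device is full */
		devs[i].free_slots--;
		return &devs[i];
	}
	return NULL;
}
```

A cgroup whose effective mask excludes the fastest tier therefore falls
through to the next device in priority order, and gets no slot at all when
none of its allowed tiers has a device with free space.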
This works correctly when there is only one non-rotational
device in the system and no devices share the same priority.
However, there are known limitations:
- When multiple non-rotational devices exist, percpu swap
caches from different memcg contexts may reference
mismatched tiers, causing unnecessary fast path misses.
- When multiple non-rotational devices are assigned to
different tiers and same-priority devices exist among
them, cluster-based rotation may not work correctly.
These edge cases do not affect the primary use case of
directing swap traffic per cgroup. Further optimization is
planned for future work.
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
mm/swapfile.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2f956b6a5edc..aff5e8407691 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1321,15 +1321,22 @@ static bool swap_alloc_fast(struct folio *folio)
struct swap_cluster_info *ci;
struct swap_info_struct *si;
unsigned int offset;
+ int mask = folio_tier_effective_mask(folio);
/*
* Once allocated, swap_info_struct will never be completely freed,
* so checking it's liveness by get_swap_device_info is enough.
*/
si = this_cpu_read(percpu_swap_cluster.si[order]);
+ if (!si || !swap_tiers_mask_test(si->tier_mask, mask) ||
+ !get_swap_device_info(si))
+ return false;
+
offset = this_cpu_read(percpu_swap_cluster.offset[order]);
- if (!si || !offset || !get_swap_device_info(si))
+ if (!offset) {
+ put_swap_device(si);
return false;
+ }
ci = swap_cluster_lock(si, offset);
if (cluster_is_usable(ci, order)) {
@@ -1348,10 +1355,14 @@ static bool swap_alloc_fast(struct folio *folio)
static void swap_alloc_slow(struct folio *folio)
{
struct swap_info_struct *si, *next;
+ int mask = folio_tier_effective_mask(folio);
spin_lock(&swap_avail_lock);
start_over:
plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+ if (!swap_tiers_mask_test(si->tier_mask, mask))
+ continue;
+
/* Rotate the device and switch to a new cluster */
plist_requeue(&si->avail_list, &swap_avail_head);
spin_unlock(&swap_avail_lock);
--
2.34.1
* Re: [PATCH v4 3/4] mm: memcontrol: add interfaces for swap tier selection
2026-02-17 0:09 ` [PATCH v4 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
@ 2026-02-17 12:18 ` kernel test robot
0 siblings, 0 replies; 7+ messages in thread
From: kernel test robot @ 2026-02-17 12:18 UTC (permalink / raw)
To: Youngjun Park, Andrew Morton
Cc: oe-kbuild-all, Linux Memory Management List, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Michal Koutný,
gunho.lee, taejoon.song, austin.kim, youngjun.park
Hi Youngjun,
kernel test robot noticed the following build errors:
[auto build test ERROR on 776250964cbaa49ebe6b8bb2870765cc89cece59]
url: https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20260217-121406
base: 776250964cbaa49ebe6b8bb2870765cc89cece59
patch link: https://lore.kernel.org/r/20260217000950.4015880-4-youngjun.park%40lge.com
patch subject: [PATCH v4 3/4] mm: memcontrol: add interfaces for swap tier selection
config: powerpc64-randconfig-001-20260217 (https://download.01.org/0day-ci/archive/20260217/202602172046.jWVum2TN-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260217/202602172046.jWVum2TN-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602172046.jWVum2TN-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/swap_tier.c: In function 'swap_tier_memcg_propagate':
>> mm/swap_tier.c:342:8: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
child->tier_mask |= mask;
^~
vim +342 mm/swap_tier.c
330
331 /*
332 * When a tier is removed, set its bit in every memcg's tier_mask and
333 * tier_effective_mask. This prevents stale tier indices from being
334 * silently filtered out if the same index is reused later.
335 */
336 static void swap_tier_memcg_propagate(int mask)
337 {
338 struct mem_cgroup *child;
339
340 rcu_read_lock();
341 for_each_mem_cgroup_tree(child, root_mem_cgroup) {
> 342 child->tier_mask |= mask;
343 child->tier_effective_mask |= mask;
344 }
345 rcu_read_unlock();
346 }
347
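For readers unfamiliar with this class of error, a minimal standalone
sketch follows. It assumes the failing configuration exposes only a forward
declaration of `struct mem_cgroup` to mm/swap_tier.c; whether that is the
actual cause here has not been confirmed. `struct memcg` and
`memcg_enable_tiers()` are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>

struct memcg;	/* forward declaration only: an incomplete type */

/* Pointers to an incomplete type are fine in prototypes... */
static void memcg_enable_tiers(struct memcg *cg, int mask);

/*
 * ...but members can be accessed only where the full definition is
 * visible. Moving the function body above this definition would
 * reproduce "dereferencing pointer to incomplete type".
 */
struct memcg {
	int tier_mask;
	int tier_effective_mask;
};

static void memcg_enable_tiers(struct memcg *cg, int mask)
{
	assert(cg != NULL);

	cg->tier_mask |= mask;
	cg->tier_effective_mask |= mask;
}
```

In general the remedy is to make the full definition visible at the point
of use, or to keep the member accesses in code that is built only when the
definition exists.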
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v4 1/4] mm: swap: introduce swap tier infrastructure
2026-02-17 0:09 ` [PATCH v4 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2026-02-17 15:27 ` kernel test robot
0 siblings, 0 replies; 7+ messages in thread
From: kernel test robot @ 2026-02-17 15:27 UTC (permalink / raw)
To: Youngjun Park, Andrew Morton
Cc: llvm, oe-kbuild-all, Linux Memory Management List, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Michal Koutný,
gunho.lee, taejoon.song, austin.kim, youngjun.park
Hi Youngjun,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 776250964cbaa49ebe6b8bb2870765cc89cece59]
url: https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20260217-121406
base: 776250964cbaa49ebe6b8bb2870765cc89cece59
patch link: https://lore.kernel.org/r/20260217000950.4015880-2-youngjun.park%40lge.com
patch subject: [PATCH v4 1/4] mm: swap: introduce swap tier infrastructure
config: hexagon-randconfig-002-20260217 (https://download.01.org/0day-ci/archive/20260217/202602172319.SHQ7btgd-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project e86750b29fa0ff207cd43213d66dabe565417638)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260217/202602172319.SHQ7btgd-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602172319.SHQ7btgd-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/swap_tier.c:118:10: warning: format specifies type 'long' but the argument has type '__ptrdiff_t' (aka 'int') [-Wformat]
116 | len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
| ~~~~~
| %-5td
117 | tier->name,
118 | TIER_IDX(tier),
| ^~~~~~~~~~~~~~
mm/swap_tier.c:33:24: note: expanded from macro 'TIER_IDX'
33 | #define TIER_IDX(tier) ((tier) - swap_tiers)
| ^~~~~~~~~~~~~~~~~~~~~
1 warning generated.
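The warning arises because subtracting two pointers yields a `ptrdiff_t`,
whose width differs between architectures, so `%ld` is only correct where
`ptrdiff_t` happens to be `long`. A minimal userspace sketch of the `%td`
form suggested above, using a hypothetical `struct tier` table and
`format_tier()` helper rather than the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical tier table, standing in for swap_tiers[]. */
struct tier {
	const char *name;
	int prio;
};

static struct tier tiers[4] = {
	{ "nvme", 100 }, { "ssd", 50 }, { "hdd", 0 }, { "net", -1 },
};

/* Pointer difference: the result type is ptrdiff_t, not long. */
#define TIER_IDX(t) ((t) - tiers)

/* Formats one row; "%td" matches ptrdiff_t on every architecture. */
static int format_tier(char *buf, size_t size, const struct tier *t)
{
	assert(t >= tiers && t < tiers + 4);

	return snprintf(buf, size, "%-16s %-5td %-11d\n",
			t->name, TIER_IDX(t), t->prio);
}
```

Casting the index to a fixed type such as `int` and printing it with `%d`
would also silence the warning; `%td` simply avoids the cast.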
vim +118 mm/swap_tier.c
105
106 ssize_t swap_tiers_sysfs_show(char *buf)
107 {
108 struct swap_tier *tier;
109 ssize_t len = 0;
110
111 len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
112 "Name", "Idx", "PrioStart", "PrioEnd");
113
114 spin_lock(&swap_tier_lock);
115 for_each_active_tier(tier) {
116 len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
117 tier->name,
> 118 TIER_IDX(tier),
119 tier->prio,
120 TIER_END_PRIO(tier));
121 }
122 spin_unlock(&swap_tier_lock);
123
124 return len;
125 }
126
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki