From: Youngjun Park <youngjun.park@lge.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Chris Li" <chrisl@kernel.org>,
linux-mm@kvack.org, "Kairui Song" <kasong@tencent.com>,
"Kemeng Shi" <shikemeng@huaweicloud.com>,
"Nhat Pham" <nphamcs@gmail.com>, "Baoquan He" <bhe@redhat.com>,
"Barry Song" <baohua@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Shakeel Butt" <shakeel.butt@linux.dev>,
"Muchun Song" <muchun.song@linux.dev>,
"Michal Koutný" <mkoutny@suse.com>,
gunho.lee@lge.com, taejoon.song@lge.com, austin.kim@lge.com,
youngjun.park@lge.com
Subject: [PATCH v4 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Date: Tue, 17 Feb 2026 09:09:46 +0900
Message-ID: <20260217000950.4015880-1-youngjun.park@lge.com>
This is the fourth version of the "Swap Tiers" concept.
Following Chris Li's suggestion to focus on small, mergeable
steps, this series covers the core tier infrastructure and
memcg-based tier assignment as a minimal usable feature set.
Further extensions are deferred to subsequent series.
Previous versions:
RFC v3: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/
RFC v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
RFC v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
Overview (Recap)
================
Swap Tiers enable grouping swap devices into named tiers based on
performance characteristics (e.g., NVMe, HDD, Network). This allows
faster devices to be dedicated to latency-sensitive workloads while
slower devices serve background tasks. The concept was suggested by
Chris Li.
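As a rough illustration of the grouping idea, the user-space sketch
below maps a swap device priority to a named tier. Everything in it is
hypothetical: struct tier, tier_for_prio(), the tier names and the
priority floors are assumptions made for the example and are not taken
from the series.

```c
#include <stddef.h>

/* Hypothetical model: a tier is a name plus the lowest swap
 * priority it covers. Not the structure used by the series. */
struct tier {
	const char *name;
	int min_prio;
};

/* Example tiers, sorted from highest to lowest min_prio. */
static const struct tier tiers[] = {
	{ "NVME", 100 },
	{ "HDD",   10 },
	{ "NET",    0 },
};

/* Return the index of the first tier whose floor is at or below
 * prio, or -1 if the priority is below every tier. */
static int tier_for_prio(int prio)
{
	size_t i;

	for (i = 0; i < sizeof(tiers) / sizeof(tiers[0]); i++)
		if (prio >= tiers[i].min_prio)
			return (int)i;
	return -1;
}
```

Under these assumed floors, a device with priority 50 lands in "HDD"
and a device with priority -1 belongs to no tier.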
Changes in v4
=============
- Simplified control flow to flatten indentation (Chris Li)
- Added CONFIG option for MAX_SWAPTIER with a small default of 4
(Chris Li)
- Added memory.swap.tiers.effective read interface, following cpuset
convention of splitting into configuration and effective files
(Michal Koutný)
- Refined cgroup documentation (Michal Koutný)
- Reworked save/restore logic into a clearer "snapshot and rollback"
model for improved readability and simpler control flow (Chris Li)
- Removed tier priority modification operation to reduce complexity;
may be revisited in a future series
- Added tier name validation: only alphanumeric characters and
underscores are allowed
- Fixed several edge case bugs
- Swap allocation logic: dropped the percpu global cluster swap cache
changes from this series. Integrating that cache into the swap
device will be handled as part of Kairui Song's ongoing work.
- Rebased onto latest mm-new
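The tier name rule in the list above (alphanumeric characters and
underscores only) can be sketched in user-space C as follows. The
function name tier_name_valid() and the rejection of empty names are
assumptions for illustration; the series may implement the check
differently.

```c
#include <ctype.h>
#include <stdbool.h>

/* Accept only non-empty names made of letters, digits and
 * underscores. The non-empty requirement is an assumption. */
static bool tier_name_valid(const char *name)
{
	const unsigned char *p = (const unsigned char *)name;

	if (!p || !*p)
		return false;
	for (; *p; p++)
		if (!isalnum(*p) && *p != '_')
			return false;
	return true;
}
```

With this check, "NVMe_0" is accepted while "fast-tier" and "a b" are
rejected.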
Deferred and future work:
- Per-tier swap_active_head to reduce contention across tiers when
releasing swap entries on different tiers (Chris Li). This is an
improvement to swap_avail_head / swap_active_head that must be done
eventually, but it is not critical for the initial infrastructure.
- Round-robin rotation cleanup (Kairui) will be proposed after
this series lands, as swap tiers can naturally abstract away
round-robin behavior: round-robin is unnecessary when no
equal-priority devices exist, so it could possibly be disabled, and
round-robin priority could be made selectable.
- BPF interfaces (Shakeel Butt) and uses beyond memcg (including
per-VMA, DAMON, etc.) are potential future extensions once the base
infrastructure is established and real-world use cases are
identified.
Changes in RFC v3
=================
- Fixed swap_alloc_fast() tier eligibility check
- Fixed tier_mask restoration on error paths
- Fixed priority -1 tier deletion bug
- Fixed !CONFIG_MEMCG build failures
- Improved commit messages
- Fixed improper error handling
- Fixed coding style violations
- Fixed tier deletion propagation to cgroups
Changes in RFC v2
=================
- Strict cgroup hierarchy compliance (LPC 2025 feedback)
- Percpu swap device cache to preserve fastpath performance
(Kairui Song, Baoquan He)
- Simplified tier structure (Chris Li)
- Removed explicit "+" selection; default is all tiers, use "-"
to exclude (Chris Li)
- Removed CONFIG_SWAP_TIER; now base kernel feature (Chris Li)
- Effective tier calculation moved to configuration time
(swap.tiers write)
- Mixed operation support for "+" and "-" in
/sys/kernel/mm/swap/tiers (Chris Li)
- Commit reorganization for clarity (Chris Li)
- Added tier priority modification support
- Added documentation for swap tiers concept and usage (Chris Li)
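The combination of "default is all tiers, use '-' to exclude" and
strict hierarchy compliance can be pictured as a mask computation done
when swap.tiers is written. The sketch below is an assumption about
those semantics, not code from the series: NR_TIERS, ALL_TIERS and
effective_mask() are hypothetical names, and one bit per tier is an
assumed encoding.

```c
#include <stdint.h>

/* Hypothetical: one bit per tier; 4 mirrors the small default
 * mentioned for MAX_SWAPTIER. */
#define NR_TIERS 4
#define ALL_TIERS ((uint32_t)((1u << NR_TIERS) - 1))

/* A child can only narrow what its parent allows: start from
 * the parent's effective mask and clear the excluded tiers. */
static uint32_t effective_mask(uint32_t parent_effective, uint32_t excluded)
{
	return parent_effective & ~excluded & ALL_TIERS;
}
```

For example, if a parent excludes tier 1 its effective mask becomes
0xD, and a child that additionally excludes tier 0 ends up with 0xC. A
child that excludes nothing simply inherits 0xD.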
Real-world Results
==================
App preloading on our internal platform using NBD as a separate tier.
Without a separate swap tier:
- Cannot selectively avoid the default flash swap, so flash wear and
the resulting lifespan issues cannot be reduced.
- Cannot selectively assign NBD to specific apps that need it.
Result (cold launch vs. preloaded):
- Streaming App A: 13.17s → 4.18s (68% faster)
- Streaming App B: 5.60s → 1.12s (80% faster)
- E-commerce App C: 10.25s → 2.00s (80% faster)
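The percentages above read as the reduction in launch time relative to
the cold launch. A small helper makes that arithmetic explicit; the
function names are illustrative only.

```c
/* Percentage of launch time removed: (before - after) / before. */
static double reduction_pct(double before_s, double after_s)
{
	return (before_s - after_s) / before_s * 100.0;
}

/* Speedup factor: how many times shorter the launch became. */
static double speedup(double before_s, double after_s)
{
	return before_s / after_s;
}
```

Applied to the three measurements, this gives reductions of about
68.3%, 80.0% and 80.5%, which correspond to speedups of roughly 3.2x,
5.0x and 5.1x.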
Performance validation against baseline (no tiers configured) shows
negligible overhead (<1%) in kernel build and vm-scalability
benchmarks. Detailed results are in the RFC v2 cover letter.
Youngjun Park (4):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interfaces for swap tier selection
  mm: swap: filter swap allocation by memcg tier mask

 Documentation/admin-guide/cgroup-v2.rst |  27 ++
 Documentation/mm/swap-tier.rst          | 159 +++++++++
 MAINTAINERS                             |   3 +
 include/linux/memcontrol.h              |   3 +-
 include/linux/swap.h                    |   1 +
 mm/Kconfig                              |  12 +
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  95 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  75 ++++
 mm/swap_tier.c                          | 451 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  74 ++++
 mm/swapfile.c                           |  22 +-
 13 files changed, 922 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h
base-commit: 776250964cbaa49ebe6b8bb2870765cc89cece59
--
2.34.1