From: Youngjun Park <youngjun.park@lge.com>
To: akpm@linux-foundation.org
Cc: chrisl@kernel.org, kasong@tencent.com, hannes@cmpxchg.org,
mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev,
shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
baohua@kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, gunho.lee@lge.com,
youngjun.park@lge.com, taejoon.song@lge.com
Subject: [RFC PATCH v3 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Date: Sat, 31 Jan 2026 21:54:49 +0900
Message-ID: <20260131125454.3187546-1-youngjun.park@lge.com>
This is the third version of the RFC for the "Swap Tiers" concept,
incorporating LPC 2025 feedback and subsequent bug fixes.
Previous approach: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
RFC v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
RFC v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
v3 fixes bugs found during testing and adds clarifications to
improve patch reviewability.
Overview (Recap)
================
Swap Tiers enable cgroup-based swap device assignment by grouping swap
devices into named tiers. This allows faster devices (e.g., SSD) to be
dedicated to latency-sensitive workloads while slower devices (e.g., HDD,
network) serve background tasks. The concept was suggested by Chris Li.
Key Changes after LPC 2025 (RFC v1)
===================================
The most significant change in v2 was adopting strict cgroup hierarchy
semantics based on LPC 2025 feedback.
v1 allowed children to explicitly select tiers ("+tier") regardless of
parent configuration, violating standard cgroup principles.
v2 enforces proper hierarchy: a child's usable tiers are always a subset
of its parent's. All tiers are enabled by default; use "-tier" to exclude one.
Example:
  Global:         SSD, HDD, NET
  Parent:         -HDD → uses SSD, NET
  Child:          -SSD → uses NET (intersection)
  If SSD deleted: Child uses NET (exclusions reset)
  If NEW added:   All cgroups use it by default
This ensures children cannot access resources denied by ancestors,
matching standard cgroup behavior.
For the detailed rationale, see the v2 RFC and the LPC presentation.
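The subset rule in the example above can be sketched with plain bitmasks.
This is only an illustration written for this summary, not code from the
series: the TIER_* bits, struct cg, effective_tiers() and delete_tier() are
hypothetical names, and "exclusions reset" is modeled here as clearing the
deleted tier's bit from every exclusion mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tier bits, fixed here only to keep the example short. */
enum {
	TIER_SSD = 1u << 0,
	TIER_HDD = 1u << 1,
	TIER_NET = 1u << 2,
	TIER_NEW = 1u << 3,
};

/* Illustrative stand-in for a cgroup: just the fields the example needs. */
struct cg {
	const struct cg *parent;	/* NULL for the top-level group */
	uint32_t excluded;		/* tiers removed with "-tier" */
};

/*
 * Start from every existing tier and drop each exclusion found on the
 * path to the root. A child can therefore never regain a tier that an
 * ancestor removed.
 */
static uint32_t effective_tiers(const struct cg *cg, uint32_t global)
{
	uint32_t mask = global;

	for (; cg; cg = cg->parent)
		mask &= ~cg->excluded;
	return mask;
}

/*
 * Assumed deletion semantics: remove the tier from the global set and
 * forget any exclusion that named it. Returns the new global set.
 */
static uint32_t delete_tier(uint32_t global, struct cg **cgs, size_t n,
			    uint32_t tier)
{
	assert(tier && (tier & (tier - 1)) == 0);	/* exactly one bit */

	for (size_t i = 0; i < n; i++)
		cgs[i]->excluded &= ~tier;
	return global & ~tier;
}
```

With Global = SSD|HDD|NET, a parent excluding HDD and a child excluding SSD,
effective_tiers() gives SSD|NET for the parent and NET for the child. A tier
added later is simply OR-ed into the global set, so every cgroup sees it
until it writes an exclusion for it.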
Changes in RFC v3
=================
- Fixed swap_alloc_fast() tier eligibility check
- Fixed tier_mask restoration on error paths
- Fixed priority -1 tier deletion bug
- Fixed !CONFIG_MEMCG build failures
- Improved commit messages
- Fixed improper error handling
- Fixed coding style violations
- Fixed tier deletion propagation to cgroups
Changes in RFC v2
=================
- Strict cgroup hierarchy compliance (LPC 2025 feedback)
- Percpu swap device cache to preserve fastpath performance (Kairui Song, Baoquan He)
- Simplified tier structure (Chris Li)
- Removed explicit "+" selection; default is all tiers, use "-" to exclude (Chris Li)
- Removed CONFIG_SWAP_TIER; now base kernel feature (Chris Li)
- Effective tier calculation moved to configuration time (swap.tiers write)
- Mixed operation support for "+" and "-" in /sys/kernel/mm/swap/tiers (Chris Li)
- Commit reorganization for clarity (Chris Li)
- Added tier priority modification support
- Added documentation for swap tiers concept and usage (Chris Li)
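To illustrate the "effective tier calculation moved to configuration time"
item, here is a minimal sketch of the idea: recompute a cached mask for the
whole subtree when exclusions are written, so the allocation path only tests
a bit. The struct cg_node layout, MAX_CHILDREN and all function names below
are assumptions made for this example and do not reflect the actual patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4	/* arbitrary small bound for the sketch */

/* Hypothetical cgroup node holding its own and its cached state. */
struct cg_node {
	struct cg_node *children[MAX_CHILDREN];
	size_t nr_children;
	uint32_t excluded;	/* this group's own "-tier" exclusions */
	uint32_t effective;	/* cached result, read on the hot path */
};

static void add_child(struct cg_node *parent, struct cg_node *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	parent->children[parent->nr_children++] = child;
}

/* Recompute the cached mask for node and everything below it. */
static void propagate(struct cg_node *node, uint32_t parent_effective)
{
	node->effective = parent_effective & ~node->excluded;
	for (size_t i = 0; i < node->nr_children; i++)
		propagate(node->children[i], node->effective);
}

/* Models a configuration write: store exclusions, refresh the subtree. */
static void write_exclusions(struct cg_node *node, uint32_t parent_effective,
			     uint32_t excluded)
{
	node->excluded = excluded;
	propagate(node, parent_effective);
}

/* Hot path: no hierarchy walk, only a bit test on the cached mask. */
static int tier_allowed(const struct cg_node *node, uint32_t tier)
{
	return (node->effective & tier) != 0;
}
```

The trade-off in this sketch is that a write costs time proportional to the
size of the subtree, while each allocation-time check stays constant-time.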
Real-world Results
==================
Use case: app preloading on our internal platform, using NBD as a separate tier.
(This is our first real-world use case. We plan to refine and expand this usage.)
Without a separate swap tier:
- We cannot selectively avoid the default flash swap, so we cannot reduce flash wear and the resulting lifespan issues.
- We cannot selectively assign NBD to the specific apps that need it.
Results (launch time, cold launch → preloaded):
- Streaming App A: 13.17s → 4.18s (68% less time)
- Streaming App B: 5.60s → 1.12s (80% less time)
- E-commerce App C: 10.25s → 2.00s (80% less time)
Performance validation against baseline (no tiers configured) shows
negligible overhead (<1%) in kernel build and vm-scalability benchmarks.
Detailed results are in the v2 cover letter.
Any feedback is welcome.
Youngjun Park
Youngjun Park (5):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interface for swap tier selection
  mm, swap: change back to use each swap device's percpu cluster
  mm, swap: introduce percpu swap device cache to avoid fragmentation

 Documentation/admin-guide/cgroup-v2.rst |  27 ++
 Documentation/mm/swap-tier.rst          | 109 ++++++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   3 +-
 include/linux/swap.h                    |  17 +-
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  85 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  72 ++++
 mm/swap_tier.c                          | 469 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  84 +++++
 mm/swapfile.c                           | 133 +++----
 12 files changed, 938 insertions(+), 69 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h
base-commit: 5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
--
2.34.1