From: Youngjun Park <youngjun.park@lge.com>
To: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org
Cc: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
	Barry Song <baohua@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	gunho.lee@lge.com, taejoon.song@lge.com, austin.kim@lge.com,
	youngjun.park@lge.com
Subject: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Date: Mon, 26 Jan 2026 15:52:37 +0900
Message-ID: <20260126065242.1221862-1-youngjun.park@lge.com>

This is the second version of the RFC for the "Swap Tiers" concept.
Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

This version incorporates feedback received during LPC 2025 and addresses
comments from the previous review. We have also included experimental
results based on usage scenarios intended for our internal platforms.

Motivation & Concept recap
==========================
Current Linux swap allocation is global, limiting the ability to assign
faster devices to specific cgroups. Our initial attempt at per-cgroup
priorities proved over-engineered and caused LRU inversion.

Following Chris Li's suggestion, we pivoted to "Swap Tiers." A tier is
simply a user-named group of swap devices sharing the same priority range.
This abstraction facilitates swap device selection based on speed, allowing
users to configure specific tiers for cgroups.

For more details, please refer to the LPC 2025 presentation
https://lpc.events/event/19/contributions/2141/attachments/1857/3998/LPC2025Finalss.pdf
or the v1 patch.

Changes in v2
=============
1. Respect cgroup hierarchy principle (LPC 2025 feedback)
- The logic now strictly follows standard cgroup hierarchy principles.

Previous: Children could select any tier using "+" regardless of the
parent's configuration. A tier selected with "+" was referenced, so it
could not silently disappear.

Current: The explicit selection ("+") concept is removed. By
default, all tiers are selected. Users now use "-" to exclude specific
tiers. An excluded tier can silently disappear.
A child cgroup's tiers are always a subset of its parent's. Even if a child
re-enables a tier with "+" that was excluded by the parent, the effective
tier list is limited to the parent's allowed subset.

Example:
Global Tiers: SSD, HDD, NET
Parent: SSD, NET (HDD excluded)
Child: HDD, NET (SSD excluded)
-> Effective Child Tier: NET (Intersection of Parent and Child)
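
The intersection rule in the example above can be sketched with plain
bitmasks. This is only an illustration: the bit assignments and the helper
names (effective_tiers, exclude_tiers) are hypothetical and are not taken
from the patches.

```c
#include <stdint.h>

/* Hypothetical bit assignment for the three global tiers in the example. */
enum {
	TIER_SSD = 1u << 0,
	TIER_HDD = 1u << 1,
	TIER_NET = 1u << 2,
	TIER_ALL = TIER_SSD | TIER_HDD | TIER_NET,
};

/* Selection that results from excluding ("-") tiers from the default of ALL. */
static inline uint32_t exclude_tiers(uint32_t excluded)
{
	return TIER_ALL & ~excluded;
}

/*
 * A cgroup's effective tiers are its own selection limited to what its
 * parent effectively allows, so a child can never gain a tier that the
 * parent lacks.
 */
static inline uint32_t effective_tiers(uint32_t parent_effective,
				       uint32_t own_selection)
{
	return parent_effective & own_selection;
}
```

With the parent excluding HDD and the child excluding SSD, the parent's
effective mask is SSD|NET and the child's is NET, which matches the example.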

2. Simplified swap_tier structure (Chris Li)
- Replaced 'end prio' and priority lists with standard list_head.

3. Reference counting removed
- Removed refcount for swap_tiers. Liveness is now guaranteed by checking
if the swap device is in use.
- Since the default selection is "ALL", holding references for disappearing
tiers is unnecessary. Only exclusions ("-") matter.

4. Support mixed operation (/sys/kernel/mm/swap/tiers) (Chris Li)
- Supports mixed use of "+" and "-" operations.

5. Add modify operation
- Introduced an operation to modify the priority of existing tiers.
Format: "tier_name:priority"
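
As a rough user-space illustration of the "tier_name:priority" format, the
sketch below splits one such token into a name and a priority. The function
name parse_tier_modify is hypothetical, and the accepted priority range is
an assumption: the valid range is not stated here, so the sketch only
rejects values that do not fit in an int.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "tier_name:priority" into its two parts.
 * Returns 0 on success and -EINVAL on malformed input.
 * `name` receives a NUL-terminated copy of the tier name.
 */
static int parse_tier_modify(const char *token, char *name, size_t name_len,
			     int *prio)
{
	const char *colon = strchr(token, ':');
	char *end;
	long val;
	size_t len;

	/* Require a separator and a non-empty name before it. */
	if (!colon || colon == token)
		return -EINVAL;

	len = (size_t)(colon - token);
	if (len >= name_len)
		return -EINVAL;

	/* Require a complete decimal number after the separator. */
	errno = 0;
	val = strtol(colon + 1, &end, 10);
	if (errno || end == colon + 1 || *end != '\0')
		return -EINVAL;
	if (val < INT_MIN || val > INT_MAX)
		return -EINVAL;

	memcpy(name, token, len);
	name[len] = '\0';
	*prio = (int)val;
	return 0;
}
```

For example, the token "SSD:100" (an illustrative value) yields the name
"SSD" and priority 100, while "SSD", ":100" and "SSD:" are all rejected.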

6. Restore swap device cluster allocation rule on same swap device priority (Kairui Song, Baoquan He)
- Preserve the existing fastpath and slowpath swap allocation logic by using
a percpu swap device cache.
- The swap device cluster allocation rule is therefore preserved for swap
devices that share the same priority.

7. Remove compile time selection (Chris Li)
- Removed CONFIG_SWAP_TIER. This is now a base kernel feature.

8. Cgroup tier calculation logic update
- The effective swap tier for a cgroup is now calculated at the time of
configuration (writing to swap.tiers), rather than at the time of swap
allocation.
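
The difference between the two approaches can be sketched as follows: the
hierarchy is walked once when a selection is written, and the allocation
path only reads a cached mask. The struct cg_node type and all function
names are hypothetical stand-ins, not the kernel's struct mem_cgroup or the
code in this series.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for a cgroup in this sketch. */
struct cg_node {
	uint32_t selected;	/* tiers this cgroup has not excluded */
	uint32_t effective;	/* cached result, read at allocation time */
	struct cg_node *children[MAX_CHILDREN];
	size_t nr_children;
};

/* Recompute the cached mask for a node and all of its descendants. */
static void propagate(struct cg_node *node, uint32_t parent_effective)
{
	node->effective = parent_effective & node->selected;
	for (size_t i = 0; i < node->nr_children; i++)
		propagate(node->children[i], node->effective);
}

/* Configuration-time path: runs when a cgroup's selection is written. */
static void set_selected(struct cg_node *node, uint32_t parent_effective,
			 uint32_t selected)
{
	node->selected = selected;
	propagate(node, parent_effective);
}

/* Allocation-time path: a plain read with no hierarchy walk. */
static inline uint32_t tiers_for_alloc(const struct cg_node *node)
{
	return node->effective;
}
```

The cost of the subtree walk is paid on the rare configuration write, which
keeps the frequent allocation path to a single load in this sketch.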

9. Commit reorganization (Chris Li)
- Commit order reorganized for clarity.

10. Documentation (Chris Li)
- Added documentation for the Swap Tiers concept and usage, as explained in
the RFC.

Apply and Benchmark
===================
1. Real-world Scenario: App Preloading
We applied this patchset to our embedded platform to enable application
preloading for faster launch times. Since the platform uses flash storage
as the default swap device, we aim to minimize swap usage on it to extend
its lifespan.

To achieve this, we utilized an idle device (configured via NBD) as a
separate swap tier specifically for these preloaded applications.

While it is self-evident that restoring from swap (warm launch) is faster
than a cold launch, the data below demonstrates the latency reduction
achieved in this environment.

Streaming App A:
Before (Cold Launch): 13.17s
After (Preloaded): 4.18s (68% reduction)

Streaming App B:
Before (Cold Launch): 5.60s
After (Preloaded): 1.12s (80% reduction)

E-commerce App C:
Before (Cold Launch): 10.25s
After (Preloaded): 2.00s (80% reduction)
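
For reference, the reduction percentages above follow from
(before - after) / before. A minimal helper (the name reduction_pct is only
for illustration) makes the arithmetic explicit:

```c
/* Relative latency reduction, in percent, going from `before` to `after`. */
static inline double reduction_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

Applied to the measurements above, this gives roughly 68.3% for App A,
80.0% for App B and 80.5% for App C.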

We plan to solidify and expand this usage.

2. Microbenchmarks
In response to feedback regarding potential regressions in the swap
fastpath (specifically concerning the overhead of global percpu clusters
vs. swap device percpu cache), we addressed this in RFC v2.

By preserving the existing fastpath and slowpath mechanisms via the per-cpu
swap device cache, we ensured that the performance characteristics remain
unchanged. The simple benchmark results below confirm there is no
significant difference.

A. Build kernel test:
Test using 128GB swapfile on Simulated SSD, Qemu VM with 4 CPUs, 4GB RAM,
avg of 5 test runs:

              Before        After
System time:  1584.20s      1590.74s (+0.41%)

Considering the deviation between max/min values, there seems to be no
significant difference.

B. vm-scalability
usemem --init-time -O -y -x -n 32 256M (qemu, 16G memory, global pressure,
simulated SSD as swap), avg of 5 test runs:

                           Before          After
System time:               588.48 s        592.15 s
Sum Throughput:            16.65 MB/s      15.95 MB/s
Single process Throughput: 0.52 MB/s       0.50 MB/s
Avg Free latency:          1098422.97 us   1106388.97 us

The results indicate that the performance remains stable with negligible
variance.

Any feedback is welcome.

Thanks,
Youngjun Park

Youngjun Park (5):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interface for swap tier selection
  mm, swap: change back to use each swap device's percpu cluster
  mm, swap: introduce percpu swap device cache to avoid fragmentation

 Documentation/admin-guide/cgroup-v2.rst |  27 ++
 Documentation/mm/swap-tier.rst          | 109 ++++++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   3 +-
 include/linux/swap.h                    |  17 +-
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  80 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  70 ++++
 mm/swap_tier.c                          | 452 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  70 ++++
 mm/swapfile.c                           | 132 +++----
 12 files changed, 900 insertions(+), 68 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
-- 
2.34.1

