From: youngjun.park@lge.com
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org,
	roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	shikemeng@huaweicloud.com, kasong@tencent.com, nphamcs@gmail.com,
	bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
	muchun.song@linux.dev, iamjoonsoo.kim@lge.com,
	taejoon.song@lge.com, gunho.lee@lge.com,
	Youngjun Park <youngjun.park@lge.com>
Subject: [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization
Date: Thu, 12 Jun 2025 19:37:42 +0900
Message-ID: <20250612103743.3385842-1-youngjun.park@lge.com>

From: Youngjun Park <youngjun.park@lge.com>

Introduction
============
I am a kernel developer working on platforms deployed in commercial consumer devices.
Real-world product requirements led us to modify the Linux kernel to support a new
swap management mechanism, which allows each cgroup to assign its own priorities to
swap devices.
I believe this mechanism can be generally useful in similarly constrained-device scenarios,
so I would like to propose it for upstream inclusion and solicit feedback from the community.

Motivation
==========
The core requirement was to improve application responsiveness and loading time, especially
for latency-critical applications, without adding RAM or storage hardware.
Device constraints:
  - Linux-based embedded platform
  - Limited system RAM
  - Small local swap
  - No option to expand RAM or local swap
To mitigate this, we explored utilizing idle RAM and storage from nearby devices as remote
swap space. To maximize its effectiveness, we needed the ability to control which swap devices
were used by different cgroups:
  - Assign faster local swap devices to latency-critical apps
  - Assign remote swap devices to background apps
However, the current Linux kernel swap infrastructure does not support per-cgroup swap
device assignment.
To solve this, I propose a mechanism that allows each cgroup to specify its own swap device
priorities.

Evaluated Alternatives
======================
1. **Per-cgroup dedicated swap devices**
   - Previously proposed upstream [1]
   - Challenges in managing global vs per-cgroup swap state
   - Difficult to integrate with existing memory.limit / swap.max semantics
2. **Multi-backend swap device with cgroup-aware routing**
   - Considered a layering violation (it requires cgroup awareness in the block device)
   - Swap devices are generally expected to be physical block devices
   - Similar idea mentioned in [2]
3. **Per-cgroup swap device enable/disable with swap usage control**
   - Extend swap.max along the lines of zswap.writeback
   - Discussed in the context of zswap writeback [3]
   - Cannot express arbitrary priority orderings
     (e.g. a global order A-B-C cannot become C-A-B for one cgroup)
   - Less flexible than per-device priority approach
4. **Per-namespace swap priority configuration**
   - Introduce a swap namespace that carries its own swap device priorities
   - Overly complex for our use case
   - Cgroups are the natural scope for this mechanism
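The limitation listed for alternative 3 can be checked mechanically: if a cgroup can only turn devices on or off, its swap order is always a subsequence of the global order. The minimal sketch below brute-forces every enable/disable mask over a global order A-B-C and confirms that C-A-B is never produced. It models ordering only (no zswap, no swap.max), and `filtered_order` and `mask_can_express` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Model of alternative 3 (hypothetical): a cgroup may only enable or
 * disable devices, so its order is the global order filtered by a bitmask.
 * Writes the surviving device letters to out[] and returns their count.
 */
static int filtered_order(const char *global, unsigned int mask, char *out)
{
	int n = 0;

	for (int i = 0; global[i] != '\0'; i++)
		if (mask & (1u << i))
			out[n++] = global[i];
	out[n] = '\0';
	return n;
}

/* True if some enable/disable mask over `global` yields exactly `wanted`. */
static bool mask_can_express(const char *global, const char *wanted)
{
	size_t len = strlen(global);
	char buf[33];

	assert(len < 32); /* keeps 1u << len well defined and buf large enough */
	for (unsigned int mask = 0; mask < (1u << len); mask++) {
		filtered_order(global, mask, buf);
		if (strcmp(buf, wanted) == 0)
			return true;
	}
	return false;
}
```

Every mask keeps the devices in their global relative order, so a reordering such as C-A-B requires per-device priorities rather than on/off control.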

Based on these findings, we chose to prototype per-cgroup swap priority configuration
as the most natural, least invasive extension of the existing kernel mechanisms.

Design and Semantics
====================
- Each swap device gets a unique ID at `swapon` time
- Each cgroup has a `memory.swap.priority` interface:
  - Reading it shows each swap device's unique ID
  - Format: `unique_id:priority,unique_id:priority,...`
  - All currently-active swap devices must be listed
  - Priorities follow existing swap semantics: higher values are used first
- The interface is writable and can be updated at runtime
- A priority configuration can be reset via `echo "" > memory.swap.priority`
- Swap on/off events propagate to all cgroups with priority configurations
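The write rules above can be sketched as a small user-space C parser. This is a minimal illustration under stated assumptions, not the patch's code: `parse_swap_priority` and `struct prio_entry` are hypothetical names, priorities are assumed to fit in a `short`, a trailing newline (as written by `echo`) is tolerated, and a write that omits an active device, repeats an ID, or names an unknown ID is assumed to be rejected with `-EINVAL`.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-cgroup entry: one per active swap device. */
struct prio_entry {
	int unique_id;
	int prio;
};

/*
 * Parse "id:prio,id:prio,..." into out[], which must hold nr_active entries.
 * Returns the number of entries (0 means "reset"), or -EINVAL if the input
 * is malformed, names an unknown or duplicate ID, or omits an active device.
 */
static int parse_swap_priority(const char *buf, const int *active_ids,
			       int nr_active, struct prio_entry *out)
{
	const char *p = buf;
	int n = 0;

	assert(buf != NULL && out != NULL);

	/* `echo "" > memory.swap.priority` writes a lone newline: reset. */
	if (buf[0] == '\0' || strcmp(buf, "\n") == 0)
		return 0;

	for (;;) {
		char *end;
		long id, prio;
		bool known = false;

		id = strtol(p, &end, 10);
		if (end == p || *end != ':')
			return -EINVAL;
		p = end + 1;
		prio = strtol(p, &end, 10);
		if (end == p || prio < SHRT_MIN || prio > SHRT_MAX)
			return -EINVAL;
		p = end;

		/* The ID must belong to an active device and appear only once. */
		for (int i = 0; i < nr_active; i++)
			if (active_ids[i] == id)
				known = true;
		for (int i = 0; i < n; i++)
			if (out[i].unique_id == id)
				return -EINVAL;
		if (!known)
			return -EINVAL;
		out[n].unique_id = (int)id;
		out[n].prio = (int)prio;
		n++;

		if (*p == ',') {
			p++;
			continue;
		}
		if (*p == '\n')
			p++;
		if (*p != '\0')
			return -EINVAL;
		break;
	}

	/* Every currently active swap device must be listed. */
	return n == nr_active ? n : -EINVAL;
}
```

Rejecting partial lists at write time keeps the per-cgroup table complete, so the swap-out path never has to guess a priority for an unlisted device.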

Example Usage
-------------
# currently active swap devices
$ swapon
NAME      TYPE      SIZE USED PRIO
/dev/sdb  partition 300M  0B   10
/dev/sdc  partition 300M  0B    5

# assign custom priorities in a cgroup
$ echo "1:5,2:10" > memory.swap.priority
$ cat memory.swap.priority
Active
/dev/sdb  unique:1  prio:5
/dev/sdc  unique:2  prio:10

# add a new swap device later
$ swapon /dev/sdd --priority -1
$ cat memory.swap.priority
Active
/dev/sdb  unique:1  prio:5
/dev/sdc  unique:2  prio:10
/dev/sdd  unique:3  prio:-2 

# reset cgroup priority
$ echo "" > memory.swap.priority
$ cat memory.swap.priority
Inactive
/dev/sdb  unique:1  prio:10
/dev/sdc  unique:2  prio:5
/dev/sdd  unique:3  prio:-2

Implementation Notes
====================
The following items will be addressed in the next revision:

- Replace the current workaround so that the per-CPU swap cluster is used as before
- Priority propagation to child cgroups
- Remaining TODO and XXX items in the code
- Refactoring for reviewability and maintainability, comprehensive testing
  and performance evaluation

Future Work
===========
These items would benefit from further consideration
and potential implementation:

- Swap prioritization at other granularities, such as per process
- Optional usage limits per swap device (e.g., ratio, max bytes)
- Generalizing the interface beyond cgroups

References
==========
[1] https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html
[2] https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com
[3] https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com

All comments and feedback are greatly appreciated.
Patch will follow.

Sincerely,
Youngjun Park

youngjun.park (2):
  mm/swap, memcg: basic structure and logic for per cgroup swap priority
    control
  mm: swap: apply per cgroup swap priority mechansim on swap layer

 include/linux/memcontrol.h |   3 +
 include/linux/swap.h       |  11 ++
 mm/Kconfig                 |   7 +
 mm/memcontrol.c            |  55 ++++++
 mm/swap.h                  |  18 ++
 mm/swap_cgroup_priority.c  | 335 +++++++++++++++++++++++++++++++++++++
 mm/swapfile.c              | 129 ++++++++++----
 7 files changed, 523 insertions(+), 35 deletions(-)
 create mode 100644 mm/swap_cgroup_priority.c

base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
-- 
2.34.1


