From: YoungJun Park <youngjun.park@lge.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, shikemeng@huaweicloud.com,
nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org,
chrisl@kernel.org, muchun.song@linux.dev, iamjoonsoo.kim@lge.com,
taejoon.song@lge.com, gunho.lee@lge.com
Subject: Re: [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization
Date: Fri, 13 Jun 2025 15:56:12 +0900 [thread overview]
Message-ID: <aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330> (raw)
In-Reply-To: <CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com>
On Thu, Jun 12, 2025 at 08:24:08PM +0800, Kairui Song wrote:
> On Thu, Jun 12, 2025 at 6:38 PM <youngjun.park@lge.com> wrote:
> >
> > From: Youngjun Park <youngjun.park@lge.com>
> >
> > Introduction
> > ============
> > I am a kernel developer working on platforms deployed in commercial consumer devices.
> > Due to real-world product requirements, I needed to modify the Linux kernel to support
> > a new swap management mechanism. The proposed mechanism allows assigning different swap
> > priorities to swap devices per cgroup.
> > I believe this mechanism can be generally useful in similar constrained-device scenarios,
> > and I would like to propose it for upstream inclusion and solicit feedback from the community.
> >
> > Motivation
> > ==========
> > The core requirement was to improve application responsiveness and loading time, especially
> > for latency-critical applications, without adding RAM or storage hardware resources.
> > Device constraints:
> > - Linux-based embedded platform
> > - Limited system RAM
> > - Small local swap
> > - No option to expand RAM or local swap
> > To mitigate this, we explored utilizing idle RAM and storage from nearby devices as remote
> > swap space. To maximize its effectiveness, we needed the ability to control which swap devices
> > were used by different cgroups:
> > - Assign faster local swap devices to latency-critical apps
> > - Assign remote swap devices to background apps
> > However, the current Linux kernel swap infrastructure does not support per-cgroup swap
> > device assignment.
> > To solve this, I propose a mechanism to allow each cgroup to specify its own swap device
> > priorities.
> >
> > Evaluated Alternatives
> > ======================
> > 1. **Per-cgroup dedicated swap devices**
> > - Previously proposed upstream [1]
> > - Challenges in managing global vs per-cgroup swap state
> > - Difficult to integrate with existing memory.limit / swap.max semantics
> > 2. **Multi-backend swap device with cgroup-aware routing**
> > - Considered a layering violation of sorts (block device cgroup awareness)
> > - Swap devices are commonly meant to be physical block devices.
> > - Similar idea mentioned in [2]
> > 3. **Per-cgroup swap device enable/disable with swap usage control**
> > - Expand swap.max with zswap.writeback usage
> > - Discussed in context of zswap writeback [3]
> > - Cannot express arbitrary priority orderings
> > (e.g. a global priority order A-B-C with a per-cgroup order C-A-B is impossible)
> > - Less flexible than per-device priority approach
> > 4. **Per-namespace swap priority configuration**
> > - In short, a swap namespace that holds the swap device priority order
> > - Overly complex for our use case
> > - Cgroups are the natural scope for this mechanism
> >
> > Based on these findings, we chose to prototype per-cgroup swap priority configuration
> > as the most natural, least invasive extension of the existing kernel mechanisms.
> >
> > Design and Semantics
> > ====================
> > - Each swap device gets a unique ID at `swapon` time
> > - Each cgroup has a `memory.swap.priority` interface:
> > - Reading the interface shows each device's unique ID and priority
> > - Format: `unique_id:priority,unique_id:priority,...`
> > - All currently-active swap devices must be listed
> > - Priorities follow existing swap infrastructure semantics
> > - The interface is writeable and updatable at runtime
> > - A priority configuration can be reset via `echo "" > memory.swap.priority`
> > - Swap on/off events propagate to all cgroups with priority configurations
> >
> > Example Usage
> > -------------
> > # swap device on
> > $ swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/sdb partition 300M 0B 10
> > /dev/sdc partition 300M 0B 5
> >
> > # assign custom priorities in a cgroup
> > $ echo "1:5,2:10" > memory.swap.priority
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb unique:1 prio:5
> > /dev/sdc unique:2 prio:10
> >
> > # adding new swap device later
> > $ swapon /dev/sdd --priority -1
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb unique:1 prio:5
> > /dev/sdc unique:2 prio:10
> > /dev/sdd unique:3 prio:-2
> >
> > # reset cgroup priority
> > $ echo "" > memory.swap.priority
> > $ cat memory.swap.priority
> > Inactive
> > /dev/sdb unique:1 prio:10
> > /dev/sdc unique:2 prio:5
> > /dev/sdd unique:3 prio:-2
> >
> > Implementation Notes
> > ====================
> > The items below will be addressed in the next iteration of this patch set:
> >
> > - Keep the workaround using the per-cpu swap cluster, as before
> > - Priority propagation to child cgroups
> > - The remaining TODO and XXX items
> > - Refactoring for reviewability and maintainability, plus comprehensive testing
> > and performance evaluation
>
> Hi Youngjun,
>
> Interesting idea. For your current approach, I think all we need is
> per-cgroup swap meta info structures (and infrastructure for maintaining
> and manipulating them).
>
> So we have a global version and a cgroup version of "plist, next
> cluster list, and maybe something else", right? And then
> once the allocator is folio aware it can just prefer the cgroup ones
> (as I mentioned in another reply) reusing all the same other
> routines. Changes are minimal, the cgroup swap meta infos
> and control plane are separately maintained.
>
> It seems to align quite well with what I wanted to do, and it can be done
> in a clean and easy-to-maintain way.
>
> Meanwhile with virtual swap, things could be even more flexible, not
> only changing the priority at swapout time, it will also provide
> capabilities to migrate and balance devices adaptively, and solve long
> term issues like mTHP fragmentation and min-order swapout etc..
>
> Maybe they can be combined, like maybe cgroup can be limited to use
> the virtual device or physical ones depending on priority. Seems all
> solvable. Just some ideas here.
I had been thinking about how this work would align with vswap,
so I'm glad to hear that they can harmonize.
> Vswap can cover the priority part too. I think we might want to avoid
> duplicated interfaces.
>
> So I'm just imagining things now, will it be good if we have something
> like (following your design):
>
> $ cat memcg1/memory.swap.priority
> Active
> /dev/vswap:(zram/zswap? with compression params?) unique:0 prio:5
>
> $ cat memcg2/memory.swap.priority
> Active
> /dev/vswap:/dev/nvme1 unique:1 prio:5
> /dev/vswap:/dev/nvme2 unique:2 prio:10
> /dev/vswap:/dev/vda unique:3 prio:15
> /dev/sda unique:4 prio:20
>
> $ cat memcg3/memory.swap.priority
> Active
> /dev/vda unique:3 prio:5
> /dev/sda unique:4 prio:15
>
> Meaning memcg1 (high priority) is allowed to use compressed memory
> only through vswap, and memcg2 (mid priority) uses disks through vswap
> and fallback to HDD. memcg3 (low prio) is only allowed to use slow
> devices.
>
> Global fallback just uses everything the system has. It might be overly
> complex though?
Looking at the example usage you describe, it seems flexible and good.
I will think about this further in that context.
Thread overview: 25+ messages
2025-06-12 10:37 youngjun.park
2025-06-12 10:37 ` [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control youngjun.park
2025-06-17 12:23 ` Michal Koutný
2025-06-18 0:32 ` YoungJun Park
2025-06-18 9:11 ` Michal Koutný
2025-06-18 12:07 ` YoungJun Park
2025-06-30 17:39 ` Michal Koutný
2025-07-01 13:08 ` YoungJun Park
2025-07-07 9:59 ` Michal Koutný
2025-07-07 14:45 ` YoungJun Park
2025-07-07 14:57 ` YoungJun Park
2025-06-12 10:37 ` [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer youngjun.park
2025-06-12 11:14 ` Kairui Song
2025-06-12 11:16 ` Kairui Song
2025-06-12 17:28 ` Nhat Pham
2025-06-12 18:20 ` Kairui Song
2025-06-12 20:08 ` Nhat Pham
2025-06-13 7:11 ` YoungJun Park
2025-06-13 7:36 ` Kairui Song
2025-06-13 7:38 ` Kairui Song
2025-06-13 10:45 ` YoungJun Park
2025-06-13 6:49 ` YoungJun Park
2025-06-12 12:24 ` [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization Kairui Song
2025-06-12 21:32 ` Nhat Pham
2025-06-13 6:56 ` YoungJun Park [this message]