From: YoungJun Park <youngjun.park@lge.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Chris Li <chrisl@kernel.org>,
	Kairui Song <kasong@tencent.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
	Barry Song <baohua@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	gunho.lee@lge.com, taejoon.song@lge.com, austin.kim@lge.com
Subject: Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Date: Fri, 13 Feb 2026 12:58:40 +0900	[thread overview]
Message-ID: <aY6hcPNxiolf5jj6@yjaykim-PowerEdge-T330>
In-Reply-To: <aY4bQFvpPRWgnOTM@linux.dev>

On Thu, Feb 12, 2026 at 10:33:22AM -0800, Shakeel Butt wrote:
> Hi Youngjun,
> 
> On Mon, Jan 26, 2026 at 03:52:37PM +0900, Youngjun Park wrote:
> > This is the second version of the RFC for the "Swap Tiers" concept.
> > Link to v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
> > 
> > This version incorporates feedback received during LPC 2025 and addresses
> > comments from the previous review. We have also included experimental
> > results based on usage scenarios intended for our internal platforms.
> > 
> > Motivation & Concept recap
> > ==========================
> > Current Linux swap allocation is global, limiting the ability to assign
> > faster devices to specific cgroups. Our initial attempt at per-cgroup
> > priorities proved over-engineered and caused LRU inversion.
> > 
> > Following Chris Li's suggestion, we pivoted to "Swap Tiers." A tier is
> > simply a user-named group of swap devices sharing the same priority range.
> > This abstraction facilitates swap device selection based on speed, allowing
> > users to configure specific tiers for cgroups.
> > 
> > For more details, please refer to the LPC 2025 presentation
> > https://lpc.events/event/19/contributions/2141/attachments/1857/3998/LPC2025Finalss.pdf
> > or v1 patch.
> > 
> 
> One of the LPC feedback you missed is to not add memcg interface for
> this functionality and explore BPF way instead.
> 
> We are normally very conservative to add new interfaces to cgroup.
> However I am not even convinced that memcg interface is the right way to
> expose this functionality. Swap is currently global and the idea to
> limit or assign specific swap devices to specific cgroups makes sense
> but that is the decision for the job orchestator or node controller.
> Allowing workloads to pick and choose swap devices do not make sense to
> me.

Apologies for overlooking the feedback regarding the BPF approach. Thank you
for the suggestion.

I agree that using BPF would provide greater flexibility, allowing control not
just at the memcg level, but also per-process or for complex workloads (such
as those managed by a job orchestrator or node controller).

However, I am concerned that this level of freedom might introduce logical
contradictions, particularly regarding cgroup hierarchy semantics.

For example, BPF might allow a topology that violates hierarchical constraints
(a concern that was also touched upon during LPC):

  - Group A (Parent): Assigned to SSD1
  - Group B (Child of A): Assigned to SSD2

If Group A has a `memory.swap.max` limit, and Group B swaps out to SSD2, it
creates a consistency issue. Group B consumes Group A's swap quota, but it is
utilizing a device (SSD2) that is distinct from the Parent's assignment. This
could lead to situations where the Parent's limit is exhausted by usage on a
device it effectively doesn't "own" or shouldn't be using.
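To make the contradiction concrete, here is a small Python model of the
accounting described above. All names here are illustrative only; this is not
the proposed kernel interface, just a sketch of hierarchical swap charging
with per-group device assignments:

```python
class Cgroup:
    """Toy model of a cgroup with a swap device assignment and a
    swap quota analogous to memory.swap.max (names are hypothetical)."""

    def __init__(self, name, device, swap_max=float("inf"), parent=None):
        self.name = name
        self.device = device      # tier/device this group is assigned to
        self.swap_max = swap_max  # analogous to memory.swap.max
        self.usage = 0
        self.parent = parent

    def swap_out(self, pages):
        # As in kernel swap accounting, the charge propagates up the
        # hierarchy: every ancestor's limit must admit the new usage.
        node = self
        while node is not None:
            if node.usage + pages > node.swap_max:
                raise RuntimeError(f"{node.name}: swap limit exceeded")
            node = node.parent
        node = self
        while node is not None:
            node.usage += pages
            node = node.parent
        # But the pages land on *this* group's device, regardless of
        # what the ancestors are assigned to.
        return self.device


a = Cgroup("A", device="SSD1", swap_max=100)
b = Cgroup("B", device="SSD2", parent=a)  # topology BPF could allow

dev = b.swap_out(100)
assert a.usage == 100    # A's quota is fully consumed ...
assert dev != a.device   # ... by pages on a device A was never assigned.
```

The charge succeeds because B stays within A's limit, yet A's entire quota
ends up backed by SSD2, a device outside A's assignment, which is exactly the
ownership mismatch described above.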

One might suggest restricting BPF so that it strictly adheres to these
hierarchical constraints. However, doing so would effectively eliminate the
primary advantage of using BPF: its flexibility. If we are to enforce
standard cgroup semantics anyway, a native interface seems more appropriate
than a constrained BPF hook.

Beyond this specific example, I suspect that delegating this logic to BPF
might introduce other unforeseen edge cases regarding hierarchy enforcement.
In my view, the BPF approach seems better suited as a next step than as the
initial design.

Since you acknowledged that the idea of assigning swap devices to cgroups
"makes sense," I believe implementing this within the standard, strictly
constrained "cgroup land" is preferable. 

A strict cgroup interface ensures
that hierarchy and accounting rules are consistently enforced, avoiding the
potential conflicts that the unrestricted freedom of BPF might create.

Ultimately, I hope this swap tier mechanism can serve as a foundation to be
leveraged by other subsystems, such as BPF and DAMON. I view this proposal as
the necessary first step toward that future.

Youngjun Park

