From: Chris Li <chrisl@kernel.org>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: YoungJun Park <youngjun.park@lge.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,  Kairui Song <kasong@tencent.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	 Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
	Barry Song <baohua@kernel.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	 Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	gunho.lee@lge.com,  taejoon.song@lge.com, austin.kim@lge.com
Subject: Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Date: Fri, 20 Feb 2026 22:07:44 -0800
Message-ID: <CACePvbU=4f4gT5kHUBq0wD7COHN+quE5g4bPQqJYgJNx_9vuhg@mail.gmail.com>
In-Reply-To: <aZjxP2sTavBRGC1l@linux.dev>

On Fri, Feb 20, 2026 at 7:47 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Please don't send a new version of the series before concluding the discussion
> on the previous one.

In this case I think it is fine. You haven't responded to YoungJun's
last response in over a week. He might have mistakenly assumed that the
discussion had concluded.
Consider it one of the iterations. It is hard enough to contribute
to the kernel. Relax.
Plus, discussions on the mailing list almost always involve differing
opinions, so it is hard to determine what is truly concluded.
Different people might have different interpretations of the same text.

>
> On Fri, Feb 13, 2026 at 12:58:40PM +0900, YoungJun Park wrote:
> > >
> > > One of the LPC feedback you missed is to not add memcg interface for
> > > this functionality and explore BPF way instead.
> > >
> > > We are normally very conservative to add new interfaces to cgroup.
> > > However I am not even convinced that memcg interface is the right way to
> > > expose this functionality. Swap is currently global and the idea to
> > > limit or assign specific swap devices to specific cgroups makes sense
> > > but that is the decision for the job orchestrator or node controller.
> > > Allowing workloads to pick and choose swap devices do not make sense to
> > > me.
> >
> > Apologies for overlooking the feedback regarding the BPF approach. Thank you
> > for the suggestion.
>
> No need for apologies. These things take time and multiple iterations.
>
> >
> > I agree that using BPF would provide greater flexibility, allowing control not
> > just at the memcg level, but also per-process or for complex workloads.
> > (As like orchestrator and node controller)
>
> Yes it provides the flexibility but that is not the main reason I am pushing for
> it. The reason is that I want you to first try the BPF approach without
> introducing any stable interfaces. Show how swap tiers will be used and configured in production

Is that your biggest concern? Many different ways exist to solve that
problem. E.g., we can put it behind a config option and mark it as
experimental. That will unblock development and allow experimentation.
We can have more people try it out and give feedback.

> environment and then we can talk if a stable interface is needed. I am still not
> convinced that swap tiers need to be controlled hierarchically and the non-root
> should be able to control it.

Yes, my company uses different swap devices at different cgroup
levels. I did ask my coworker to confirm that usage. Control at the
non-root level is a real need.

>
> >
> > However, I am concerned that this level of freedom might introduce logical
> > contradictions, particularly regarding cgroup hierarchy semantics.
> >
> > For example, BPF might allow a topology that violates hierarchical constraints
> > (a concern that was also touched upon during LPC)
>
> Yes BPF provides more power but it is controlled by admin and admin can shoot
> their foot in multiple ways.

I think this swap device control is a very basic need. All your
objections to swap control in the cgroup apply equally to
zswap.writeback. Unlike zswap.writeback, which only controls zswap
behavior, this is a more generic version that controls swap devices
other than zswap as well. BTW, I raised the concern that
zswap.writeback was not generic enough, as its swap control was
limited, when zswap was proposed. We did hold back zswap.writeback. The
consensus was that the interface can be improved in later iterations.
So here we are.

>
> >
> >   - Group A (Parent): Assigned to SSD1
> >   - Group B (Child of A): Assigned to SSD2
> >
> > If Group A has a `memory.swap.max` limit, and Group B swaps out to SSD2, it
> > creates a consistency issue. Group B consumes Group A's swap quota, but it is
> > utilizing a device (SSD2) that is distinct from the Parent's assignment. This
> > could lead to situations where the Parent's limit is exhausted by usage on a
> > device it effectively doesn't "own" or shouldn't be using.
> >
> > One might suggest restricting BPF to strictly adhere to these hierarchical
> > constraints.
>
> No need to constraint anything.
>
> Taking a step back, can you describe your use-case a bit more and share
> requirements?

There is a very long thread on the linux-mm mailing list. I'm too lazy to dig it up.

I can share our usage requirements to refresh your memory. We
internally use a cgroup swapfile control interface that has not been
upstreamed. With this series we can drop that internal interface and
use the upstream one instead.
>
> You have multiple swap devices of different properties and you want to assign
> those swap devices to different workloads. Now couple of questions:
>
> 1. If more than one device is assigned to a workload, do you want to have
>    some kind of ordering between them for the workload or do you want option to
>    have round robin kind of policy?

It depends on the number of devices in the tiers. Different tiers
maintain an order. Within the same tier, devices are used round robin.
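
To make that policy concrete, here is a rough user-space sketch of
"ordered across tiers, round robin within a tier". It is only an
illustration, not code from the series: SwapDevice, SwapTier,
allocate_slot and the free_slots counter are made-up names, and the
assumption that allocation falls through to the next tier once a tier is
full is mine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SwapDevice:
    # Hypothetical stand-in for a swap device; free_slots is illustrative.
    name: str
    free_slots: int


@dataclass
class SwapTier:
    # Hypothetical tier: a group of devices of similar speed.
    name: str
    devices: list[SwapDevice]
    next_index: int = 0  # round-robin cursor within this tier

    def pick(self) -> Optional[SwapDevice]:
        """Take one slot from the next device with free space, round robin."""
        n = len(self.devices)
        for step in range(n):
            i = (self.next_index + step) % n
            dev = self.devices[i]
            if dev.free_slots > 0:
                dev.free_slots -= 1
                # Advance the cursor past the device we just used.
                self.next_index = (i + 1) % n
                return dev
        return None  # every device in this tier is full


def allocate_slot(tiers: list[SwapTier]) -> Optional[SwapDevice]:
    """Walk tiers in order (assumed fastest first); fall through when full."""
    for tier in tiers:
        dev = tier.pick()
        if dev is not None:
            return dev
    return None  # no swap space left in any allowed tier
```

A cgroup restricted to certain tiers would, in this sketch, simply pass
a shorter tiers list to allocate_slot().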

>
> 2. What's the reason to use 'tiers' in the name? Is it similar to memory tiers
>    and you want promotion/demotion among the tiers?

I proposed the tier name. Guilty. Yes, it was inspired by memory tiers.
It is just different classes of swap speed. I am not fixed on the name.
We can also call it swap.device_speed_classes. You can suggest
alternatives.

Promotion / demotion is possible in the future. The current state,
without promotion or demotion, already provides value. Our current
deployment uses only one class of swap device at a time. However, I do
know other companies use more than one class of swap device.

>
> 3. If a workload has multiple swap devices assigned, can you describe the
>    scenario where such workloads need to partition/divide given devices to their
>    sub-workloads?

In our deployment, we always use more than one swap device to reduce
swap device lock contention.
The job config can describe the swap speed it can tolerate. Some jobs
can tolerate slower speeds, while others cannot.

> Let's start with these questions. Please note that I want us to not just look at
> the current use-case but brainstorm more future use-cases and then come up with
> the solution which is more future proof.

Take zswap.writeback as an example. We have a solution that worked for
the requirement at that time. Incremental improvement is fine as well.
Usually, incremental progress is better. At least currently there is a
real need to allow different cgroups to select different swap speeds.
There is a risk in being too future-proof: we might design things that
people in the future don't use as we envisioned. I see that happen too
often as well.

So starting from the current need is a solid starting point. It's just
a different design philosophy. Each to their own.

That is the only use case I know. YoungJun, feel free to add your
usage as well.

Chris


