linux-mm.kvack.org archive mirror
From: Chris Li <chrisl@kernel.org>
To: YoungJun Park <youngjun.park@lge.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	 linux-mm@kvack.org, Kairui Song <kasong@tencent.com>,
	 Kemeng Shi <shikemeng@huaweicloud.com>,
	Nhat Pham <nphamcs@gmail.com>,  Baoquan He <bhe@redhat.com>,
	Barry Song <baohua@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	 Muchun Song <muchun.song@linux.dev>,
	gunho.lee@lge.com, taejoon.song@lge.com,  austin.kim@lge.com,
	hyungjun.cho@lge.com
Subject: Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Date: Sat, 14 Mar 2026 10:32:56 -0700	[thread overview]
Message-ID: <CACePvbXhk2GFTay3OrPoqFU=hRt9N5fgx=FrWFQ6nj4Nyn7b8A@mail.gmail.com> (raw)
In-Reply-To: <aa9+eK/VEealbo8i@yjaykim-PowerEdge-T330>

Hi YoungJun,

On Mon, Mar 9, 2026 at 7:14 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Mon, Mar 02, 2026 at 01:27:31PM -0800, Shakeel Butt wrote:
> >
> > Hi YoungJun,
> >
> > Sorry for the late response.
> >
> > On Sun, Feb 22, 2026 at 10:16:04AM +0900, YoungJun Park wrote:
> > [...]
> >
> > Let me summarize our discussion first:
> >
> > You have a use-case with systems running multiple workloads and multiple
> > swap devices. Those swap devices have different performance capabilities,
> > and you want to restrict/assign swap devices to the workloads, for
> > example assigning a low-latency SSD swap device to a latency-sensitive
> > workload and slow disk swap to a latency-tolerant workload (please
> > correct me if I misunderstood something).
> >
> > The use-case seems reasonable to me, but I have concerns about adding an
> > interface to memory cgroups. Mainly, I am not clear on what hierarchical
> > semantics for such an interface would look like. In addition, I think it
> > would be too rigid and very hard to evolve for future features. To me,
> > enabling this functionality through BPF would give much more flexibility
> > and be more future-proof.
> >
> > >
> > > After reading the reply and re-think more of it.
> > >
> > > I have a few questions regarding the BPF-first approach you
> > > suggested, if you don't mind. Some of them I am re-asking
> > > because I feel they have not been clearly addressed yet.
> > >
> > > - We are in an embedded environment where enabling additional
> > >   kernel compile options is costly. BPF is disabled by
> > >   default in some of our production configurations. From a
> > >   trade-off perspective, does it make sense to enable BPF
> > >   just for swap device control?
> >
> > To me, it is reasonable to enable BPF for an environment running multiple
> > workloads and having multiple swap devices.
> >
> > >
> > > - You suggest starting with BPF and discussing a stable
> > >   interface later. I am genuinely curious, are there actual
> > >   precedents where a BPF prototype graduated into a stable
> > >   kernel interface?
> >
> > After giving it some thought, I think once we have BPF working, adding
> > another interface for the same feature would not be an option. So, we
> > have to decide upfront which route to take.
> >
> > >
> > > - You raised that stable interfaces are hard to remove. Would
> > >   gating it behind a CONFIG option or marking it experimental
> > >   be an acceptable compromise?
> >
> > I think hiding behind a CONFIG option does not really protect against
> > usage, and the rule of no API breakage usually still applies.
> >
> > >
> > > - You already acknowledged the use-case for assigning
> > >   different swap devices to different workloads. Your
> > >   objection is specifically about hierarchical parent-child
> > >   partitioning. If the interface enforced uniform policy
> > >   within a subtree, would that be acceptable?
> >
> > Let's start with that, or maybe come up with concrete examples of how
> > that would look.
> >
> > Besides, give a bit more thought to potential future features, e.g.
> > demotion, and reason about how you would incorporate those features.
> Hello Shakeel, Chris Li,
>
> Just sending a gentle ping on my previous reply. :D

Sorry for the late reply, busy days.

>
> To quickly summarize the main points:
> (I might have misunderstood your intention; please correct me if so :) )
>
> * Regarding Shakeel's BPF approach: moving to a stable interface later
>   would be difficult, so we need to choose a direction now. I prefer adding
>   it to memcg for immediate usage, and if it proves highly effective, we
>   can consider transitioning entirely to BPF later.

I am very concerned about locking down the kernel user interface just
because things might change in the future. If we need to use BPF to
get the stable user space API, I am fine with that. Completely
blocking the new cgroup interface because of a worry about future
change is not justifiable IMHO. There ought to be some intermediate
staging we can do, e.g. a debugfs interface, to test and play with the
new API. We should focus on designing the interface as well as possible
right now.

> * Shakeel seemed somewhat positive about having all child cgroups match
>   the parent's tiers when tiers are applied to a specific cgroup, and I
>   would like to start the discussion from there. Chris, I would appreciate
>   your thoughts on whether you agree with this direction of unifying all
>   swap tiers within the hierarchy as a first step.

Does that mean all children will only use the parent cgroup setting?
Wouldn't that be more restrictive and counteract the goal of making
the API more future-proof?
For the record, the current Google deployment uses a different swap
device for the child cgroup.
The typical setup is that the top-level cgroup is a job running on a
VM. Then there is a second cgroup level for the VMM guest memory
allocation; swap device selection occurs at this second level.
There is also zswap vs SSD; the SSD backend is new in this deployment.
So it's not just about enabling zswap or not. We also need to select
the swap device.
If you start from the current cgroup, it will need to walk the parent
cgroup chain to find the top-level cgroup anyway. I just think having
the hierarchy makes more sense.

> Here are some additional thoughts I had after my last reply:
> (Thanks to Hyungjun Cho for the insight and discussion.)
>
> * Cgroup distribution:
>   A direct use case where cgroup A distributes a portion to A' is hard to
>   imagine, but the following scenario is possible:
>
>   swap: +SSD +HDD +NET
>   cgroup hierarchy:
>   /
>   A : +HDD +NET
>   A'(app 1) +HDD, A''(app 2) +NET
>
>   Cgroup A has two interdependent apps, and +SSD is excluded for more critical
>   services. App1 (A') has a large hot working set and uses the faster
>   +HDD to avoid reclaim stalls, while App2 (A'') has a cold working set
>   on the slower but larger +NET.

A per-app interface is a huge departure from the cgroup model. The
cgroup is a well-defined interface.

>
> * Promotion / Demotion:
>   Unlike memory tiers, swap tiers are directly assigned by the user, providing
>   flexibility beyond just speed. Since swap priority is already a user choice,
>   this design makes perfect sense.

We need to find customers willing to use this promotion/demotion. I
hesitate to build something while hoping to find someone to use it
later.
It would be good to identify someone who can immediately use and test
this promotion/demotion feature.
We should focus the discussion on achieving a more flexible swap
device selection approach and reach a conclusion on the API discussion
before discussing promotion/demotion. If we can't even have a usable
swap tier interface, there is nothing to promote.

>   With this arbitrary assignment, we can support higher-to-slower tier
>   allocation, similar to current memory tiers, if the user binds the tiers
>   properly (which I think is more flexible).
>
>   Within the same tier (i.e. devices we define as equal speed), we could
>   apply round-robin or other distribution policies via an additional
>   tier-layer interface. The current equal-priority round-robin policy
>   could also be elevated to the tier layer.


Chris


Thread overview: 34+ messages
2026-01-26  6:52 Youngjun Park
2026-01-26  6:52 ` [RFC PATCH v2 v2 1/5] mm: swap: introduce swap tier infrastructure Youngjun Park
2026-02-12  9:07   ` Chris Li
2026-02-13  2:18     ` YoungJun Park
2026-02-13 14:33     ` YoungJun Park
2026-01-26  6:52 ` [RFC PATCH v2 v2 2/5] mm: swap: associate swap devices with tiers Youngjun Park
2026-01-26  6:52 ` [RFC PATCH v2 v2 3/5] mm: memcontrol: add interface for swap tier selection Youngjun Park
2026-01-26  6:52 ` [RFC PATCH v2 v2 4/5] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
2026-02-12  7:37   ` Chris Li
2026-01-26  6:52 ` [RFC PATCH v2 v2 5/5] mm, swap: introduce percpu swap device cache to avoid fragmentation Youngjun Park
2026-02-12  6:12 ` [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
2026-02-12  9:22   ` Chris Li
2026-02-13  2:26     ` YoungJun Park
2026-02-13  1:59   ` YoungJun Park
2026-02-12 17:57 ` Nhat Pham
2026-02-12 17:58   ` Nhat Pham
2026-02-13  2:43   ` YoungJun Park
2026-02-12 18:33 ` Shakeel Butt
2026-02-13  3:58   ` YoungJun Park
2026-02-21  3:47     ` Shakeel Butt
2026-02-21  6:07       ` Chris Li
2026-02-21 17:44         ` Shakeel Butt
2026-02-22  1:16           ` YoungJun Park
2026-03-02 21:27             ` Shakeel Butt
2026-03-04  7:27               ` YoungJun Park
2026-03-18  3:54                 ` Shakeel Butt
2026-03-18  4:57                   ` YoungJun Park
2026-03-10  2:14               ` YoungJun Park
2026-03-14 17:32                 ` Chris Li [this message]
2026-03-18  2:46                   ` YoungJun Park
2026-02-21 14:30       ` YoungJun Park
2026-02-23  5:56         ` Shakeel Butt
2026-02-27  2:43           ` YoungJun Park
2026-03-02 14:50           ` YoungJun Park
