From: Shakeel Butt <shakeel.butt@linux.dev>
To: Chris Li <chrisl@kernel.org>
Cc: YoungJun Park <youngjun.park@lge.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Kairui Song <kasong@tencent.com>,
	 Kemeng Shi <shikemeng@huaweicloud.com>,
	Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
	 Barry Song <baohua@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	 Muchun Song <muchun.song@linux.dev>,
	gunho.lee@lge.com, taejoon.song@lge.com, austin.kim@lge.com
Subject: Re: [RFC PATCH v2 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Date: Sat, 21 Feb 2026 09:44:01 -0800
Message-ID: <20260221163043.GA35350@shakeel.butt@linux.dev>
In-Reply-To: <CACePvbU=4f4gT5kHUBq0wD7COHN+quE5g4bPQqJYgJNx_9vuhg@mail.gmail.com>

On Fri, Feb 20, 2026 at 10:07:44PM -0800, Chris Li wrote:
> >
[...]
> > >
> > > I agree that using BPF would provide greater flexibility, allowing
> > > control not just at the memcg level, but also per-process or for
> > > complex workloads (such as an orchestrator or a node controller).
> >
> > Yes, it provides the flexibility, but that is not the main reason I am
> > pushing for it. I want you to first try the BPF approach without
> > introducing any stable interfaces, and show how swap tiers will be
> > used and configured in a production
> 
> Is that your biggest concern?

No, that is secondary. My main concern is that I am not seeing the real
use-case for controlling/partitioning swap devices among sub-workloads.
Until that is figured out, adding a stable API is premature.

> Many different ways exist to solve that problem, e.g. we can protect
> it with a config option and mark it as experimental. That would
> unblock development and allow experimentation, so more people can try
> it out and give feedback.
> 
> > environment, and then we can talk about whether a stable interface is
> > needed. I am still not convinced that swap tiers need to be controlled
> > hierarchically, or that non-root cgroups should be able to control them.
> 
> Yes, my company uses different swap devices at different cgroup
> levels. I did ask my coworker to confirm that usage. Control at the
> non-root level is a real need.

I am assuming you mean Google, and particularly the Prodkernel team,
not Android or ChromeOS. Google's prodkernel used to have per-cgroup
swapfiles exposed through memory.swapfiles (if I remember correctly,
Suleiman implemented this along with ghost swapfiles). Later this was
deprecated (by Yu Zhao) and global (ghost) swapfiles were used instead.
The memory.swapfiles interface, instead of supporting real swapfiles,
started offering a selection among default, ghost/zswap and real
(something like that). However, that interface was only ever used to
disable or enable zswap for a workload, never to hierarchically
control swap devices (Google's prodkernel only has zswap). Has
something changed?
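
For context, using a select-style cgroup knob like that from userspace
looks roughly like the sketch below. memory.swapfiles is not an
upstream interface, so the path and the option string here are
assumptions based on my description above, not a real API.

/*
 * Hypothetical illustration only: memory.swapfiles is not upstream;
 * the path and option names are assumed from the description above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/job/memory.swapfiles";
	const char *opt = "zswap";	/* or "default" / "real" */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* cgroup control files take a short string written in one go */
	if (write(fd, opt, strlen(opt)) < 0)
		perror("write");
	close(fd);
	return 0;
}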

> 
> >
> > >
> > > However, I am concerned that this level of freedom might introduce logical
> > > contradictions, particularly regarding cgroup hierarchy semantics.
> > >
> > > For example, BPF might allow a topology that violates hierarchical constraints
> > > (a concern that was also touched upon during LPC)
> >
> > Yes, BPF provides more power, but it is controlled by the admin, and
> > the admin can shoot themselves in the foot in multiple ways.
> 
> I think this swap device control is a very basic need.

Please explain that very basic need.

> All your
> objections to swap control in the cgroup can equally apply to
> zswap.writeback. Unlike zswap.writeback, which only controls zswap
> behavior, this is a more generic version that controls swap devices
> other than zswap as well. BTW, when zswap.writeback was proposed, I
> raised the concern that it was not generic enough as a swap control.
> We did not hold back zswap.writeback over it; the consensus was that
> the interface could be improved in later iterations. So here we are.

This just motivates me to push back even harder on adding a new
interface without a clear use-case.

> 
> >
> > >
> > >   - Group A (Parent): Assigned to SSD1
> > >   - Group B (Child of A): Assigned to SSD2
> > >
> > > If Group A has a `memory.swap.max` limit, and Group B swaps out to SSD2, it
> > > creates a consistency issue. Group B consumes Group A's swap quota, but it is
> > > utilizing a device (SSD2) that is distinct from the Parent's assignment. This
> > > could lead to situations where the Parent's limit is exhausted by usage on a
> > > device it effectively doesn't "own" or shouldn't be using.
> > >
> > > One might suggest restricting BPF to strictly adhere to these hierarchical
> > > constraints.
> >
> > No need to constrain anything.
> >
> > Taking a step back, can you describe your use-case a bit more and share
> > requirements?
> 
> There is a very long thread on the linux-mm mailing list. I'm too lazy to dig it up.
> 
> I can share our usage requirement to refresh your memory. We
> internally use a cgroup swapfile control interface that has not been
> upstreamed. With this series, we could drop that internal interface
> and use upstream instead.

I already asked above, but let me say it again: what's the actual
real-world use-case for hierarchically controlling/allowing/disallowing
swap devices?

> >
> > You have multiple swap devices with different properties and you want
> > to assign those swap devices to different workloads. Now a couple of
> > questions:
> >
> > 1. If more than one device is assigned to a workload, do you want some
> >    kind of ordering between them for the workload, or do you want the
> >    option of a round-robin kind of policy?
> 
> It depends on the number of devices in the tiers. Different tiers
> maintain an order; within the same tier, round robin. (A sketch of
> this selection policy appears at the end of this message.)
> 
> >
> > 2. What's the reason to use 'tiers' in the name? Is it similar to
> >    memory tiers, and do you want promotion/demotion among the tiers?
> 
> I proposed the tier name. Guilty. Yes, it was inspired by memory
> tiers. It just denotes different classes of swap speed. I am not
> attached to the name; we could also call it
> swap.device_speed_classes. You can suggest alternatives.
> 
> Promotion/demotion is possible in the future. The current state,
> without promotion or demotion, already provides value. Our current
> deployment uses only one class of swap device at a time. However, I do
> know other companies use more than one class of swap device.
> 
> >
> > 3. If a workload has multiple swap devices assigned, can you describe
> >    the scenario where such a workload needs to partition/divide those
> >    devices among its sub-workloads?
> 
> In our deployment, we always use more than one swap device to reduce
> swap device lock contention.

Having more than one swap device to reduce lock contention is unrelated
to hierarchically controlling swap devices among sub-workloads.
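
That said, to pin down the selection policy described above (ordered
across tiers, round robin within a tier), here is a minimal
self-contained sketch. All names and data layout are illustrative
assumptions, not the posted patches.

/*
 * Sketch of the policy described above: tiers are tried in priority
 * order (fastest first), and devices within the same tier are used
 * round robin.
 */
#include <stddef.h>

struct swap_dev {
	int id;
	long free_slots;
};

struct swap_tier {
	struct swap_dev *devs;
	size_t ndevs;
	size_t next;	/* round-robin cursor within this tier */
};

/* Pick a device: walk tiers in order, rotate within a tier. */
static struct swap_dev *pick_swap_dev(struct swap_tier *tiers,
				      size_t ntiers)
{
	for (size_t t = 0; t < ntiers; t++) {
		struct swap_tier *tier = &tiers[t];

		for (size_t i = 0; i < tier->ndevs; i++) {
			size_t idx = (tier->next + i) % tier->ndevs;
			struct swap_dev *dev = &tier->devs[idx];

			if (dev->free_slots > 0) {
				tier->next = (idx + 1) % tier->ndevs;
				return dev;
			}
		}
	}
	return NULL;	/* all assigned tiers are full */
}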


