From: YoungJun Park <youngjun.park@lge.com>
To: Chris Li <chrisl@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>,
akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org,
roman.gushchin@linux.dev, shakeel.butt@linux.dev,
muchun.song@linux.dev, shikemeng@huaweicloud.com,
kasong@tencent.com, nphamcs@gmail.com, bhe@redhat.com,
baohua@kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, gunho.lee@lge.com,
iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
"Matthew Wilcox" <willy@infradead.org>,
"David Hildenbrand" <david@redhat.com>,
"Kairui Song" <ryncsn@gmail.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
Date: Fri, 22 Aug 2025 14:45:18 +0900
Message-ID: <aKgD7nZy7U+rHt9X@yjaykim-PowerEdge-T330>
In-Reply-To: <CAF8kJuM4f2W6w29VcHY5mgXVMYmTF4yORKaFky6bCjS1xRek9Q@mail.gmail.com>
I still believe that the priority based approach has more flexibility,
and can cover more usage scenarios. That opinion has not changed.
However, from this discussion I came to clearly understand and agree on
three points:
1. The swap.tier idea can be implemented in a much simpler way.
2. It can cover the most important use cases I initially needed, as well
as common performance scenarios, without causing LRU inversion.
3. There is no concrete use case today that truly requires arbitrary
ordering. The scenario I suggested is hypothetical (it is only a
possibility).
I have also considered the situation where I might need to revisit my
original idea in the future. I believe this would still be manageable
within the swap.tier framework. For example:
* If, after swap.tier is merged, an arbitrary ordering use case arises
(which you do not consider concrete), it could be solved by allowing
cgroups to remap the tier order individually.
* If reviewers later decide to go back to the priority based direction,
I think that will still be possible. By then, much of the work would
already be done in patch v2, so switching back would remain feasible.
Also, since I highly respect your long-time contributions and deep
thinking in the swap layer, I have decided to move the idea forward
based on swap.tier.
For now, I would like to share the first major direction change I am
considering, and get feedback on how to proceed. If you think this path
is promising, please advise whether I should continue as patch v2, or
send a new RFC series or new patch series.
-----------------------------------------------------------------------
1. Interface
-----------------------------------------------------------------------
In the initial thread you replied with the following examples:
> Here are a few examples:
> e.g. consider the following cgroup hierarchy a/b/c/d, a as the first
> level cgroup.
> a/swap.tiers: "- +compress_ram"
> it means who shall not be named is set to opt out, optin in
> compress_ram only, no ssd, no hard.
> Who shall not be named, if specified, has to be the first one listed
> in the "swap.tiers".
>
> a/b/swap.tiers: "+ssd"
> For b cgroup, who shall not be named is not specified, the tier is
> appended to the parent "a/swap.tiers". The effective "a/b/swap.tiers"
> become "- +compress_ram +ssd"
> a/b can use both zswap and ssd.
>
> Every time the who shall not be named is changed, it can drop the
> parent swap.tiers chain, starting from scratch.
>
> a/b/c/swap.tiers: "-"
>
> For c, it turns off all swap. The effective "a/b/c/swap.tiers" become
> "- +compress_ram +ssd -" which simplify as "-", because the second "-"
> overwrites all previous optin/optout results.
> In other words, if the current cgroup does not specify the who shall
> not be named, it will walk the parent chain until it does. The global
> "/" for non cgroup is on.
>
> a/b/c/d/swap.tiers: "- +hdd"
> For d, only hdd swap, nothing else.
>
> More example:
> "- +ssd +hdd -ssd" will simplify to: "- +hdd", which means hdd only.
> "+ -hdd": No hdd for you! Use everything else.
>
> Let me know what you think about the above "swap.tiers"(name TBD)
> proposal.
My opinion is that instead of mapping priority into named concepts, it
may be simpler to represent it as plain integers.
(The integers are assigned in sequential order, as explained in the following reply.)
This would make the interface almost identical to the cpuset style suggested by Koutný.
For example:
echo 1-8,9-10 > a/swap.tier    # parent allows tier ranges 1-8 and 9-10
echo 1-4,9 > a/b/swap.tier     # child uses tiers 1-4 and 9, within the parent's range
echo 20 > a/b/swap.tier        # invalid: parent only allows 1-8 and 9-10
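To make the intended parent/child rule concrete, here is a rough userspace model in Python (not kernel code). The function names parse_tier_list() and check_child() are hypothetical, and the assumption that tier numbers start at 1 is mine, taken only from the examples above. It just shows that a child's list is accepted only when it is a subset of the parent's list.

```python
def parse_tier_list(text):
    """Parse a cpuset-style list such as "1-4,9" into a set of tier numbers."""
    tiers = set()
    for part in text.strip().split(","):
        if not part:
            raise ValueError(f"empty item in {text!r}")
        lo, sep, hi = part.partition("-")
        first = int(lo)                      # raises ValueError on garbage
        last = int(hi) if sep else first     # single number means a one-tier range
        if first < 1 or last < first:        # assumption: tiers are numbered from 1
            raise ValueError(f"bad range {part!r}")
        tiers.update(range(first, last + 1))
    return tiers


def check_child(parent_text, child_text):
    """Return the child's tier set, or raise if it leaves the parent's set."""
    parent = parse_tier_list(parent_text)
    child = parse_tier_list(child_text)
    extra = child - parent
    if extra:
        raise ValueError(f"tiers {sorted(extra)} not allowed by parent")
    return child
```

With this model, check_child("1-8,9-10", "1-4,9") is accepted, while check_child("1-8,9-10", "20") raises ValueError, matching the third echo example.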
Named concepts can be handled by userland software. The kernel only
provides a simple integer mapping, and userland software can present it
to users as "named" tiers.
Regarding the mapping of names to ranges, as you also mentioned:
> There is a simple mapping of global swap tier names into priority
> range
> The name itself is customizable.
> e.g. 100+ is the "compress_ram" tier. 50-99 is the "SSD" tier,
> 0-55 is the "hdd" tier.
> The detailed mechanization and API is TBD.
> The end result is a simple tier name lookup will get the priority
> range.
> By default all swap tiers are available for global usage without
> cgroup. That matches the current global swap on behavior.
One idea would be to provide a /proc/swaptier interface:
echo "100 40" > /proc/swaptier
This would mean:
* >=100 : tier 1
* 40–99 : tier 2
* <40 : tier 3
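The mapping above can be sketched as a small userspace model in Python (not kernel code). The names parse_boundaries() and tier_of() are hypothetical, and requiring the boundaries to be strictly descending is my assumption; the format of /proc/swaptier itself is still only a proposal.

```python
def parse_boundaries(text):
    """Parse a string such as "100 40" into descending lower bounds."""
    bounds = [int(tok) for tok in text.split()]
    # Assumption: each boundary must be lower than the one before it.
    if any(a <= b for a, b in zip(bounds, bounds[1:])):
        raise ValueError("boundaries must be strictly descending")
    return bounds


def tier_of(priority, bounds):
    """Map a swap device priority to a tier number (1 is the highest tier)."""
    for index, lower in enumerate(bounds, start=1):
        if priority >= lower:
            return index
    # Below every boundary: the device falls into the last tier.
    return len(bounds) + 1
```

For "100 40", a device with priority 100 maps to tier 1, priorities 99 and 40 map to tier 2, and priority 39 maps to tier 3.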
How do you feel about this approach?
-----------------------------------------------------------------------
2. NUMA autobind
-----------------------------------------------------------------------
If NUMA autobind is in use, perhaps it is best to simply disallow
swaptier settings. I expect workloads depending on autobind would rely
on it globally, rather than per-cgroup. Therefore, when a negative
priority is present, tier grouping could reject the configuration.
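As a rough illustration of that rejection rule, again as a Python model rather than kernel code: validate_tier_grouping() is a hypothetical name, and the dictionary of device names to priorities is only an assumed input shape.

```python
def validate_tier_grouping(device_priorities):
    """Reject tier grouping if any swap device has a negative priority.

    device_priorities maps a device name to its swap priority
    (an assumed shape, used here only for illustration).
    """
    negative = sorted(
        name for name, prio in device_priorities.items() if prio < 0
    )
    if negative:
        raise ValueError(
            "tier grouping rejected, negative priority on: " + ", ".join(negative)
        )
```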
-----------------------------------------------------------------------
3. Implementation
-----------------------------------------------------------------------
My initial thought is to implement a simple bitmask check. That is, in
the slow swap path, check whether the cgroup has selected the given
tier. This is simple, but I worry it might lose the optimization of the
current priority list, where devices are dynamically tracked as they
become available or unavailable.
So perhaps a better design is to make swap tier an object, and have
each cgroup traverse only the priority list of the tiers it selected. I
would like feedback on whether this design makes sense.
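To compare the two options, here is a minimal Python model (not kernel code, and not the existing swap data structures). SwapTier, pick_by_bitmask() and pick_by_tier_objects() are hypothetical names. In the first variant every device on the global list is visited and filtered by the cgroup's mask; in the second, each tier keeps its own list of currently available devices and the cgroup only walks the tiers it selected.

```python
from dataclasses import dataclass, field


def pick_by_bitmask(all_devices, tier_mask):
    """Option 1: scan the global list and skip devices outside the mask.

    all_devices is a list of (priority, tier, name), highest priority first.
    """
    for _priority, tier, name in all_devices:
        if tier_mask & (1 << tier):
            return name
    return None


@dataclass
class SwapTier:
    index: int
    # Names of devices that are currently available, highest priority first.
    devices: list = field(default_factory=list)


def pick_by_tier_objects(tiers, selected):
    """Option 2: walk only the selected tiers, each with its own device list.

    tiers is ordered with tier 1 first; selected is a set of tier indexes.
    """
    for tier in tiers:
        if tier.index in selected and tier.devices:
            return tier.devices[0]
    return None
```

In the second variant a device that becomes unavailable is simply removed from its tier's list, so the per-tier lists keep the dynamic tracking that a plain mask check on the global list would have to redo on every scan.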
-----------------------------------------------------------------------
Finally, I want to thank all reviewers for the constructive feedback.
Even if we move to the swap.tier approach, the reviews from Kairui, Nhat
Pham and Koutný are still valid and will remain relevant.
Kairui, Nhat Pham
* Regarding per-cgroup per-cluster feedback: this would likely need to
be adapted to the tier-based design.
* Regarding passing percpu info along the allocation path: since tier is
selected per-cgroup, this may still be needed, depending on
implementation.
Koutný
* Regarding NUMA autobind complexity: as explained above, I intend to
design the mechanism so that autobind does not affect it. Parent-child
semantics will remain essentially identical to cpuset. If the proposed
interface is accepted, its usage would be like cpuset, which should be
less controversial.
---
Thank you again for the suggestions. I will continue to review while
waiting for your feedback.
Best Regards,
Youngjun Park