Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave

From: Rakie Kim <rakie.kim@sk.com>
To: Gregory Price <gourry@gourry.net>
Cc: joshua.hahnjy@gmail.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org, dan.j.williams@intel.com,
	ying.huang@linux.alibaba.com, kernel_team@skhynix.com,
	honggyu.kim@sk.com, yunjeong.mun@sk.com,
	Rakie Kim <rakie.kim@sk.com>
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
Date: Mon, 12 May 2025 17:22:50 +0900
Message-ID: <20250512082257.263-1-rakie.kim@sk.com>
In-Reply-To: <aB2Xh4jEqpSTuvsi@gourry-fedora-PF4VCD3F>

On Fri, 9 May 2025 01:49:59 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Fri, May 09, 2025 at 11:30:26AM +0900, Rakie Kim wrote:
> > 
> > Scenario 1: Adapt weighting based on the task's execution node
> > A task prefers only the DRAM and locally attached CXL memory of the
> > socket on which it is running, in order to avoid cross-socket access and
> > optimize bandwidth.
> > - A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
> > - A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
> ... snip ...
> > 
> > However, Scenario 1 does not depend on such information. Rather, it is
> > a locality-preserving optimization where we isolate memory access to
> > each socket's DRAM and CXL nodes. I believe this use case is implementable
> > today and worth considering independently from interconnect performance
> > awareness.
> > 
> 
> There's nothing to implement - all the controls exist:
> 
> 1) --cpunodebind=0
> 2) --weighted-interleave=0,2
> 3) cpuset.mems
> 4) cpuset.cpus

Thank you again for your thoughtful response and the detailed suggestions.

As you pointed out, it is indeed possible to construct node-local memory
allocation behaviors using the existing interfaces such as --cpunodebind,
--weighted-interleave, cpuset.mems, and cpuset.cpus. I appreciate you
highlighting that path.

However, what I am proposing in Scenario 1 (Adapt weighting based on the
task's execution node) is slightly different in intent.

The idea is to allow tasks to dynamically prefer the DRAM and CXL nodes
attached to the socket on which they are executing without requiring a
fixed execution node or manual nodemask configuration. For instance, if
a task is running on node0, it would prefer node0 and node2; if running
on node1, it would prefer node1 and node3.

This differs from the current model, which relies on statically binding
both the CPU and memory nodes. My proposal aims to express this behavior
as a policy-level abstraction that dynamically adapts based on execution
locality.

So rather than being a combination of manual configuration and execution
constraints, the intent is to incorporate locality-awareness into the
memory policy itself.
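
To illustrate the intent, here is a minimal userspace sketch in Python.
It is only a model of the behavior, not kernel code. The class name, the
table layout, and the mapping of CXL0/CXL1 to node2/node3 are assumptions
made for this example; the weights (DRAM w=3, CXL w=1) are the ones from
Scenario 1. The point is that the executing node is looked up at each
allocation, so no fixed CPU binding or manual nodemask is needed.

```python
# Assumed topology: executing node -> [(memory node, weight), ...]
LOCAL_WEIGHTS = {
    0: [(0, 3), (2, 1)],  # running on node0: DRAM0 (w=3), CXL0 as node2 (w=1)
    1: [(1, 3), (3, 1)],  # running on node1: DRAM1 (w=3), CXL1 as node3 (w=1)
}


class LocalWeightedInterleave:
    """Toy model of a policy that interleaves only over socket-local nodes."""

    def __init__(self, weights):
        # Expand each socket's weights into a repeating pattern,
        # e.g. [(0, 3), (2, 1)] becomes [0, 0, 0, 2].
        self.patterns = {
            sock: [nid for nid, w in nodes for _ in range(w)]
            for sock, nodes in weights.items()
        }
        # Keep an independent interleave position per socket.
        self.pos = {sock: 0 for sock in weights}

    def next_node(self, exec_node):
        """Pick the node for one allocation made while running on exec_node."""
        pattern = self.patterns[exec_node]
        nid = pattern[self.pos[exec_node] % len(pattern)]
        self.pos[exec_node] += 1
        return nid


policy = LocalWeightedInterleave(LOCAL_WEIGHTS)
on_node0 = [policy.next_node(0) for _ in range(4)]  # task runs on node0
on_node1 = [policy.next_node(1) for _ in range(4)]  # task migrated to node1
print(on_node0, on_node1)
```

In this model the task allocates from nodes 0 and 2 while on node0 and
switches to nodes 1 and 3 after migrating, without any reconfiguration.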

> 
> You might consider maybe something like "--local-tier" (akin to
> --localalloc) that sets an explicit fallback set based on the local
> node.  You'd end up doing something like
> 
> current_nid = memtier_next_local_node(socket_nid, current_nid)
> 
> Where this interface returns the preferred fallback ordering but doesn't
> allow cross-socket fallback.
> 
> That might be useful, i suppose, in letting a user do:
> 
> --cpunodebind=0 --weighted-interleave --local-tier
> 
> without having to know anything about the local memory tier structure.

That said, I believe your suggestion for a "--local-tier" option is a
very good one. It could provide a concise, user-friendly way to activate
such locality-aware fallback behavior, even if the underlying mechanism
requires some policy extension.

In this regard, I fully agree that such an interface could greatly help
users express their intent without requiring them to understand the
details of the memory tier topology.
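
As a rough illustration of how I read the suggested lookup, here is a
small Python sketch. The function name is taken from your example, but
everything else is an assumption on my side: the per-socket table, the
use of None to mean "start of the list", and returning None when the
local tiers are exhausted instead of falling back across sockets.

```python
# Assumed per-socket fallback order, fastest tier first (DRAM, then CXL).
LOCAL_TIERS = {
    0: [0, 2],
    1: [1, 3],
}


def memtier_next_local_node(socket_nid, current_nid=None):
    """Return the next fallback node local to socket_nid.

    current_nid=None asks for the first (preferred) node. None is returned
    once the socket's own tiers are exhausted, so the caller never falls
    back to a node that belongs to another socket.
    """
    order = LOCAL_TIERS[socket_nid]
    if current_nid is None:
        return order[0]
    idx = order.index(current_nid)  # ValueError if the node is not local
    return order[idx + 1] if idx + 1 < len(order) else None


# Walk the fallback chain for a task running on socket 0.
chain = []
nid = memtier_next_local_node(0)
while nid is not None:
    chain.append(nid)
    nid = memtier_next_local_node(0, nid)
print(chain)
```

With such a table derived from the memory tier structure, the user would
only need to request the behavior and would not have to name the nodes.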

> 
> > > At the same time we were discussing this, we were also discussing how to
> > > do external task-mempolicy modifications - which seemed significantly
> > > more useful, but ultimately more complex and without sufficient
> > > interested parties / users.
> > 
> > I'd like to learn more about that thread. If you happen to have a pointer
> > to that discussion, it would be really helpful.
> > 
> 
> https://lore.kernel.org/all/20231122211200.31620-1-gregory.price@memverge.com/
> https://lore.kernel.org/all/ZV5zGROLefrsEcHJ@r13-u19.micron.com/
> https://lore.kernel.org/linux-mm/ZWYsth2CtC4Ilvoz@memverge.com/
> https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/
> There are locking issues with these that aren't easy to fix.
> 
> I think the bytedance method uses a task_work queueing to defer a
> mempolicy update to the task itself the next time it makes a kernel/user
> transition.  That's probably the best overall approach i've seen.
> 
> https://lore.kernel.org/linux-mm/ZWezcQk+BYEq%2FWiI@memverge.com/
> More notes gathered prior to implementing weighted interleave.

Thank you for sharing the earlier links to related discussions and
patches. They were very helpful, and I will review them carefully to
gather more ideas and refine my thoughts further.

I look forward to any further feedback you may have on this topic.

Best regards,
Rakie

> 
> ~Gregory
>
