From: Rakie Kim <rakie.kim@sk.com>
To: Gregory Price <gourry@gourry.net>
Cc: joshua.hahnjy@gmail.com, akpm@linux-foundation.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, dan.j.williams@intel.com,
ying.huang@linux.alibaba.com, kernel_team@skhynix.com,
honggyu.kim@sk.com, yunjeong.mun@sk.com,
Rakie Kim <rakie.kim@sk.com>
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
Date: Fri, 9 May 2025 11:30:26 +0900
Message-ID: <20250509023032.235-1-rakie.kim@sk.com>
In-Reply-To: <aBzJ42b8zIThYo1X@gourry-fedora-PF4VCD3F>

On Thu, 8 May 2025 11:12:35 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@gourry.net> wrote:
> >
> > The proposed design is completely optional and isolated: it retains the
> > existing flat weight model as-is and activates the source-aware behavior only
> > when 'multi' mode is enabled. The complexity is scoped entirely to users who
> > opt into this mode.
> >
>
> I get what you're going for, just expressing my experience around this
> issue specifically.
Thank you very much for your response. Your prior experience and insights
have been extremely helpful in refining how I think about this problem.
>
> The lack of enthusiasm for solving the cross-socket case, and thus
> reduction from a 2D array to a 1D array, was because reasoning about
> interleave w/ cross-socket interconnects is not really feasible with
> the NUMA abstraction. Cross-socket interconnects are "Invisible" but
> have real performance implications. Unless we have a way to:
>
> 1) Represent the topology, AND
> 2) Get performance data about that topology
>
> It's not useful. So NUMA is an incomplete (if not wrong) tool for this.
Your comment gave me an opportunity to reconsider the purpose of the
feature I originally proposed. In fact, I had two different scenarios
in mind when outlining this direction.

Scenario 1: Adapt weighting based on the task's execution node

A task prefers only the DRAM and the locally attached CXL memory of
the socket on which it is running, in order to avoid cross-socket
access and make the best use of local bandwidth (see the sketch
below):
- A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
- A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
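
To make this concrete, here is a purely illustrative C sketch of what
a source-aware weight lookup could look like under the proposed
'multi' mode. All names, the table shape, and the node numbering
(node0/node1 = DRAM0/DRAM1, node2/node3 = CXL0/CXL1) are my own
assumptions for this example, not the actual patch:

	/* Hypothetical sketch only: under 'multi' mode, the weight
	 * table gains a second dimension indexed by the allocating
	 * task's current node. */
	#define NR_NODES 4

	static const unsigned char iw_table[NR_NODES][NR_NODES] = {
		/* dst:  node0  node1  node2  node3 */
		[0] = {  3,     0,     1,     0  },  /* task on node0 */
		[1] = {  0,     3,     0,     1  },  /* task on node1 */
	};

	static unsigned char weighted_interleave_weight(int src_nid,
							int dst_nid)
	{
		/* A zero weight means "never allocate from dst_nid". */
		return iw_table[src_nid][dst_nid];
	}

A task on node0 would then interleave 3:1 across DRAM0 and CXL0 only,
and a task on node1 across DRAM1 and CXL1 only, without either task
having to know which socket it was placed on.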

Scenario 2: Reflect relative memory access performance

The system adjusts weights based on expected bandwidth differences for
remote accesses. This relies on having access to interconnect
performance data, which NUMA currently does not expose.
As you rightly pointed out, Scenario 2 depends on being able to measure
or model the cost of cross-socket access, which is not available in the
current abstraction. I now realize that this case is less actionable and
needs further research before being pursued.

However, Scenario 1 does not depend on such information. Rather, it is
a locality-preserving optimization that confines each task's memory
access to its own socket's DRAM and CXL nodes. I believe this use case
is implementable today and worth considering independently of
interconnect performance awareness.
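
For reference, something close to Scenario 1 is already expressible
per task with the existing flat model, by combining the global sysfs
weights with a nodemask restricted to the local socket. A minimal
sketch, assuming node0 = DRAM0 and node2 = CXL0 (the node numbering is
an assumption for illustration) and weights set beforehand via

	echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
	echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2

	/* Build with -lnuma. */
	#include <numaif.h>
	#include <stdio.h>

	#ifndef MPOL_WEIGHTED_INTERLEAVE
	#define MPOL_WEIGHTED_INTERLEAVE 6	/* uapi value since v6.9 */
	#endif

	int main(void)
	{
		/* Restrict this task to socket 0's DRAM and CXL nodes. */
		unsigned long nodemask = (1UL << 0) | (1UL << 2);

		if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
				  sizeof(nodemask) * 8) != 0) {
			perror("set_mempolicy");
			return 1;
		}

		/* New anonymous allocations now interleave 3:1 across
		 * node0 and node2 and never touch the remote socket. */
		return 0;
	}

The gap this proposal targets is that the nodemask must be chosen by
hand for each task and does not follow the task across sockets, and
the weights themselves are global rather than per source node.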
>
> Additionally - reacting to task migration is not a real issue. If
> you're deploying an allocation strategy, you probably don't want your
> task migrating away from the place where you just spent a bunch of time
> allocating based on some existing strategy. So the solution is: don't
> migrate, and if you do - don't use cross-socket interleave.
That's a fair point. I also agree that handling migration is not critical
at this stage, and I'm not actively focusing on that aspect in this
proposal.
>
> Maybe if we solve the first half of this we can take a look at the task
> migration piece again, but I wouldn't try to solve for migration.
>
> At the same time we were discussing this, we were also discussing how to
> do external task-mempolicy modifications - which seemed significantly
> more useful, but ultimately more complex and without sufficient
> interested parties / users.
I'd like to learn more about that thread. If you happen to have a pointer
to that discussion, it would be really helpful.
>
> ~Gregory
>
Thanks again for sharing your insights. I will follow up with a refined
proposal based on the localized socket-based routing model (Scenario 1)
and, for now, defer the parts that depend on topology performance
measurement.
Rakie