From: Gregory Price <gourry@gourry.net>
To: Hyeonggon Yoo <hyeonggon.yoo@sk.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>,
"gourry@gourry.net" <gourry@gourry.net>,
kernel_team@skhynix.com, 42.hyeyoo@gmail.com,
"rafael@kernel.org" <rafael@kernel.org>,
"lenb@kernel.org" <lenb@kernel.org>,
"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
Honggyu Kim <honggyu.kim@sk.com>,
"ying.huang@linux.alibaba.com" <ying.huang@linux.alibaba.com>,
Rakie Kim <rakie.kim@sk.com>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
"Jonathan.Cameron@huawei.com" <Jonathan.Cameron@huawei.com>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"horen.chuang@linux.dev" <horen.chuang@linux.dev>,
"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"kernel-team@meta.com" <kernel-team@meta.com>
Subject: Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning
Date: Fri, 20 Dec 2024 09:13:50 -0500
Message-ID: <Z2V7ngtLXH99LqLe@gourry-fedora-PF4VCD3F>
In-Reply-To: <3682b9cf-213c-497d-ab81-f70e1a785716@sk.com>
On Fri, Dec 20, 2024 at 05:25:28PM +0900, Hyeonggon Yoo wrote:
> On 2024-12-20 4:18 AM, Joshua Hahn wrote:
... snip ...
>
> By the way, this might be out of scope, but let me ask for my own
> learning.
>
> We have a server with 2 sockets, each attached with local DRAM and CXL
> memory (and thus 4 NUMA nodes). When accessing remote socket's memory
> (either CXL or not), the bandwidth is limited by the interconnect's
> bandwidth.
>
> On this server, ideally weighted interleaving should be configured
> within a socket (e.g. local NUMA node + local CXL node) because
> weighted interleaving does not consider the bandwidth when accessed
> from a remote socket.
>
> So, the question is: On systems with multiple sockets (and CXL mem
> attached to each socket), do you always assume the admin must bind to
> a specific socket for optimal performance or is there any plan to
> mitigate this problem without binding tasks to a socket?
>
There was a long discussion about this when initially implementing the
weighted interleave mechanism.
The answer is basically that interleave/weighted-interleave is
suboptimal for this scenario for a few reasons.
1) The "effective bandwidth" of a given node is relative *to the task*
Imagine:
A----B
| |
C D
Task 1 on A has a different effective bandwidth from A->D than
Task 2 running on B. There's no good way for us to capture this
information in global weights because...
2) We initially explored implementing a matrix of weights (cpu-relative)
   This had little support - so it was simplified to a single array.
3) We also explored task-local weights to allow capturing this info.
This required new syscalls, and likewise had little support.
4) It's unclear how we can actually acquire cross-connect bandwidth
information anyway, and it's further unclear how this would be used
in an automated fashion to do "something reasonable" for the user.
5) The actual use cases for weighted-interleave on multi-socket systems
   were questionable due to the above - so we more or less discarded the
   idea as untenable at best (or at least in need of much more thought)
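To make point 1 concrete, here is a toy model of the A/B/C/D diagram above. All bandwidth numbers, node names, and the path table are made up purely for illustration; the point is only that the effective bandwidth to the same node differs by the task's CPU, so a single global weight array cannot describe both tasks:

```python
# Toy model of the topology in the diagram:
#
#   A----B
#   |    |
#   C    D
#
# All numbers are illustrative, not real hardware figures.
link_bw = {            # GB/s per hop (hypothetical)
    ("A", "B"): 100,   # cross-socket interconnect
    ("A", "C"): 200,   # A's local CXL link
    ("B", "D"): 200,   # B's local CXL link
}

# Hard-coded paths for this tiny topology (hypothetical helper).
paths = {
    "A": {"B": [("A", "B")], "C": [("A", "C")],
          "D": [("A", "B"), ("B", "D")]},
    "B": {"A": [("A", "B")], "D": [("B", "D")],
          "C": [("A", "B"), ("A", "C")]},
}

def effective_bw(cpu, mem):
    """Bandwidth along the path cpu -> mem, limited by the slowest hop."""
    if cpu == mem:
        return 400     # local DRAM (illustrative)
    return min(link_bw[hop] for hop in paths[cpu][mem])

# Task 1 on A and Task 2 on B see node D very differently:
print(effective_bw("A", "D"))  # 100 -- choked by the interconnect
print(effective_bw("B", "D"))  # 200 -- B's local CXL link
```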
So in short, yes, if the admin wants to make good use of (weighted)
interleave, they should bind to one socket and its attached CXL memory
only - otherwise the hidden chokepoint of the cross-socket interconnect
may bite them.
For now the best we can do is create global-relative weights, which
mathematically reduce according to bandwidth within a nodemask if the
task binds itself to a single socket.
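A minimal sketch of what "mathematically reduce within a nodemask" can mean, assuming hypothetical per-node bandwidths and a simple GCD reduction (the exact reduction in the patch may differ):

```python
from math import gcd
from functools import reduce

# Hypothetical per-node bandwidths in GB/s (illustrative numbers only):
# nodes 0/1 = local DRAM on sockets 0/1, nodes 2/3 = their CXL memory.
bandwidth = {0: 256, 1: 256, 2: 64, 3: 64}

def reduced_weights(nodemask):
    """Scale weights proportionally to bandwidth, then divide through
    by the GCD so the interleave period stays small."""
    bw = {n: bandwidth[n] for n in nodemask}
    g = reduce(gcd, bw.values())
    return {n: b // g for n, b in bw.items()}

# Global weights across all four nodes:
print(reduced_weights({0, 1, 2, 3}))   # {0: 4, 1: 4, 2: 1, 3: 1}
# Task bound to socket 0 (DRAM node 0 + CXL node 2): same ratio survives.
print(reduced_weights({0, 2}))         # {0: 4, 2: 1}
```

The global 4:4:1:1 ratio reduces cleanly to 4:1 once the task restricts itself to one socket's nodemask, which is why the global weights remain usable after binding.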
~Gregory
Thread overview: 6+ messages
2024-12-19 19:18 Joshua Hahn
2024-12-20 8:25 ` [External Mail] " Hyeonggon Yoo
2024-12-20 14:13 ` Gregory Price [this message]
2024-12-22 7:21 ` Huang, Ying
2024-12-22 17:03 ` Gregory Price
2024-12-24 23:48 ` Huang, Ying