From: Gregory Price <gourry@gourry.net>
To: Hyeonggon Yoo <hyeonggon.yoo@sk.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>,
	"gourry@gourry.net" <gourry@gourry.net>,
	kernel_team@skhynix.com, 42.hyeyoo@gmail.com,
	"rafael@kernel.org" <rafael@kernel.org>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	Honggyu Kim <honggyu.kim@sk.com>,
	"ying.huang@linux.alibaba.com" <ying.huang@linux.alibaba.com>,
	Rakie Kim <rakie.kim@sk.com>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"Jonathan.Cameron@huawei.com" <Jonathan.Cameron@huawei.com>,
	"dave.jiang@intel.com" <dave.jiang@intel.com>,
	"horen.chuang@linux.dev" <horen.chuang@linux.dev>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"kernel-team@meta.com" <kernel-team@meta.com>
Subject: Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning
Date: Fri, 20 Dec 2024 09:13:50 -0500
Message-ID: <Z2V7ngtLXH99LqLe@gourry-fedora-PF4VCD3F>
In-Reply-To: <3682b9cf-213c-497d-ab81-f70e1a785716@sk.com>

On Fri, Dec 20, 2024 at 05:25:28PM +0900, Hyeonggon Yoo wrote:
> On 2024-12-20 4:18 AM, Joshua Hahn wrote:
... snip ...
> 
> By the way, this might be out of scope, but let me ask for my own
> learning.
> 
> We have a server with 2 sockets, each attached with local DRAM and CXL
> memory (and thus 4 NUMA nodes). When accessing remote socket's memory
> (either CXL or not), the bandwidth is limited by the interconnect's
> bandwidth.
> 
> On this server, ideally weighted interleaving should be configured
> within a socket (e.g. local NUMA node + local CXL node) because
> weighted interleaving does not consider the bandwidth when accessed
> from a remote socket.
> 
> So, the question is: On systems with multiple sockets (and CXL mem
> attached to each socket), do you always assume the admin must bind to
> a specific socket for optimal performance or is there any plan to
> mitigate this problem without binding tasks to a socket?
>

There was a long discussion about this when initially implementing the
weighted interleave mechanism.

The answer is basically that interleave/weighted-interleave is
suboptimal for this scenario for a few reasons.

1) The "effective bandwidth" of a given node is relative *to the task*

   Imagine:
          A----B
          |    |
          C    D

   Task 1 on A sees a different effective bandwidth to node D than
   Task 2 running on B does.  There's no good way for us to capture
   this information in global weights because...

2) We initially explored implementing a matrix of weights (cpu-relative).
   This had little support - so it was simplified to a single array
   (a rough sketch of both designs follows this list).

3) We also explored task-local weights to allow capturing this info. 
   This required new syscalls, and likewise had little support.

4) It's unclear how we can actually acquire cross-connect bandwidth
   information anyway, and it's further unclear how this would be used
   in an automated fashion to do "something reasonable" for the user.

5) The actual use cases for weighted-interleave on multi-socket systems
   were questionable due to the above - so we more or less discarded the
   idea as untenable at best (or at least in need of much more thought).
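
To make (1) and (2) concrete, here's a rough, purely hypothetical
sketch of the two designs - the single global array that was merged
versus the cpu-relative matrix that was dropped.  Node layout and
weight values are made up for illustration; this is not kernel code:

    /* Nodes 0/1: DRAM on sockets A/B; nodes 2/3: their CXL memory. */

    /* Merged design: one weight per node, regardless of where the
     * task runs.  It cannot express that node 1 looks slower from
     * socket A than it does from socket B. */
    static unsigned char iw_table[4] = { 4, 4, 1, 1 };

    /* Dropped design: one row per source socket, so a node's weight
     * depends on which socket is doing the accessing. */
    static unsigned char iw_matrix[2][4] = {
        { 4, 2, 1, 1 },  /* as seen from socket A */
        { 2, 4, 1, 1 },  /* as seen from socket B */
    };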

So in short, yes: if the admin wants to make good use of (weighted)
interleave, they should bind the task to one socket and its attached
CXL memory only - otherwise the hidden chokepoint of the cross-socket
interconnect may bite them.
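
As a minimal sketch of that binding - assuming a v6.9+ kernel with
MPOL_WEIGHTED_INTERLEAVE, and assuming socket 0's DRAM is node 0 and
its CXL is node 2 (node numbers vary by platform):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Not in all libc headers yet; value from <linux/mempolicy.h>. */
    #ifndef MPOL_WEIGHTED_INTERLEAVE
    #define MPOL_WEIGHTED_INTERLEAVE 6
    #endif

    int main(void)
    {
        /* socket 0's DRAM (node 0) plus its CXL (node 2) only */
        unsigned long nodemask = (1UL << 0) | (1UL << 2);

        if (syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
                    &nodemask, 8 * sizeof(nodemask)))
            perror("set_mempolicy");

        /* ...and also pin CPUs to socket 0, e.g. via
         * sched_setaffinity() or numactl --cpunodebind=0 */
        return 0;
    }

I believe recent numactl (2.0.18+) also grew a -w/--weighted-interleave
option that wraps the same policy from the command line.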

For now the best we can do is generate global weights which, when a
task binds itself to a single socket, mathematically reduce to the
bandwidth ratios within that nodemask.
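
For example, with illustrative numbers - say each socket's DRAM does
~200 GB/s and each CXL node ~50 GB/s - the auto-tuned global weights
(exposed via /sys/kernel/mm/mempolicy/weighted_interleave/nodeN)
would come out to something like:

    node0 (DRAM, socket 0): 4
    node1 (DRAM, socket 1): 4
    node2 (CXL,  socket 0): 1
    node3 (CXL,  socket 1): 1

A task bound to nodes {0,2} then effectively interleaves 4:1 across
its local DRAM and CXL - the local bandwidth ratio - and the remote
nodes' weights simply drop out.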

~Gregory

