linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Gregory Price <gourry@gourry.net>
To: Yunjeong Mun <yunjeong.mun@sk.com>
Cc: kernel_team@skhynix.com, Joshua Hahn <joshua.hahnjy@gmail.com>,
	harry.yoo@oracle.com, ying.huang@linux.alibaba.com,
	gregkh@linuxfoundation.org, rakie.kim@sk.com,
	akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org,
	dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
	dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, kernel-team@meta.com,
	Honggyu Kim <honggyu.kim@sk.com>
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
Date: Tue, 11 Mar 2025 00:42:49 -0400	[thread overview]
Message-ID: <Z8-_SXm0JGjXTegL@gourry-fedora-PF4VCD3F> (raw)
In-Reply-To: <20250311040252.425-1-yunjeong.mun@sk.com>

On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:

forenote - Hi Andrew, please hold off on the auto-configuration patch
for now, the sk group has identified a hotplug issue we need to work out
and we'll likely need to merge these two patch set together.  I really
appreciate your patience with this feature.

> Hi Gregory,
>
> In my understanding, the reason we are seeing 12 NUMA node is because
> it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
> in the code [1]  below:
> 
... snip ...

Appreciated, so yes this confirms what i thought was going on.  There's
4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
socket that is intended to interleave across the host bridges.

As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
CFMWS that doesn't have a matching SRAT entry (with a known corner case
for a missing SRAT which doesn't apply here).

So essentialy what the system is doing is marking that it's absolutely
possible to create 1 region per device and also 1 region that
interleaves across host each pair of host bridges (I presume this is a
dual socket system?).

So, tl;dr: All these nodes are valid and this configuration is correct.

Weighted interleave presently works fine as intended, but with the
inclusion of the auto-configuration, there will be issues for your
system configuration. This means we probably need to consider
merging these as a group.

During boot, the following will occur

1) drivers/acpi/numa/srat.c marks 12 nodes as possible
   0-1) Socket nodes
   2-3) Cross-host-bridge interleave nodes
   4-11) single region nodes

2) drivers/cxl/* will probe the various devices and create
   a root decoder for each CXL Fixed Memory Window
   decoder0.0 - decoder11.0  (or maybe decoder0.0 - decoder0.11)

3) during probe auto-configuration of wieghted interleave occurs as a
   result of this code being called with hmat or cdat data:

void node_set_perf_attrs() {
...
	/* When setting CPU access coordinates, update mempolicy */
	if (access == ACCESS_COORDINATE_CPU) {
		if (mempolicy_set_node_perf(nid, coord)) {
			pr_info("failed to set mempolicy attrs for node %d\n",
				nid);
		}
	}
...
}

under the current system, since we calculate with N_POSSIBLE, all nodes
will be assigned weights (assuming HMAT or CDAT data is available for
all of them).

We actually have a few issues here

1) If all nodes are included in the weighting reduction, we're actually
   over-representing a particular set of hardware.  The interleave node
   and the individual device nodes would actually over-represent the
   bandwidth available (comparative to the CPU nodes).

2) As stated on this patch line, just switching to N_MEMORY causes
   issues with hotplug - where the bandwidth can be reported, but if
   memory hasn't been added yet then we'll end up with wrong weights
   because it wasn't included in the calculation.

3) However, not exposing the nodes because N_MEMORY isn't set yet
   a) prevents pre-configuration before memory is onlined, and
   b) hides the implications of hotplugging memory into a node from the
      user (adding memory causes a re-weight and may affect an
      interleave-all configuration).

but - i think it's reasonable that anyone using weighted-interleave is
*probably* not going to have nodes come and go.  It just seems like a
corner case that isn't reasonable to spend time supporting.

So coming back around to the hotplug patch line, I do think it's
reasonable hide nodes marked !N_MEMORY, but consider two issues:

1) In auto mode, we need to re-weight on hotplug to only include
   onlined nodes.  This is because the reduction may be sensitive
   to the available bandwidth changes.

   This behavior needs to be clearly documented.

2) We need to clearly define what the weight of a node will be when
   in manual mode and a node goes (memory -> no memory -> memory)
   a) does it retain it's old, manually set weight?
   b) does it revert to 1?

Sorry for the long email, just working through all the implications.

I think the proposed hotplug patch is a requirement for the
auto-configuration patch set.

~Gregory


  reply	other threads:[~2025-03-11  4:42 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20250228001631.1102-1-yunjeong.mun@sk.com>
2025-02-26 21:35 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
2025-02-26 21:35   ` [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes Joshua Hahn
2025-02-27  2:32     ` Honggyu Kim
2025-02-27  3:20       ` Honggyu Kim
2025-03-03 21:56         ` Joshua Hahn
2025-03-04 12:53           ` Honggyu Kim
2025-03-03 16:19       ` Gregory Price
2025-03-04 13:03         ` Honggyu Kim
2025-03-04 16:16           ` Gregory Price
2025-03-04 16:29       ` Gregory Price
2025-03-06 12:39         ` Honggyu Kim
2025-03-06 17:32           ` Gregory Price
2025-03-07 11:46             ` Honggyu Kim
2025-03-07 17:51               ` Gregory Price
2025-03-10 12:26                 ` Honggyu Kim
2025-03-10 14:22                   ` Gregory Price
2025-03-11  2:07                     ` Yunjeong Mun
2025-03-11  2:42                       ` Gregory Price
2025-03-11  4:02                         ` Yunjeong Mun
2025-03-11  4:42                           ` Gregory Price [this message]
2025-03-11  9:51                             ` Yunjeong Mun
2025-03-11 15:52                               ` Gregory Price
2025-03-18  8:02                             ` Yunjeong Mun
2025-03-18 11:02                               ` Honggyu Kim
2025-03-18 15:13                                 ` Gregory Price
2025-03-19  9:56                                   ` Yunjeong Mun
2025-03-19 14:54                                     ` Gregory Price
2025-02-28  0:16   ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning yunjeong.mun
2025-02-28  6:39   ` Yunjeong Mun
2025-02-28 16:24     ` Joshua Hahn
2025-03-04 21:56     ` Joshua Hahn
2025-03-04 22:22       ` Joshua Hahn
2025-03-05  9:49         ` Yunjeong Mun
2025-03-05 16:28           ` Joshua Hahn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z8-_SXm0JGjXTegL@gourry-fedora-PF4VCD3F \
    --to=gourry@gourry.net \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=honggyu.kim@sk.com \
    --cc=horen.chuang@linux.dev \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=kernel_team@skhynix.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=rafael@kernel.org \
    --cc=rakie.kim@sk.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yunjeong.mun@sk.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox