From: Gregory Price <gourry@gourry.net>
To: Yunjeong Mun <yunjeong.mun@sk.com>
Cc: kernel_team@skhynix.com, Joshua Hahn <joshua.hahnjy@gmail.com>,
harry.yoo@oracle.com, ying.huang@linux.alibaba.com,
gregkh@linuxfoundation.org, rakie.kim@sk.com,
akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org,
dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, kernel-team@meta.com,
Honggyu Kim <honggyu.kim@sk.com>
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
Date: Tue, 11 Mar 2025 00:42:49 -0400 [thread overview]
Message-ID: <Z8-_SXm0JGjXTegL@gourry-fedora-PF4VCD3F> (raw)
In-Reply-To: <20250311040252.425-1-yunjeong.mun@sk.com>
On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:
forenote - Hi Andrew, please hold off on the auto-configuration patch
for now, the sk group has identified a hotplug issue we need to work out
and we'll likely need to merge these two patch set together. I really
appreciate your patience with this feature.
> Hi Gregory,
>
> In my understanding, the reason we are seeing 12 NUMA node is because
> it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
> in the code [1] below:
>
... snip ...
Appreciated, so yes this confirms what i thought was going on. There's
4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
socket that is intended to interleave across the host bridges.
As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
CFMWS that doesn't have a matching SRAT entry (with a known corner case
for a missing SRAT which doesn't apply here).
So essentialy what the system is doing is marking that it's absolutely
possible to create 1 region per device and also 1 region that
interleaves across host each pair of host bridges (I presume this is a
dual socket system?).
So, tl;dr: All these nodes are valid and this configuration is correct.
Weighted interleave presently works fine as intended, but with the
inclusion of the auto-configuration, there will be issues for your
system configuration. This means we probably need to consider
merging these as a group.
During boot, the following will occur
1) drivers/acpi/numa/srat.c marks 12 nodes as possible
0-1) Socket nodes
2-3) Cross-host-bridge interleave nodes
4-11) single region nodes
2) drivers/cxl/* will probe the various devices and create
a root decoder for each CXL Fixed Memory Window
decoder0.0 - decoder11.0 (or maybe decoder0.0 - decoder0.11)
3) during probe auto-configuration of wieghted interleave occurs as a
result of this code being called with hmat or cdat data:
void node_set_perf_attrs() {
...
/* When setting CPU access coordinates, update mempolicy */
if (access == ACCESS_COORDINATE_CPU) {
if (mempolicy_set_node_perf(nid, coord)) {
pr_info("failed to set mempolicy attrs for node %d\n",
nid);
}
}
...
}
under the current system, since we calculate with N_POSSIBLE, all nodes
will be assigned weights (assuming HMAT or CDAT data is available for
all of them).
We actually have a few issues here
1) If all nodes are included in the weighting reduction, we're actually
over-representing a particular set of hardware. The interleave node
and the individual device nodes would actually over-represent the
bandwidth available (comparative to the CPU nodes).
2) As stated on this patch line, just switching to N_MEMORY causes
issues with hotplug - where the bandwidth can be reported, but if
memory hasn't been added yet then we'll end up with wrong weights
because it wasn't included in the calculation.
3) However, not exposing the nodes because N_MEMORY isn't set yet
a) prevents pre-configuration before memory is onlined, and
b) hides the implications of hotplugging memory into a node from the
user (adding memory causes a re-weight and may affect an
interleave-all configuration).
but - i think it's reasonable that anyone using weighted-interleave is
*probably* not going to have nodes come and go. It just seems like a
corner case that isn't reasonable to spend time supporting.
So coming back around to the hotplug patch line, I do think it's
reasonable hide nodes marked !N_MEMORY, but consider two issues:
1) In auto mode, we need to re-weight on hotplug to only include
onlined nodes. This is because the reduction may be sensitive
to the available bandwidth changes.
This behavior needs to be clearly documented.
2) We need to clearly define what the weight of a node will be when
in manual mode and a node goes (memory -> no memory -> memory)
a) does it retain it's old, manually set weight?
b) does it revert to 1?
Sorry for the long email, just working through all the implications.
I think the proposed hotplug patch is a requirement for the
auto-configuration patch set.
~Gregory
next prev parent reply other threads:[~2025-03-11 4:42 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20250228001631.1102-1-yunjeong.mun@sk.com>
2025-02-26 21:35 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
2025-02-26 21:35 ` [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes Joshua Hahn
2025-02-27 2:32 ` Honggyu Kim
2025-02-27 3:20 ` Honggyu Kim
2025-03-03 21:56 ` Joshua Hahn
2025-03-04 12:53 ` Honggyu Kim
2025-03-03 16:19 ` Gregory Price
2025-03-04 13:03 ` Honggyu Kim
2025-03-04 16:16 ` Gregory Price
2025-03-04 16:29 ` Gregory Price
2025-03-06 12:39 ` Honggyu Kim
2025-03-06 17:32 ` Gregory Price
2025-03-07 11:46 ` Honggyu Kim
2025-03-07 17:51 ` Gregory Price
2025-03-10 12:26 ` Honggyu Kim
2025-03-10 14:22 ` Gregory Price
2025-03-11 2:07 ` Yunjeong Mun
2025-03-11 2:42 ` Gregory Price
2025-03-11 4:02 ` Yunjeong Mun
2025-03-11 4:42 ` Gregory Price [this message]
2025-03-11 9:51 ` Yunjeong Mun
2025-03-11 15:52 ` Gregory Price
2025-03-18 8:02 ` Yunjeong Mun
2025-03-18 11:02 ` Honggyu Kim
2025-03-18 15:13 ` Gregory Price
2025-03-19 9:56 ` Yunjeong Mun
2025-03-19 14:54 ` Gregory Price
2025-02-28 0:16 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning yunjeong.mun
2025-02-28 6:39 ` Yunjeong Mun
2025-02-28 16:24 ` Joshua Hahn
2025-03-04 21:56 ` Joshua Hahn
2025-03-04 22:22 ` Joshua Hahn
2025-03-05 9:49 ` Yunjeong Mun
2025-03-05 16:28 ` Joshua Hahn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z8-_SXm0JGjXTegL@gourry-fedora-PF4VCD3F \
--to=gourry@gourry.net \
--cc=Jonathan.Cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=gregkh@linuxfoundation.org \
--cc=hannes@cmpxchg.org \
--cc=harry.yoo@oracle.com \
--cc=honggyu.kim@sk.com \
--cc=horen.chuang@linux.dev \
--cc=joshua.hahnjy@gmail.com \
--cc=kernel-team@meta.com \
--cc=kernel_team@skhynix.com \
--cc=lenb@kernel.org \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=rafael@kernel.org \
--cc=rakie.kim@sk.com \
--cc=ying.huang@linux.alibaba.com \
--cc=yunjeong.mun@sk.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox