From: Yunjeong Mun <yunjeong.mun@sk.com>
To: Gregory Price <gourry@gourry.net>
Cc: kernel_team@skhynix.com, Joshua Hahn <joshua.hahnjy@gmail.com>,
harry.yoo@oracle.com, ying.huang@linux.alibaba.com,
gregkh@linuxfoundation.org, rakie.kim@sk.com,
akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org,
dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, kernel-team@meta.com,
Honggyu Kim <honggyu.kim@sk.com>
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
Date: Tue, 18 Mar 2025 17:02:38 +0900 [thread overview]
Message-ID: <20250318080246.1058-1-yunjeong.mun@sk.com> (raw)
In-Reply-To: <Z8-_SXm0JGjXTegL@gourry-fedora-PF4VCD3F>
Hi Gregory, I have one more question below.
On Tue, 11 Mar 2025 00:42:49 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:
>
> forenote - Hi Andrew, please hold off on the auto-configuration patch
> for now, the sk group has identified a hotplug issue we need to work out
> and we'll likely need to merge these two patch set together. I really
> appreciate your patience with this feature.
>
> > Hi Gregory,
> >
> > In my understanding, the reason we are seeing 12 NUMA node is because
> > it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
> > in the code [1] below:
> >
> ... snip ...
>
> Appreciated, so yes this confirms what i thought was going on. There's
> 4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
> socket that is intended to interleave across the host bridges.
>
Thanks for confirm. Honggyu represented it as a tree sturcture:
rootport/
├── socket0
│ ├── cross-host-bridge0 -> SRAT && CEDT (interleave on) --> NODE 2
│ │ ├── host-bridge0 -> CEDT
│ │ │ ├── cxl0 -> CEDT
│ │ │ └── cxl1-> CEDT
│ │ └── host-bridge1 -> CEDT
│ │ ├── cxl2 -> CEDT
│ │ └── cxl3 -> CEDT
│ └── dram0 -> SRAT ---------------------------------------> NODE 0
└── socket1
├── cross-host-bridge1 -> SRAT && CEDT (interleave on)---> NODE 3
│ ├── host-bridge2 -> CEDT
│ │ ├── cxl4 -> CEDT
│ │ └── cxl5 -> CEDT
│ └── host-bridge3 -> CEDT
│ ├── cxl6 -> CEDT
│ └── cxl7 -> CEDT
└── dram1 -> SRAT ---------------------------------------> NODE 1
> As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
> node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
> CFMWS that doesn't have a matching SRAT entry (with a known corner case
> for a missing SRAT which doesn't apply here).
>
> So essentialy what the system is doing is marking that it's absolutely
> possible to create 1 region per device and also 1 region that
> interleaves across host each pair of host bridges (I presume this is a
> dual socket system?).
>
> So, tl;dr: All these nodes are valid and this configuration is correct.
I am wondering if all 12 nodes specifed as 'possible' is indeed correct.
The definiton of 'possible' is:
- 'Nodes that could be possibly become online at some point'.
IMHO, it seems like there should only be 4 nodes specified as 'possible'.
>
> Weighted interleave presently works fine as intended, but with the
> inclusion of the auto-configuration, there will be issues for your
> system configuration. This means we probably need to consider
> merging these as a group.
>
> During boot, the following will occur
>
> 1) drivers/acpi/numa/srat.c marks 12 nodes as possible
> 0-1) Socket nodes
> 2-3) Cross-host-bridge interleave nodes
> 4-11) single region nodes
>
> 2) drivers/cxl/* will probe the various devices and create
> a root decoder for each CXL Fixed Memory Window
> decoder0.0 - decoder11.0 (or maybe decoder0.0 - decoder0.11)
>
> 3) during probe auto-configuration of wieghted interleave occurs as a
> result of this code being called with hmat or cdat data:
>
> void node_set_perf_attrs() {
> ...
> /* When setting CPU access coordinates, update mempolicy */
> if (access == ACCESS_COORDINATE_CPU) {
> if (mempolicy_set_node_perf(nid, coord)) {
> pr_info("failed to set mempolicy attrs for node %d\n",
> nid);
> }
> }
> ...
> }
>
> under the current system, since we calculate with N_POSSIBLE, all nodes
> will be assigned weights (assuming HMAT or CDAT data is available for
> all of them).
>
> We actually have a few issues here
>
> 1) If all nodes are included in the weighting reduction, we're actually
> over-representing a particular set of hardware. The interleave node
> and the individual device nodes would actually over-represent the
> bandwidth available (comparative to the CPU nodes).
>
> 2) As stated on this patch line, just switching to N_MEMORY causes
> issues with hotplug - where the bandwidth can be reported, but if
> memory hasn't been added yet then we'll end up with wrong weights
> because it wasn't included in the calculation.
>
> 3) However, not exposing the nodes because N_MEMORY isn't set yet
> a) prevents pre-configuration before memory is onlined, and
> b) hides the implications of hotplugging memory into a node from the
> user (adding memory causes a re-weight and may affect an
> interleave-all configuration).
>
> but - i think it's reasonable that anyone using weighted-interleave is
> *probably* not going to have nodes come and go. It just seems like a
> corner case that isn't reasonable to spend time supporting.
>
> So coming back around to the hotplug patch line, I do think it's
> reasonable hide nodes marked !N_MEMORY, but consider two issues:
>
> 1) In auto mode, we need to re-weight on hotplug to only include
> onlined nodes. This is because the reduction may be sensitive
> to the available bandwidth changes.
>
> This behavior needs to be clearly documented.
>
> 2) We need to clearly define what the weight of a node will be when
> in manual mode and a node goes (memory -> no memory -> memory)
> a) does it retain it's old, manually set weight?
> b) does it revert to 1?
>
> Sorry for the long email, just working through all the implications.
>
> I think the proposed hotplug patch is a requirement for the
> auto-configuration patch set.
>
> ~Gregory
>
Best regards,
Yunjeong
next prev parent reply other threads:[~2025-03-18 8:03 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20250228001631.1102-1-yunjeong.mun@sk.com>
2025-02-26 21:35 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
2025-02-26 21:35 ` [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes Joshua Hahn
2025-02-27 2:32 ` Honggyu Kim
2025-02-27 3:20 ` Honggyu Kim
2025-03-03 21:56 ` Joshua Hahn
2025-03-04 12:53 ` Honggyu Kim
2025-03-03 16:19 ` Gregory Price
2025-03-04 13:03 ` Honggyu Kim
2025-03-04 16:16 ` Gregory Price
2025-03-04 16:29 ` Gregory Price
2025-03-06 12:39 ` Honggyu Kim
2025-03-06 17:32 ` Gregory Price
2025-03-07 11:46 ` Honggyu Kim
2025-03-07 17:51 ` Gregory Price
2025-03-10 12:26 ` Honggyu Kim
2025-03-10 14:22 ` Gregory Price
2025-03-11 2:07 ` Yunjeong Mun
2025-03-11 2:42 ` Gregory Price
2025-03-11 4:02 ` Yunjeong Mun
2025-03-11 4:42 ` Gregory Price
2025-03-11 9:51 ` Yunjeong Mun
2025-03-11 15:52 ` Gregory Price
2025-03-18 8:02 ` Yunjeong Mun [this message]
2025-03-18 11:02 ` Honggyu Kim
2025-03-18 15:13 ` Gregory Price
2025-03-19 9:56 ` Yunjeong Mun
2025-03-19 14:54 ` Gregory Price
2025-02-28 0:16 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning yunjeong.mun
2025-02-28 6:39 ` Yunjeong Mun
2025-02-28 16:24 ` Joshua Hahn
2025-03-04 21:56 ` Joshua Hahn
2025-03-04 22:22 ` Joshua Hahn
2025-03-05 9:49 ` Yunjeong Mun
2025-03-05 16:28 ` Joshua Hahn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250318080246.1058-1-yunjeong.mun@sk.com \
--to=yunjeong.mun@sk.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=gourry@gourry.net \
--cc=gregkh@linuxfoundation.org \
--cc=hannes@cmpxchg.org \
--cc=harry.yoo@oracle.com \
--cc=honggyu.kim@sk.com \
--cc=horen.chuang@linux.dev \
--cc=joshua.hahnjy@gmail.com \
--cc=kernel-team@meta.com \
--cc=kernel_team@skhynix.com \
--cc=lenb@kernel.org \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=rafael@kernel.org \
--cc=rakie.kim@sk.com \
--cc=ying.huang@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox