From: Honggyu Kim <honggyu.kim@sk.com>
To: Gregory Price <gourry@gourry.net>
Cc: kernel_team@skhynix.com, Joshua Hahn <joshua.hahnjy@gmail.com>,
harry.yoo@oracle.com, ying.huang@linux.alibaba.com,
gregkh@linuxfoundation.org, rakie.kim@sk.com,
akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org,
dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, kernel-team@meta.com, yunjeong.mun@sk.com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
Date: Thu, 6 Mar 2025 21:39:26 +0900
Message-ID: <f64819e2-8dc6-4907-b8bf-faec66eecd0e@sk.com>
In-Reply-To: <Z8cqe3BCdobsV4-2@gourry-fedora-PF4VCD3F>
Hi Gregory,
On 3/5/2025 1:29 AM, Gregory Price wrote:
> On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
>> Actually, we're aware of this issue and currently trying to fix this.
>> In our system, we've attached 4ch of CXL memory for each socket as
>> follows.
>>
>>   node0             node1
>> +-------+   UPI   +-------+
>> | CPU 0 |-+-----+-| CPU 1 |
>> +-------+         +-------+
>> | DRAM0 |         | DRAM1 |
>> +---+---+         +---+---+
>>     |                 |
>> +---+---+         +---+---+
>> | CXL 0 |         | CXL 4 |
>> +---+---+         +---+---+
>> | CXL 1 |         | CXL 5 |
>> +---+---+         +---+---+
>> | CXL 2 |         | CXL 6 |
>> +---+---+         +---+---+
>> | CXL 3 |         | CXL 7 |
>> +---+---+         +---+---+
>>   node2             node3
>>
>> The 4ch of CXL memory are detected as a single NUMA node in each socket,
>> but it shows as follows with the current N_POSSIBLE loop.
>>
>> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>> node0 node1 node2 node3 node4 node5
>> node6 node7 node8 node9 node10 node11
>
> This is insufficient information for me to assess the correctness of the
> configuration. Can you please show the contents of your CEDT/CFMWS and
> SRAT/Memory Affinity structures?
>
> mkdir acpi_data && cd acpi_data
> acpidump -b
> iasl -d *
> cat cedt.dsl <- find all CFMWS entries
> cat srat.dsl <- find all Memory Affinity entries
I'm not able to provide all the details as srat.dsl has too much info.
$ wc -l srat.dsl
25229 srat.dsl
Instead, I can show you that there are 4 different proximity domains
with "Enabled : 1", as the following filtered output from srat.dsl shows.
$ grep -E "Proximity Domain :|Enabled : " srat.dsl | cut -c 31- | \
    sed 'N;s/\n//' | sort | uniq
Enabled : 0 Enabled : 0
Proximity Domain : 00000000 Enabled : 0
Proximity Domain : 00000000 Enabled : 1
Proximity Domain : 00000001 Enabled : 1
Proximity Domain : 00000006 Enabled : 1
Proximity Domain : 00000007 Enabled : 1
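For anyone who wants to repeat this check without the shell pipeline, here is a minimal sketch in Python. It assumes srat.dsl contains lines of the form "Proximity Domain : <hex>" followed later by "Enabled : <0|1>" for the same entry, as in the filtered output above. The function name and the sample text are hypothetical, and the sample only mirrors the filtered output rather than the full table.

```python
import re

# Assumed line shapes, taken from the filtered output above:
#   "... Proximity Domain : 00000006"
#   "... Enabled : 1"
PXM_RE = re.compile(r"Proximity Domain\s*:\s*([0-9A-Fa-f]+)")
ENABLED_RE = re.compile(r"Enabled\s*:\s*([01])")


def enabled_proximity_domains(dsl_text):
    """Return the sorted proximity domains that have an entry with Enabled : 1."""
    enabled = set()
    current = None  # proximity domain of the entry being read, if any
    for line in dsl_text.splitlines():
        m = PXM_RE.search(line)
        if m:
            # Assumption: the value is printed in hexadecimal.
            current = int(m.group(1), 16)
            continue
        m = ENABLED_RE.search(line)
        if m:
            if current is not None and m.group(1) == "1":
                enabled.add(current)
            current = None  # an Enabled flag closes the current entry
    return sorted(enabled)


# Hypothetical sample that mirrors the filtered output shown above.
SAMPLE = "\n".join([
    "Proximity Domain : 00000000", "Enabled : 0",
    "Proximity Domain : 00000000", "Enabled : 1",
    "Proximity Domain : 00000001", "Enabled : 1",
    "Proximity Domain : 00000006", "Enabled : 1",
    "Proximity Domain : 00000007", "Enabled : 1",
    "Enabled : 0",
])

print(enabled_proximity_domains(SAMPLE))
```

To run it against a real table, pass the contents of srat.dsl to enabled_proximity_domains() instead of SAMPLE. Unlike the sed 'N' pairing, this tracks the most recent proximity domain, so an "Enabled" line with no preceding domain is simply ignored.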
We don't actually have to use those complicated commands to check this
as dmesg clearly prints the SRAT and node numbers as follows.
[ 0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[ 0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
[ 0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
[ 0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
[ 0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
The memoryless nodes are printed as follows, right after those
"ACPI: SRAT: Node N PXM M" messages.
[ 0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
[ 0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
[ 0.010992] Initmem setup node 2 as memoryless
[ 0.011055] Initmem setup node 3 as memoryless
[ 0.011115] Initmem setup node 4 as memoryless
[ 0.011177] Initmem setup node 5 as memoryless
[ 0.011238] Initmem setup node 6 as memoryless
[ 0.011299] Initmem setup node 7 as memoryless
[ 0.011361] Initmem setup node 8 as memoryless
[ 0.011422] Initmem setup node 9 as memoryless
[ 0.011484] Initmem setup node 10 as memoryless
[ 0.011544] Initmem setup node 11 as memoryless
This is why sysfs knobs are created for all 12 nodes with the current
N_POSSIBLE loop.
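The same count can be read off the boot log mechanically. The following is a rough sketch in Python, assuming the "Initmem setup node N ..." message format shown above. The function name and the abbreviated sample log are hypothetical, and it only classifies what the log says at boot. It does not reflect memory that is hotplugged later.

```python
import re

# Matches both forms seen in the log above:
#   "Initmem setup node 0 [mem 0x...-0x...]"
#   "Initmem setup node 2 as memoryless"
INITMEM_RE = re.compile(r"Initmem setup node (\d+) (as memoryless|\[mem\b)")


def classify_nodes(dmesg_text):
    """Split node ids into (with_memory, memoryless) based on Initmem lines."""
    with_memory, memoryless = [], []
    for m in INITMEM_RE.finditer(dmesg_text):
        node = int(m.group(1))
        if m.group(2) == "as memoryless":
            memoryless.append(node)
        else:
            with_memory.append(node)
    return with_memory, memoryless


# Hypothetical sample built from the messages quoted above.
SAMPLE_LOG = "\n".join(
    [
        "[ 0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]",
        "[ 0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]",
    ]
    + ["[ 0.011000] Initmem setup node %d as memoryless" % n for n in range(2, 12)]
)

with_memory, memoryless = classify_nodes(SAMPLE_LOG)
print("with memory:", with_memory)
print("memoryless:", memoryless)
print("total:", len(with_memory) + len(memoryless))
```

On this sample, nodes 0 and 1 have memory at boot and nodes 2 through 11 are memoryless, which adds up to the 12 node directories listed under weighted_interleave.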
>
> Basically I need to know:
> 1) Is each CXL device on a dedicated Host Bridge?
> 2) Is inter-host-bridge interleaving configured?
> 3) Is intra-host-bridge interleaving configured?
> 4) Do SRAT entries exist for all nodes?
Are there some simple commands I can use to get this info?
> 5) Why are there 12 nodes but only 10 sources? Are there additional
> devices left out of your diagram? Are there 2 CFMWS and 8 Memory
> Affinity records - resulting in 10 nodes? This is strange.
My blind guess is that there could be a logical node that combines the
4ch of CXL memory, so there are 5 nodes per socket. Adding 2 nodes for
local CPU/DRAM makes 12 nodes in total.
>
> By default, Linux creates a node for each proximity domain ("PXM")
> detected in the SRAT Memory Affinity tables. If SRAT entries for a
> memory region described in a CFMWS are absent, it will also create a
> node for that CFMWS.
>
> Your reported configuration and results lead me to believe you have
> a combination of CFMWS/SRAT configurations that are unexpected.
>
> ~Gregory
I'm not sure about this part, but our approach with
hotplug_memory_notifier() resolves this problem. Rakie will submit an
initial working patchset soon.
Thanks,
Honggyu