From: Honggyu Kim <honggyu.kim@sk.com>
To: Gregory Price <gourry@gourry.net>
Cc: kernel_team@skhynix.com, Joshua Hahn <joshua.hahnjy@gmail.com>,
	harry.yoo@oracle.com, ying.huang@linux.alibaba.com,
	gregkh@linuxfoundation.org, rakie.kim@sk.com,
	akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org,
	dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
	dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, kernel-team@meta.com, yunjeong.mun@sk.com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
Date: Fri, 7 Mar 2025 20:46:46 +0900
Message-ID: <9c0d8aa8-cac7-4679-aece-af88e8129345@sk.com>
In-Reply-To: <Z8ncOp2H54WE4C5s@gourry-fedora-PF4VCD3F>



On 3/7/2025 2:32 AM, Gregory Price wrote:
> On Thu, Mar 06, 2025 at 09:39:26PM +0900, Honggyu Kim wrote:
>>
>> The memoryless nodes are printed as follows after those ACPI, SRAT,
>> Node N PXM M messages.
>>
>>    [    0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
>>    [    0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
>>    [    0.010992] Initmem setup node 2 as memoryless
>>    [    0.011055] Initmem setup node 3 as memoryless
>>    [    0.011115] Initmem setup node 4 as memoryless
>>    [    0.011177] Initmem setup node 5 as memoryless
>>    [    0.011238] Initmem setup node 6 as memoryless
>>    [    0.011299] Initmem setup node 7 as memoryless
>>    [    0.011361] Initmem setup node 8 as memoryless
>>    [    0.011422] Initmem setup node 9 as memoryless
>>    [    0.011484] Initmem setup node 10 as memoryless
>>    [    0.011544] Initmem setup node 11 as memoryless
>>
>> This is related to why 12 node knobs are provided in sysfs with the
>> current N_POSSIBLE loop.
>>
> 
> This isn't actually why, this is another symptom.  This gets printed
> because someone is marking nodes 4-11 as possible and setup_nr_node_ids
> reports 12 total nodes
> 
> void __init setup_nr_node_ids(void)
> {
>          unsigned int highest;
> 
>          highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
>          nr_node_ids = highest + 1;
> }
> 
> Given your configuration data so far, we may have a bug somewhere (or
> i'm missing a configuration piece).

There may be some misunderstanding about this issue.
This isn't a problem with NUMA detection for CXL memory; it is only a
problem with the number of "node" knobs created for weighted interleave.

'numactl -H' reports the correct number of nodes even without our fix.

   $ numactl -H
   available: 4 nodes (0-3)
   node 0 cpus: 0 1 2 3 ...
   node 0 size: 128504 MB
   node 0 free: 118563 MB
   node 1 cpus: 144 145 146 147 ...
   node 1 size: 257961 MB
   node 1 free: 242628 MB
   node 2 cpus:
   node 2 size: 393216 MB
   node 2 free: 393216 MB
   node 3 cpus:
   node 3 size: 524288 MB
   node 3 free: 524288 MB
   node distances:
   node     0    1    2    3
      0:   10   21   14   24
      1:   21   10   24   14
      2:   14   24   10   26
      3:   24   14   26   10

You can see more info below.

   $ cd /sys/devices/system/node

   $ ls -d node*
   node0  node1  node2  node3

   $ cat possible
   0-11

   $ cat online
   0-3

   $ cat has_memory
   0-3

   $ cat has_normal_memory
   0-1

   $ cat has_cpu
   0-1
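
The gap between the two masks above can be computed directly.  This is
only a rough sketch: expand_nodelist and nodes_missing_from are made-up
helper names, and it assumes the files contain comma-separated ranges
like the ones shown above.

```shell
# Expand a kernel node list such as "0-3,8" into one node id per line.
expand_nodelist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue          # skip empty lists
        i=$lo
        while [ "$i" -le "${hi:-$lo}" ]; do
            echo "$i"
            i=$((i + 1))
        done
    done
}

# Print, space separated, the nodes in list $1 that are absent from list $2.
nodes_missing_from() {
    out=""
    for n in $(expand_nodelist "$1"); do
        if ! expand_nodelist "$2" | grep -qx "$n"; then
            out="$out $n"
        fi
    done
    echo "${out# }"
}

# Usage on a live system:
#   cd /sys/devices/system/node
#   nodes_missing_from "$(cat possible)" "$(cat has_memory)"
```

With the values shown above ("0-11" and "0-3") this prints
"4 5 6 7 8 9 10 11", i.e. the possible nodes that have no memory.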

>>> Basically I need to know:
>>> 1) Is each CXL device on a dedicated Host Bridge?
>>> 2) Is inter-host-bridge interleaving configured?
>>> 3) Is intra-host-bridge interleaving configured?
>>> 4) Do SRAT entries exist for all nodes?
>>
>> Are there some simple commands that I can get those info?
>>
> 
> The content of the CEDT would be sufficient - that will show us the
> number of CXL host bridges.

Which command do we need for this info specifically?  My output doesn't
provide any useful info for that.

   $ acpidump -b
   $ iasl -d *
   $ cat cedt.dsl
       ...
   **** Unknown ACPI table signature [CEDT]
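
Since iasl here does not recognize the table, one fallback might be to
walk the raw dump by hand.  This is only a sketch resting on
assumptions that are not from this thread: that 'acpidump -b' left a
raw cedt.dat next to cedt.dsl, that the table starts with the standard
36-byte ACPI header, and that each CEDT structure begins with a 1-byte
type (0 = CHBS, one per CXL host bridge; 1 = CFMWS), a reserved byte
and a 2-byte little-endian record length.  cedt_count is a made-up
helper name.

```shell
# Count CHBS and CFMWS structures in a raw CEDT dump given as $1.
cedt_count() {
    # Dump every byte as an unsigned decimal, one value per line.
    od -An -v -tu1 "$1" | tr -s ' ' '\n' | awk '
        NF { b[n++] = $1 }
        END {
            i = 36                            # skip the ACPI table header
            while (i + 4 <= n) {
                type = b[i]
                len = b[i + 2] + 256 * b[i + 3]
                if (len < 4) break            # malformed record, stop
                count[type]++
                i += len
            }
            printf "CHBS=%d CFMWS=%d\n", count[0], count[1]
        }'
}

# Usage (file name is an assumption):
#   cedt_count cedt.dat
```

If the layout assumptions hold, the CHBS count would be the number of
CXL host bridges asked about above.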

> 
>>> 5) Why are there 12 nodes but only 10 sources? Are there additional
>>>      devices left out of your diagram? Are there 2 CFMWS and 8 Memory
>>>      Affinity records - resulting in 10 nodes? This is strange.
>>
>> My blind guess is that there could be a logic node that combines 4ch of
>> CXL memory so there are 5 nodes per each socket.  Adding 2 nodes for
>> local CPU/DRAM makes 12 nodes in total.
>>
> 
> The issue is that nodes have associated memory regions.  If there are
> multiple nodes with overlapping memory regions, that seems problematic.
> 
> If there are "possible nodes" without memory and no real use case
> (because the memory is associated with the aggregate node) then those
> nodes probably shouldn't be reported as possible.
> 
> the tl;dr here is we should figure out what is marking those nodes as
> possible.
> 
>> Not sure about this part but our approach with hotplug_memory_notifier()
>> resolves this problem.  Rakie will submit an initial working patchset
>> soonish.
> 
> This may just be a bandaid on the issue.  We should get our node
> configuration correct from the get-go.

I'm not sure about that.  This must be fixed ASAP because the current
kernel is broken on this issue, and the fix should go into the hotfix
tree first.

You may think this is just a bandaid, but leaving it bleeding as is is
not the right approach.

Our fix was posted a few hours ago.  Please have a look, then think
about the approach again.
https://lore.kernel.org/linux-mm/20250307063534.540-1-rakie.kim@sk.com

Thanks,
Honggyu

> 
> ~Gregory


