* [PATCH v2 0/1] Identify the accurate NUMA ID of CFMW
@ 2026-01-06  3:10 Cui Chao
  2026-01-06  3:10 ` [PATCH v2 1/1] mm: numa_memblks: " Cui Chao
  0 siblings, 1 reply; 25+ messages in thread
From: Cui Chao @ 2026-01-06  3:10 UTC (permalink / raw)
  To: Andrew Morton, Jonathan Cameron, Mike Rapoport
  Cc: Wang Yinfeng, linux-cxl, linux-kernel, linux-mm

Changes in v2:
- Added an example memory layout to changelog.
- Added linux-cxl@vger.kernel.org to CC list.
- Assigned the result of meminfo_to_nid(&numa_reserved_meminfo, start)
  to a local variable.

Cui Chao (1):
  mm: numa_memblks: Identify the accurate NUMA ID of CFMW

 mm/numa_memblks.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

-- 
2.33.0



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-06  3:10 [PATCH v2 0/1] Identify the accurate NUMA ID of CFMW Cui Chao
@ 2026-01-06  3:10 ` Cui Chao
  2026-01-08 16:19   ` Jonathan Cameron
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Cui Chao @ 2026-01-06  3:10 UTC (permalink / raw)
  To: Andrew Morton, Jonathan Cameron, Mike Rapoport
  Cc: Wang Yinfeng, linux-cxl, linux-kernel, linux-mm

In some physical memory layout designs, the address space of CFMW
resides between multiple segments of system memory belonging to
the same NUMA node. In numa_cleanup_meminfo, these multiple segments
of system memory are merged into a larger numa_memblk. When
identifying which NUMA node the CFMW belongs to, it may be incorrectly
assigned to the NUMA node of the merged system memory.

Example memory layout:

Physical address space:
    0x00000000 - 0x1FFFFFFF  System RAM (node0)
    0x20000000 - 0x2FFFFFFF  CXL CFMW (node2)
    0x40000000 - 0x5FFFFFFF  System RAM (node0)
    0x60000000 - 0x7FFFFFFF  System RAM (node1)

After numa_cleanup_meminfo, the two node0 segments are merged into one:
    0x00000000 - 0x5FFFFFFF  System RAM (node0) // CFMW is inside the range
    0x60000000 - 0x7FFFFFFF  System RAM (node1)

So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.

To address this scenario, accurately identifying the correct NUMA node
can be achieved by checking whether the region belongs to both
numa_meminfo and numa_reserved_meminfo.

Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>
---
 mm/numa_memblks.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index 5b009a9cd8b4..e91908ed8661 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -568,15 +568,16 @@ static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
 int phys_to_target_node(u64 start)
 {
 	int nid = meminfo_to_nid(&numa_meminfo, start);
+	int reserved_nid = meminfo_to_nid(&numa_reserved_meminfo, start);
 
 	/*
 	 * Prefer online nodes, but if reserved memory might be
 	 * hot-added continue the search with reserved ranges.
 	 */
-	if (nid != NUMA_NO_NODE)
+	if (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE)
 		return nid;
 
-	return meminfo_to_nid(&numa_reserved_meminfo, start);
+	return reserved_nid;
 }
 EXPORT_SYMBOL_GPL(phys_to_target_node);
 
-- 
2.33.0



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-06  3:10 ` [PATCH v2 1/1] mm: numa_memblks: " Cui Chao
@ 2026-01-08 16:19   ` Jonathan Cameron
  2026-01-08 17:48   ` Andrew Morton
  2026-01-09  9:35   ` Pratyush Brahma
  2 siblings, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2026-01-08 16:19 UTC (permalink / raw)
  To: Cui Chao
  Cc: Andrew Morton, Mike Rapoport, Wang Yinfeng, linux-cxl,
	linux-kernel, linux-mm

On Tue, 6 Jan 2026 11:10:42 +0800
Cui Chao <cuichao1753@phytium.com.cn> wrote:

> In some physical memory layout designs, the address space of CFMW
> resides between multiple segments of system memory belonging to
> the same NUMA node. In numa_cleanup_meminfo, these multiple segments
> of system memory are merged into a larger numa_memblk. When
> identifying which NUMA node the CFMW belongs to, it may be incorrectly
> assigned to the NUMA node of the merged system memory.
> 
> Example memory layout:
> 
> Physical address space:
>     0x00000000 - 0x1FFFFFFF  System RAM (node0)
>     0x20000000 - 0x2FFFFFFF  CXL CFMW (node2)
>     0x40000000 - 0x5FFFFFFF  System RAM (node0)
>     0x60000000 - 0x7FFFFFFF  System RAM (node1)
> 
> After numa_cleanup_meminfo, the two node0 segments are merged into one:
>     0x00000000 - 0x5FFFFFFF  System RAM (node0) // CFMW is inside the range
>     0x60000000 - 0x7FFFFFFF  System RAM (node1)
> 
> So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.
> 
> To address this scenario, accurately identifying the correct NUMA node
> can be achieved by checking whether the region belongs to both
> numa_meminfo and numa_reserved_meminfo.
> 
> Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  mm/numa_memblks.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> index 5b009a9cd8b4..e91908ed8661 100644
> --- a/mm/numa_memblks.c
> +++ b/mm/numa_memblks.c
> @@ -568,15 +568,16 @@ static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
>  int phys_to_target_node(u64 start)
>  {
>  	int nid = meminfo_to_nid(&numa_meminfo, start);
> +	int reserved_nid = meminfo_to_nid(&numa_reserved_meminfo, start);
>  
>  	/*
>  	 * Prefer online nodes, but if reserved memory might be
>  	 * hot-added continue the search with reserved ranges.
>  	 */
> -	if (nid != NUMA_NO_NODE)
> +	if (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE)
>  		return nid;
>  
> -	return meminfo_to_nid(&numa_reserved_meminfo, start);
> +	return reserved_nid;
>  }
>  EXPORT_SYMBOL_GPL(phys_to_target_node);
>  



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-06  3:10 ` [PATCH v2 1/1] mm: numa_memblks: " Cui Chao
  2026-01-08 16:19   ` Jonathan Cameron
@ 2026-01-08 17:48   ` Andrew Morton
  2026-01-15  9:43     ` Cui Chao
  2026-01-09  9:35   ` Pratyush Brahma
  2 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2026-01-08 17:48 UTC (permalink / raw)
  To: Cui Chao
  Cc: Jonathan Cameron, Mike Rapoport, Wang Yinfeng, linux-cxl,
	linux-kernel, linux-mm

On Tue,  6 Jan 2026 11:10:42 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:

> In some physical memory layout designs, the address space of CFMW
> resides between multiple segments of system memory belonging to
> the same NUMA node. In numa_cleanup_meminfo, these multiple segments
> of system memory are merged into a larger numa_memblk. When
> identifying which NUMA node the CFMW belongs to, it may be incorrectly
> assigned to the NUMA node of the merged system memory.
> 
> Example memory layout:
> 
> Physical address space:
>     0x00000000 - 0x1FFFFFFF  System RAM (node0)
>     0x20000000 - 0x2FFFFFFF  CXL CFMW (node2)
>     0x40000000 - 0x5FFFFFFF  System RAM (node0)
>     0x60000000 - 0x7FFFFFFF  System RAM (node1)
> 
> After numa_cleanup_meminfo, the two node0 segments are merged into one:
>     0x00000000 - 0x5FFFFFFF  System RAM (node0) // CFMW is inside the range
>     0x60000000 - 0x7FFFFFFF  System RAM (node1)
> 
> So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.
> 
> To address this scenario, accurately identifying the correct NUMA node
> can be achieved by checking whether the region belongs to both
> numa_meminfo and numa_reserved_meminfo.

Thanks.

Can you please help us understand the userspace-visible runtime effects
of this incorrect assignment?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-06  3:10 ` [PATCH v2 1/1] mm: numa_memblks: " Cui Chao
  2026-01-08 16:19   ` Jonathan Cameron
  2026-01-08 17:48   ` Andrew Morton
@ 2026-01-09  9:35   ` Pratyush Brahma
  2026-01-15 10:06     ` Cui Chao
  2 siblings, 1 reply; 25+ messages in thread
From: Pratyush Brahma @ 2026-01-09  9:35 UTC (permalink / raw)
  To: Cui Chao
  Cc: Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, Andrew Morton,
	Jonathan Cameron, Mike Rapoport

On Fri, Jan 9, 2026 at 12:44 PM Cui Chao <cuichao1753@phytium.com.cn> wrote:
>
> In some physical memory layout designs, the address space of CFMW
> resides between multiple segments of system memory belonging to
> the same NUMA node. In numa_cleanup_meminfo, these multiple segments
> of system memory are merged into a larger numa_memblk. When
> identifying which NUMA node the CFMW belongs to, it may be incorrectly
> assigned to the NUMA node of the merged system memory.
>
> Example memory layout:
>
> Physical address space:
>     0x00000000 - 0x1FFFFFFF  System RAM (node0)
>     0x20000000 - 0x2FFFFFFF  CXL CFMW (node2)
>     0x40000000 - 0x5FFFFFFF  System RAM (node0)
>     0x60000000 - 0x7FFFFFFF  System RAM (node1)
>
> After numa_cleanup_meminfo, the two node0 segments are merged into one:
>     0x00000000 - 0x5FFFFFFF  System RAM (node0) // CFMW is inside the range
>     0x60000000 - 0x7FFFFFFF  System RAM (node1)
>
> So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.
>
> To address this scenario, accurately identifying the correct NUMA node
> can be achieved by checking whether the region belongs to both
> numa_meminfo and numa_reserved_meminfo.
>
> Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> ---
>  mm/numa_memblks.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> index 5b009a9cd8b4..e91908ed8661 100644
> --- a/mm/numa_memblks.c
> +++ b/mm/numa_memblks.c
> @@ -568,15 +568,16 @@ static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
>  int phys_to_target_node(u64 start)
>  {
>         int nid = meminfo_to_nid(&numa_meminfo, start);
> +       int reserved_nid = meminfo_to_nid(&numa_reserved_meminfo, start);
>
>         /*
>          * Prefer online nodes, but if reserved memory might be
>          * hot-added continue the search with reserved ranges.
It would be good to change this comment as well. With the new logic
you’re not just "continuing the search", you’re explicitly preferring
reserved on overlap.
Probably something like "Prefer numa_meminfo unless the address is
also described by reserved ranges, in which case use the reserved
nid."
>          */
> -       if (nid != NUMA_NO_NODE)
> +       if (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE)
>                 return nid;
>
> -       return meminfo_to_nid(&numa_reserved_meminfo, start);
> +       return reserved_nid;
>  }
>  EXPORT_SYMBOL_GPL(phys_to_target_node);
>
> --
> 2.33.0
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-08 17:48   ` Andrew Morton
@ 2026-01-15  9:43     ` Cui Chao
  2026-01-15 18:18       ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Cui Chao @ 2026-01-15  9:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Cameron, Mike Rapoport, Wang Yinfeng, linux-cxl,
	linux-kernel, linux-mm

When a CXL RAM region is created in userspace, the memory capacity of 
the newly created region is not added to the CFMW-dedicated NUMA node. 
Instead, it is accumulated into an existing NUMA node (e.g., NUMA0 
containing RAM). This makes it impossible to clearly distinguish between 
the two types of memory, which may affect memory-tiering applications.
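
To make the effect easy to see outside the kernel, here is a small
user-space sketch that mirrors the meminfo_to_nid()/phys_to_target_node()
logic on the example layout from the changelog (the tables below are
illustrative stand-ins, not the kernel's actual data structures):

#include <stdint.h>
#include <stdio.h>

#define NUMA_NO_NODE	(-1)

struct blk { uint64_t start, end; int nid; };

/* numa_meminfo after numa_cleanup_meminfo merged the two node0 ranges */
static const struct blk meminfo[] = {
	{ 0x00000000, 0x60000000, 0 },
	{ 0x60000000, 0x80000000, 1 },
};

/* numa_reserved_meminfo still carries the CFMW as its own range */
static const struct blk reserved[] = {
	{ 0x20000000, 0x30000000, 2 },
};

static int to_nid(const struct blk *mi, int cnt, uint64_t addr)
{
	for (int i = 0; i < cnt; i++)
		if (mi[i].start <= addr && addr < mi[i].end)
			return mi[i].nid;
	return NUMA_NO_NODE;
}

static int target_node(uint64_t addr)
{
	int nid = to_nid(meminfo, 2, addr);
	int reserved_nid = to_nid(reserved, 1, addr);

	/*
	 * Before the patch this was "if (nid != NUMA_NO_NODE) return nid;",
	 * which returns node 0 for the CFMW address below.
	 */
	if (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE)
		return nid;

	return reserved_nid;	/* node 2 for the CFMW address */
}

int main(void)
{
	printf("0x20000000 -> node %d\n", target_node(0x20000000));
	return 0;
}

With the old check the lookup for 0x20000000 lands on node 0; with the
patched check it returns node 2, so the hot-added CXL capacity ends up
on its own node.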

On 1/9/2026 1:48 AM, Andrew Morton wrote:
> On Tue,  6 Jan 2026 11:10:42 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:
>
>> In some physical memory layout designs, the address space of CFMW
>> resides between multiple segments of system memory belonging to
>> the same NUMA node. In numa_cleanup_meminfo, these multiple segments
>> of system memory are merged into a larger numa_memblk. When
>> identifying which NUMA node the CFMW belongs to, it may be incorrectly
>> assigned to the NUMA node of the merged system memory.
>>
>> Example memory layout:
>>
>> Physical address space:
>>      0x00000000 - 0x1FFFFFFF  System RAM (node0)
>>      0x20000000 - 0x2FFFFFFF  CXL CFMW (node2)
>>      0x40000000 - 0x5FFFFFFF  System RAM (node0)
>>      0x60000000 - 0x7FFFFFFF  System RAM (node1)
>>
>> After numa_cleanup_meminfo, the two node0 segments are merged into one:
>>      0x00000000 - 0x5FFFFFFF  System RAM (node0) // CFMW is inside the range
>>      0x60000000 - 0x7FFFFFFF  System RAM (node1)
>>
>> So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.
>>
>> To address this scenario, accurately identifying the correct NUMA node
>> can be achieved by checking whether the region belongs to both
>> numa_meminfo and numa_reserved_meminfo.
> Thanks.
>
> Can you please help us understand the userspace-visible runtime effects
> of this incorrect assignment?

-- 
Best regards,
Cui Chao.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-09  9:35   ` Pratyush Brahma
@ 2026-01-15 10:06     ` Cui Chao
  0 siblings, 0 replies; 25+ messages in thread
From: Cui Chao @ 2026-01-15 10:06 UTC (permalink / raw)
  To: Pratyush Brahma
  Cc: Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, Andrew Morton,
	Jonathan Cameron, Mike Rapoport


On 1/9/2026 5:35 PM, Pratyush Brahma wrote:
> On Fri, Jan 9, 2026 at 12:44 PM Cui Chao <cuichao1753@phytium.com.cn> wrote:
>> In some physical memory layout designs, the address space of CFMW
>> resides between multiple segments of system memory belonging to
>> the same NUMA node. In numa_cleanup_meminfo, these multiple segments
>> of system memory are merged into a larger numa_memblk. When
>> identifying which NUMA node the CFMW belongs to, it may be incorrectly
>> assigned to the NUMA node of the merged system memory.
>>
>> Example memory layout:
>>
>> Physical address space:
>>      0x00000000 - 0x1FFFFFFF  System RAM (node0)
>>      0x20000000 - 0x2FFFFFFF  CXL CFMW (node2)
>>      0x40000000 - 0x5FFFFFFF  System RAM (node0)
>>      0x60000000 - 0x7FFFFFFF  System RAM (node1)
>>
>> After numa_cleanup_meminfo, the two node0 segments are merged into one:
>>      0x00000000 - 0x5FFFFFFF  System RAM (node0) // CFMW is inside the range
>>      0x60000000 - 0x7FFFFFFF  System RAM (node1)
>>
>> So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.
>>
>> To address this scenario, accurately identifying the correct NUMA node
>> can be achieved by checking whether the region belongs to both
>> numa_meminfo and numa_reserved_meminfo.
>>
>> Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>
>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>> ---
>>   mm/numa_memblks.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
>> index 5b009a9cd8b4..e91908ed8661 100644
>> --- a/mm/numa_memblks.c
>> +++ b/mm/numa_memblks.c
>> @@ -568,15 +568,16 @@ static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
>>   int phys_to_target_node(u64 start)
>>   {
>>          int nid = meminfo_to_nid(&numa_meminfo, start);
>> +       int reserved_nid = meminfo_to_nid(&numa_reserved_meminfo, start);
>>
>>          /*
>>           * Prefer online nodes, but if reserved memory might be
>>           * hot-added continue the search with reserved ranges.
> It would be good to change this comment as well. With the new logic
> you’re not just "continuing the search", you’re explicitly preferring
> reserved on overlap.
> Probably something like "Prefer numa_meminfo unless the address is
> also described by reserved ranges, in which case use the reserved
> nid."

Thanks.

I will revise the next version according to your suggestion.

>>           */
>> -       if (nid != NUMA_NO_NODE)
>> +       if (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE)
>>                  return nid;
>>
>> -       return meminfo_to_nid(&numa_reserved_meminfo, start);
>> +       return reserved_nid;
>>   }
>>   EXPORT_SYMBOL_GPL(phys_to_target_node);
>>
>> --
>> 2.33.0
>>
-- 
Best regards,
Cui Chao.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-15  9:43     ` Cui Chao
@ 2026-01-15 18:18       ` Andrew Morton
  2026-01-15 19:50         ` dan.j.williams
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2026-01-15 18:18 UTC (permalink / raw)
  To: Cui Chao
  Cc: Jonathan Cameron, Mike Rapoport, Wang Yinfeng, linux-cxl,
	linux-kernel, linux-mm

On Thu, 15 Jan 2026 17:43:02 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:

> When a CXL RAM region is created in userspace, the memory capacity of 
> the newly created region is not added to the CFMW-dedicated NUMA node. 
> Instead, it is accumulated into an existing NUMA node (e.g., NUMA0 
> containing RAM). This makes it impossible to clearly distinguish between 
> the two types of memory, which may affect memory-tiering applications.
> 

OK, thanks, I added this to the changelog.  Please retain it when
sending v3.

What I'm actually looking for here are answers to the questions

  Should we backport this into -stable kernels and if so, why?
  And if not, why not?

So a very complete description of the runtime effects really helps
myself and others to decide which kernels to patch.  And it helps
people to understand *why* we made that decision.

And sorry, but "may affect memory-tiering applications" isn't very
complete!

So please, tell us how much our users are hurting from this and please
make a recommendation on the backporting decision.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-15 18:18       ` Andrew Morton
@ 2026-01-15 19:50         ` dan.j.williams
  2026-01-22  8:03           ` Cui Chao
  0 siblings, 1 reply; 25+ messages in thread
From: dan.j.williams @ 2026-01-15 19:50 UTC (permalink / raw)
  To: Andrew Morton, Cui Chao
  Cc: Jonathan Cameron, Mike Rapoport, Wang Yinfeng, linux-cxl,
	linux-kernel, linux-mm

Andrew Morton wrote:
> On Thu, 15 Jan 2026 17:43:02 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:
> 
> > When a CXL RAM region is created in userspace, the memory capacity of 
> > the newly created region is not added to the CFMW-dedicated NUMA node. 
> > Instead, it is accumulated into an existing NUMA node (e.g., NUMA0 
> > containing RAM). This makes it impossible to clearly distinguish between 
> > the two types of memory, which may affect memory-tiering applications.
> > 
> 
> OK, thanks, I added this to the changelog.  Please retain it when
> sending v3.
> 
> What I'm actually looking for here are answers to the questions
> 
>   Should we backport this into -stable kernels and if so, why?
>   And if not, why not?
> 
> So a very complete description of the runtime effects really helps
> myself and others to decide which kernels to patch.  And it helps
> people to understand *why* we made that decision.
> 
> And sorry, but "may affect memory-tiering applications" isn't very
> complete!
> 
> So please, tell us how much our users are hurting from this and please
> make a recommendation on the backporting decision.
> 

To add on here, Cui, please describe which shipping hardware platforms
in the wild create physical address maps like this. For example, if this
is something that only occurs in QEMU configurations or similar, then
the urgency is low and it is debatable if Linux should even worry about
fixing it.

I know that x86 platforms typically do not do this. It is also
within the realm of possibility for platform firmware to fix. So in
addition to platform impact please also clarify why folks can not just
ask for a firmware update to get this fixed without updating their
kernel.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-15 19:50         ` dan.j.williams
@ 2026-01-22  8:03           ` Cui Chao
  2026-01-22 21:28             ` Andrew Morton
  2026-01-23 16:46             ` Gregory Price
  0 siblings, 2 replies; 25+ messages in thread
From: Cui Chao @ 2026-01-22  8:03 UTC (permalink / raw)
  To: dan.j.williams, Andrew Morton
  Cc: Jonathan Cameron, Mike Rapoport, Wang Yinfeng, linux-cxl,
	linux-kernel, linux-mm


On 1/16/2026 3:50 AM, dan.j.williams@intel.com wrote:
> Andrew Morton wrote:
>> On Thu, 15 Jan 2026 17:43:02 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:
>>
>>> When a CXL RAM region is created in userspace, the memory capacity of
>>> the newly created region is not added to the CFMW-dedicated NUMA node.
>>> Instead, it is accumulated into an existing NUMA node (e.g., NUMA0
>>> containing RAM). This makes it impossible to clearly distinguish between
>>> the two types of memory, which may affect memory-tiering applications.
>>>
>> OK, thanks, I added this to the changelog.  Please retain it when
>> sending v3.
>>
>> What I'm actually looking for here are answers to the questions
>>
>>    Should we backport this into -stable kernels and if so, why?
>>    And if not, why not?
>>
>> So a very complete description of the runtime effects really helps
>> myself and others to decide which kernels to patch.  And it helps
>> people to understand *why* we made that decision.
>>
>> And sorry, but "may affect memory-tiering applications" isn't very
>> complete!
>>
>> So please, tell us how much our users are hurting from this and please
>> make a recommendation on the backporting decision.
>>
> To add on here, Cui, please describe which shipping hardware platforms
> in the wild create physical address maps like this. For example, if this
> is something that only occurs in QEMU configurations or similar, then
> the urgency is low and it is debatable if Linux should even worry about
> fixing it.
>
> I know that x86 platforms typically do not do this. It is also
> within the realm of possibility for platform firmware to fix. So in
> addition to platform impact please also clarify why folks can not just
> ask for a firmware update to get this fixed without updating their
> kernel.

Andrew, Dan, thank you for your review.

1. Issue Impact and Backport Recommendation:

This patch fixes an issue on hardware platforms (not QEMU emulation) 
where, during the dynamic creation of a CXL RAM region, the memory 
capacity is not assigned to the correct CFMW-dedicated NUMA node. This 
issue leads to:

  * Failure of the memory tiering mechanism: The system is designed to
    treat System RAM as fast memory and CXL memory as slow memory. For
    performance optimization, hot pages may be migrated to fast memory
    while cold pages are migrated to slow memory. The system uses NUMA
    IDs as an index to identify different tiers of memory. If the NUMA
    ID for CXL memory is calculated incorrectly and its capacity is
    aggregated into the NUMA node containing System RAM (i.e., the node
    for fast memory), the CXL memory cannot be correctly identified. It
    may be misjudged as fast memory, thereby affecting performance
    optimization strategies.

  * Inability to distinguish between System RAM and CXL memory even for
    simple manual binding: Tools like numactl and other NUMA policy
    utilities cannot differentiate between System RAM and CXL memory,
    making it impossible to perform reasonable memory binding (see the
    sketch after this list).

  * Inaccurate system reporting: Tools like numactl -H would display
    memory capacities that do not match the actual physical hardware
    layout, impacting operations and monitoring.
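
For the manual-binding point above, a minimal libnuma sketch, assuming
node 2 is the CFMW-dedicated node from the example layout (illustrative
only; build with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB */
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available\n");
		return EXIT_FAILURE;
	}

	/*
	 * Ask for memory on node 2, the CFMW-dedicated node in the example.
	 * If the CXL capacity is accounted to node 0 instead, this binding
	 * lands on ordinary System RAM and the slow tier cannot be targeted.
	 */
	buf = numa_alloc_onnode(len, 2);
	if (!buf)
		return EXIT_FAILURE;

	numa_free(buf, len);
	return EXIT_SUCCESS;
}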

This issue affects all users utilizing the CXL RAM functionality who 
rely on memory tiering or NUMA-aware scheduling. Such configurations are 
becoming increasingly common in data centers, cloud computing, and 
high-performance computing scenarios.

Therefore, I recommend backporting this patch to all stable kernel 
series that support dynamic CXL region creation.

2. Why a Kernel Update is Recommended Over a Firmware Update:

In the scenario of dynamic CXL region creation, the association between 
the memory's HPA range and its corresponding NUMA node is established 
when the kernel driver performs the commit operation. This is a runtime, 
OS-managed operation where the platform firmware cannot intervene to 
provide a fix.

Considering factors like hardware platform architecture, memory 
resources, and others, such a physical address layout can indeed occur. 
This patch does not introduce risk; it simply correctly handles the NUMA 
node assignment for CXL RAM regions within such a physical address layout.

Thus, I believe a kernel fix is necessary.

-- 
Best regards,
Cui Chao.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-22  8:03           ` Cui Chao
@ 2026-01-22 21:28             ` Andrew Morton
  2026-01-23  8:59               ` Cui Chao
  2026-01-23 16:46             ` Gregory Price
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2026-01-22 21:28 UTC (permalink / raw)
  To: Cui Chao
  Cc: dan.j.williams, Jonathan Cameron, Mike Rapoport, Wang Yinfeng,
	linux-cxl, linux-kernel, linux-mm

On Thu, 22 Jan 2026 16:03:49 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:

> 
> >> So please, tell us how much our users are hurting from this and please
> >> make a recommendation on the backporting decision.
> >>
> > To add on here, Cui, please describe which shipping hardware platforms
> > in the wild create physical address maps like this. For example, if this
> > is something that only occurs in QEMU configurations or similar, then
> > the urgency is low and it is debatable if Linux should even worry about
> > fixing it.
> >
> > I know that x86 platforms typically do not do this. It is also
> > within the realm of possibility for platform firmware to fix. So in
> > addition to platform impact please also clarify why folks can not just
> > ask for a firmware update to get this fixed without updating their
> > kernel.
> 
> Andrew, Dan, thank you for your review.
> 
> 1. Issue Impact and Backport Recommendation:
>
> ...
>
> Thus, I believe a kernel fix is necessary.

Thanks, I posted all that into the changelog.

> Therefore, I recommend backporting this patch to all stable kernel 
> series that support dynamic CXL region creation.

It's helpful if we can tell -stable maintainers which kernel versions
"support dynamic CXL region creation".  We communicate that by
providing a Fixes: tag in the changelog.  Are you able to help identify
a suitable commit for this?



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-22 21:28             ` Andrew Morton
@ 2026-01-23  8:59               ` Cui Chao
  0 siblings, 0 replies; 25+ messages in thread
From: Cui Chao @ 2026-01-23  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dan.j.williams, Jonathan Cameron, Mike Rapoport, Wang Yinfeng,
	linux-cxl, linux-kernel, linux-mm


On 1/23/2026 5:28 AM, Andrew Morton wrote:
> On Thu, 22 Jan 2026 16:03:49 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:
>
>>>> So please, tell us how much our users are hurting from this and please
>>>> make a recommendation on the backporting decision.
>>>>
>>> To add on here, Cui, please describe which shipping hardware platforms
>>> in the wild create physical address maps like this. For example, if this
>>> is something that only occurs in QEMU configurations or similar, then
>>> the urgency is low and it is debatable if Linux should even worry about
>>> fixing it.
>>>
>>> I know that x86 platforms typically do not do this. It is also
>>> within the realm of possibility for platform firmware to fix. So in
>>> addition to platform impact please also clarify why folks can not just
>>> ask for a firmware update to get this fixed without updating their
>>> kernel.
>> Andrew, Dan, thank you for your review.
>>
>> 1. Issue Impact and Backport Recommendation:
>>
>> ...
>>
>> Thus, I believe a kernel fix is necessary.
> Thanks, I posted all that into the changelog.
>
>> Therefore, I recommend backporting this patch to all stable kernel
>> series that support dynamic CXL region creation.
> It's helpful if we can tell -stable maintainers which kernel versions
> "support dynamic CXL region creation".  We communicate that by
> providing a Fixes: tag in the changelog.  Are you able to help identify
> a suitable commit for this?
Although the modified code is not part of that commit, the problem has 
arguably existed since support for dynamically creating CXL memory 
regions was introduced: from that point on, this physical address layout 
and the resulting incorrect NUMA ID assignment needed to be considered 
and resolved.

Fixes: 779dd20cfb56 ("cxl/region: Add region creation support")

-- 
Best regards,
Cui Chao.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-22  8:03           ` Cui Chao
  2026-01-22 21:28             ` Andrew Morton
@ 2026-01-23 16:46             ` Gregory Price
  2026-01-26  9:06               ` Cui Chao
  1 sibling, 1 reply; 25+ messages in thread
From: Gregory Price @ 2026-01-23 16:46 UTC (permalink / raw)
  To: Cui Chao
  Cc: dan.j.williams, Andrew Morton, Jonathan Cameron, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm

On Thu, Jan 22, 2026 at 04:03:49PM +0800, Cui Chao wrote:
> 2. Why a Kernel Update is Recommended Over a Firmware Update:
> 
> In the scenario of dynamic CXL region creation, the association between the
> memory's HPA range and its corresponding NUMA node is established when the
> kernel driver performs the commit operation. This is a runtime, OS-managed
> operation where the platform firmware cannot intervene to provide a fix.
>

This is not accurate

The memory-to-node association for CXL memory is built by acpi logic:

   linux/drivers/acpi/numa/srat.c

Specifically:

   acpi_parse_memory_affinity()  /* if SRAT entry exists */
      -> numa_add_memblk(node, start, end)

   acpi_parse_cfmws()  /* if no SRAT entry exists */
      -> numa_add_reserved_memblk(node, start, end)


This patch implies the latter is occurring - as it queries the reserved
block associations - meaning your platform is not shipping SRAT tables
for CXL memory regions.

We have only seen this in QEMU - and this is correctable in firmware.

But if this is shipped hardware, letting us know the platform lets us
know whether we should backport it.

---

All that said, this does look harmless, and seems reasonable - but the
changelog should reflect what the hardware is doing above.

~Gregory


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-23 16:46             ` Gregory Price
@ 2026-01-26  9:06               ` Cui Chao
  2026-02-05 22:58                 ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Cui Chao @ 2026-01-26  9:06 UTC (permalink / raw)
  To: Gregory Price
  Cc: dan.j.williams, Andrew Morton, Jonathan Cameron, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm


On 1/24/2026 12:46 AM, Gregory Price wrote:
> On Thu, Jan 22, 2026 at 04:03:49PM +0800, Cui Chao wrote:
>> 2. Why a Kernel Update is Recommended Over a Firmware Update:
>>
>> In the scenario of dynamic CXL region creation, the association between the
>> memory's HPA range and its corresponding NUMA node is established when the
>> kernel driver performs the commit operation. This is a runtime, OS-managed
>> operation where the platform firmware cannot intervene to provide a fix.
>>
> This is not accurate
>
> The memory-to-node association for CXL memory is built by acpi logic:
>
>     linux/drivers/acpi/numa/srat.c
>
> Specifically:
>
>     acpi_parse_memory_affinity()  /* if SRAT entry exists */
>        -> numa_add_memblk(node, start, end)
>
>     acpi_parse_cfmws()  /* if no SRAT entry exists */
>        -> numa_add_reserved_memblk(node, start, end)
>
>
> This patch implies the latter is occurring - as it queries the reserved
> block associations - meaning your platform is not shipping SRAT tables
> for CXL memory regions.
Sorry, my previous statement was ambiguous. What I intended to convey is 
that the moment when CXL memory is actually assigned to a dedicated NUMA 
node and becomes available to applications is during the creation of the 
region.
> We have only seen this in QEMU - and this is correctable in firmware.
>
> But if this is shipped hardware, letting us know the platform lets us
> know whether we should backport it.
>
> ---
>
> All that said, this does look harmless, and seems reasonable - but the
> changelog should reflect what the hardware is doing above.
This issue was discovered on the QEMU platform. I need to apologize for 
my earlier imprecise statement (claiming it was hardware instead of 
QEMU). My core point at the time was to emphasize that this is a problem 
in the general code path when facing this scenario, not a QEMU-specific 
emulation issue, and therefore it could theoretically affect real 
hardware as well. I apologize for any confusion this may have caused.
>
> ~Gregory

-- 
Best regards,
Cui Chao.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-01-26  9:06               ` Cui Chao
@ 2026-02-05 22:58                 ` Andrew Morton
  2026-02-05 23:10                   ` Gregory Price
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2026-02-05 22:58 UTC (permalink / raw)
  To: Cui Chao
  Cc: Gregory Price, dan.j.williams, Jonathan Cameron, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm

On Mon, 26 Jan 2026 17:06:52 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:

> > All that said, this does look harmless, and seems reasonable - but the
> > changelog should reflect what the hardware is doing above.
> This issue was discovered on the QEMU platform. I need to apologize for 
> my earlier imprecise statement (claiming it was hardware instead of 
> QEMU). My core point at the time was to emphasize that this is a problem 
> in the general code path when facing this scenario, not a QEMU-specific 
> emulation issue, and therefore it could theoretically affect real 
> hardware as well. I apologize for any confusion this may have caused.

This patch doesn't sound very urgent.  Perhaps we should do a v3 with
updated changelog and handle that in the next -rc cycle?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-05 22:58                 ` Andrew Morton
@ 2026-02-05 23:10                   ` Gregory Price
  2026-02-06 11:03                     ` Jonathan Cameron
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Price @ 2026-02-05 23:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cui Chao, dan.j.williams, Jonathan Cameron, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm

On Thu, Feb 05, 2026 at 02:58:42PM -0800, Andrew Morton wrote:
> On Mon, 26 Jan 2026 17:06:52 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:
> 
> > > All that said, this does look harmless, and seems reasonable - but the
> > > changelog should reflect what the hardware is doing above.
> > This issue was discovered on the QEMU platform. I need to apologize for 
> > my earlier imprecise statement (claiming it was hardware instead of 
> > QEMU). My core point at the time was to emphasize that this is a problem 
> > in the general code path when facing this scenario, not a QEMU-specific 
> > emulation issue, and therefore it could theoretically affect real 
> > hardware as well. I apologize for any confusion this may have caused.
> 
> This patch doesn't sound very urgent.  Perhaps we should do a v3 with
> updated changelog and handle that in the next -rc cycle?

Mostly QEMU just needs to add SRAT entries associated with the
CEDT/CFMWS it adds.

A system providing a CEDT/CFMWS entry without an SRAT entry is arguably
bad BIOS.

But yeah, this is not urgent.

~Gregory


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-05 23:10                   ` Gregory Price
@ 2026-02-06 11:03                     ` Jonathan Cameron
  2026-02-06 13:31                       ` Gregory Price
  0 siblings, 1 reply; 25+ messages in thread
From: Jonathan Cameron @ 2026-02-06 11:03 UTC (permalink / raw)
  To: Gregory Price
  Cc: Andrew Morton, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel

On Thu, 5 Feb 2026 18:10:55 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Thu, Feb 05, 2026 at 02:58:42PM -0800, Andrew Morton wrote:
> > On Mon, 26 Jan 2026 17:06:52 +0800 Cui Chao <cuichao1753@phytium.com.cn> wrote:
> >   
> > > > All that said, this does look harmless, and seems reasonable - but the
> > > > changelog should reflect what the hardware is doing above.  
> > > This issue was discovered on the QEMU platform. I need to apologize for 
> > > my earlier imprecise statement (claiming it was hardware instead of 
> > > QEMU). My core point at the time was to emphasize that this is a problem 
> > > in the general code path when facing this scenario, not a QEMU-specific 
> > > emulation issue, and therefore it could theoretically affect real 
> > > hardware as well. I apologize for any confusion this may have caused.  
> > 
> > This patch doesn't sound very urgent.  Perhaps we should do a v3 with
> > updated changelog and handle that in the next -rc cycle?  
> 
> Mostly QEMU just needs to add SRAT entries associated with the
> CEDT/CFMWS it adds.

Hi Gregory,

I got a bit carried away - but the following basically says: no, QEMU should not
add SRAT Memory Affinity Structures. As to Andrew's question: I'm fine with this
fix taking a little longer.

I disagree. There is nothing in the specification to say it should do that, and
we have very intentionally not done so in QEMU - this is far from the first
time this has come up! We won't be doing so any time soon unless someone
convinces me with clear spec references and tight reasoning why it is the
right thing to do.

The only time providing SRAT Memory Affinity Structures for CEDT CXL Fixed
Memory Window Structures (CFMWSs) is definitely the right thing to do is
if the BIOS has also programmed the full set of decoders etc. That is something
we could do in QEMU as an option. Only if we do that would it be valid
to provide SRAT Memory structures for the CXL memory. I'd suggest that's
probably a job for EDK2 rather than QEMU, but that's an implementation detail
and there is a dance between EDK2 and QEMU for creating some of the tables
anyway. This configuration reflects the pre-hotplug / early CXL deployment
situation. Now that we have proper support in Linux, we have moved beyond that.
We do need to solve the dynamic NUMA node cases though, and I'm hoping your
current work will make that a bit easier.

Note that I give the same advice to the firmware folk I talk to. This stuff
is policy - it belongs in OS control, not in a bunch of config menus in the
BIOS or the output of some unknown heuristic. BIOS authors are not clairvoyant.
They have no way to know (in a non-trivial topology) what makes sense for a
given use case or what devices are going to be hotplugged later.
I'd increasingly expect shipping BIOSes to have a "hands off" option in which
they make no attempt to guess about what is beyond the host bridges.

One argument I have heard for why a BIOS could know an appropriate CFMWS to
SRAT memory structure mapping is the CFMWS / QTG (Quality of Service)
mapping implying a consistency of performance expectations in a given CFMWS.
However, that's very specific to particular designs. For others, PA space is
expensive, so they use one large CFMWS for everything and QoS handling in
the uarch relies on information derived from the host bridge decoders. Often
no one cares about cross host bridge interleave for the same reason that
full-system DRAM interleave is a niche thing. PA space is too expensive to
provide the extra CFMWS to support it.

If we are looking at forward-looking systems that are built to work with the
full gamut of CXL, then all that should be in SRAT for the CXL topology is the
Generic Port Structures (these provide a handle for perf data in HMAT to the
edge of the 'known world' - the host bridge / root ports). Nothing else.

If we do have BIOSes that are guessing what to put in SRAT and associated
HMAT etc then there is a fairly strong argument that a good OS
implementation should at most take such structures as a hint not a
rule (though obviously we don't do that today).

> 
> A system providing a CEDT/CFMWS entry without an SRAT entry is arguably
> bad BIOS.

I'd argue that if you aren't programming the decoder topology (and probably
locking everything down) and are providing SRAT then you are providing
a guess at best and that's a bad BIOS - not the other way around.

Note we've supported this from the first in Linux so it's not like there
is anything missing, just a corner case to tidy up.

> 
> But yeah, this is not urgent.

I'd like to see it fixed, but given we don't know of a system where
this applies today it doesn't need to be super rushed!

Jonathan

> 
> ~Gregory



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-06 11:03                     ` Jonathan Cameron
@ 2026-02-06 13:31                       ` Gregory Price
  2026-02-06 15:09                         ` Jonathan Cameron
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Price @ 2026-02-06 13:31 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Andrew Morton, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel

On Fri, Feb 06, 2026 at 11:03:05AM +0000, Jonathan Cameron wrote:
> On Thu, 5 Feb 2026 18:10:55 -0500
> Gregory Price <gourry@gourry.net> wrote:
> 
> I disagree. There is nothing in the specification to say it should do that and
> we have very intentionally not done so in QEMU - this is far from the first
> time this has come up!. We won't be doing so any time soon unless someone
> convinces me with clear spec references and tight reasoning for why it is the
> right thing to do.
> 

Interestingly I've had this exact conversation - in reverse - with other
platform folks, who think CFMWS w/o SRAT is broken.  It was a zealous
enough opinion that I may have over-indexed on it (plus I've read the
numa mapping code, and making this more dynamic seems difficult).

> This configuration reflects the pre hotplug / early CXL deployment
> situation. Now we have proper support in Linux we have moved beyond that.
> We do need to solve the dynamic NUMA node cases though and I'm hoping your
> current work will make that a bit easier.

If we want flexibility to ship HPAs around to different nodes at
runtime, that might cause issues. The page-to-nid / pa-to-nid mapping
code is somewhat expected to be immutable after __init, so there could
be nasty assumptions sprinkled all over the kernel.

That will take some time.
---

Andrew if Jonathan is good with it then with changelog updates this can
go in, otherwise I don't think this warrants a backport or anything.

~Gregory


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-06 13:31                       ` Gregory Price
@ 2026-02-06 15:09                         ` Jonathan Cameron
  2026-02-06 15:53                           ` Gregory Price
  2026-02-06 15:57                           ` Andrew Morton
  0 siblings, 2 replies; 25+ messages in thread
From: Jonathan Cameron @ 2026-02-06 15:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: Andrew Morton, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel,
	David Hildenbrand (Arm)

On Fri, 6 Feb 2026 08:31:09 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Fri, Feb 06, 2026 at 11:03:05AM +0000, Jonathan Cameron wrote:
> > On Thu, 5 Feb 2026 18:10:55 -0500
> > Gregory Price <gourry@gourry.net> wrote:
> > 
> > I disagree. There is nothing in the specification to say it should do that and
> > we have very intentionally not done so in QEMU - this is far from the first
> > time this has come up!. We won't be doing so any time soon unless someone
> > convinces me with clear spec references and tight reasoning for why it is the
> > right thing to do.
> >   
> 
> Interestingly I've had this exact conversation - in reverse - with other
> platform folks, who think CFMWS w/o SRAT is broken.  It was a zealous
> enough opinion that I may have over-indexed on it (plus i've read the
> numa mapping code and making this more dynamic seems difficult).

I'd be curious as to why they thought it was broken.  What info did they
think SRAT conveyed in this case?  Or was it a case of "today's OS doesn't
get this right, therefore it's broken"?  It's only relatively recently that
everything (perf reporting etc) has been in place without the SRAT
entries and associated SLIT + HMAT, so maybe that was their use case.

Note I've run into a bunch of cases over the years of the 'correct'
description for a system in ACPI not working in Linux. Often
no one fixes that, they just lie in ACPI instead. :(

> 
> > This configuration reflects the pre hotplug / early CXL deployment
> > situation. Now we have proper support in Linux we have moved beyond that.
> > We do need to solve the dynamic NUMA node cases though and I'm hoping your
> > current work will make that a bit easier.  
> 
> If we want flexibility to ship HPAs around to different nodes at
> runtime, that might cause issues. The page-to-nid / pa-to-nid mapping
> code is somewhat expected to be immutable after __init, so there could
> be nasty assumptions sprinkled all over the kernel.

Why?  The numa-memblk stuff makes that assumption, but keeping that around
after initial boot is mostly just about 'where' to put the memory if we have
no other way of knowing. The node assignment for traditional memory
hotplug doesn't even look at numa-memblk - it uses the node provided
in the ACPI blob for the DIMM that is arriving.

The following is from the school of 'what if I' + 'what could possibly go
wrong?' - and a vague recollection of how this works in practice.

To do this with QEMU, just spin up with something like (hand-typed from the
wrong machine, so beware silly errors):

Now a fun corner is that a node isn't created unless there is something
in it - the whole SRAT is the source of truth for what nodes exist
- so we need 'something' in it - a cpu will do, or a GI, probably a GP.
Otherwise memory ends up in node0.  However, fallback lists etc happen
as normal when first mem in a node is added.

qemu-system-aarch64 -M virt,gic-version=3 -m 4g,maxmem=8g,slots=4 -cpu max \
-smp 4 ... \
-monitor telnet:127.0.0.1:1234 \
-object memory-backend-ram,size=4G,id=mem0 \
-numa node,nodeid=0,cpus=0,memdev=mem0 \
-numa node,nodeid=1,cpus=1 \
-numa node,nodeid=2,cpus=2 \
-numa node,nodeid=3,cpus=3 

.. hmat stuff if you like.

Then from the monitor via telnet

object_add memory-backend-ram,id=mema,size=1G
object_add memory-backend-ram,id=memb,size=1G
object_add memory-backend-ram,id=memc,size=1G
object_add memory-backend-ram,id=memd,size=1G
device_add pc-dimm,id=dimm1,memdev=mema,node=0
device_add pc-dimm,id=dimm2,memdev=memb,node=1
device_add pc-dimm,id=dimm3,memdev=memc,node=2
device_add pc-dimm,id=dimm4,memdev=memd,node=3

and you'll get 1G added to the first node and 1G added to the 2nd one

SRAT has:

[078h 0120 001h]               Subtable Type : 01 [Memory Affinity]
[079h 0121 001h]                      Length : 28

[07Ah 0122 004h]            Proximity Domain : 00000000
[07Eh 0126 002h]                   Reserved1 : 0000
[080h 0128 008h]                Base Address : 0000000040000000
[088h 0136 008h]              Address Length : 0000000100000000
[090h 0144 004h]                   Reserved2 : 00000000
[094h 0148 004h]       Flags (decoded below) : 00000001
                                     Enabled : 1
                               Hot Pluggable : 0
                                Non-Volatile : 0
[098h 0152 008h]                   Reserved3 : 0000000000000000

[0A0h 0160 001h]               Subtable Type : 01 [Memory Affinity]
[0A1h 0161 001h]                      Length : 28

[0A2h 0162 004h]            Proximity Domain : 00000003
[0A6h 0166 002h]                   Reserved1 : 0000
[0A8h 0168 008h]                Base Address : 0000000140000000
[0B0h 0176 008h]              Address Length : 0000000200000000
[0B8h 0184 004h]                   Reserved2 : 00000000
[0BCh 0188 004h]       Flags (decoded below) : 00000003
                                     Enabled : 1
                               Hot Pluggable : 1
                                Non-Volatile : 0
[0C0h 0192 008h]                   Reserved3 : 0000000000000000

So it thought the extra space in SRAT was in PXM 3...
A fun question for another day is why that is twice as big as it should be
given the presence of 4G at boot on domain 0 (which is taken into account
when you try to hotplug anything!)

Resulting kernel log:

memblock_add_node: [0x0000000140000000-0x000000017fffffff] nid=0 flags=0 add_memory_resource+0x110/0x5a0
memblock_add_node: [0x0000000180000000-0x00000001bfffffff] nid=1 flags=0 add_memory_resource+0x110/0x5a0
memblock_add_node: [0x00000001c0000000-0x00000001ffffffff] nid=2 flags=0 add_memory_resource+0x110/0x5a0
memblock_add_node: [0x0000000200000000-0x000000023fffffff] nid=3 flags=0 add_memory_resource+0x110/0x5a0

I ran some brief stress tests - basic stuff all looks fine.

FWIW, before numa-memblk got fixed up on Arm, CXL memory always showed up on
node 0 even if we had CFMWS entries and had instantiated nodes for them.
I don't recall anything functionally breaking (other than the obvious
performance issue of it appearing to have much better performance than
it actually did).

It will take more work to make this stuff as dynamic as we'd like but at
least from dumb testing it looks like there is nothing fundamental.
(I'm too lazy to spin a test up on x86 just to check if it's different ;)

For now I 'suspect' we could hack things to provide lots of waiting numa nodes
and merrily assign HPA into them as we like, whatever SRAT provides
in the way of 'hints' :) 



> 
> That will take some time.
> ---
> 
> Andrew if Jonathan is good with it then with changelog updates this can
> go in, otherwise I don't think this warrants a backport or anything.

Wait and see if anyone hits it on a real machine (or even non creative QEMU
setup!)  So for now no need to backport.

Jonathan

> 
> ~Gregory



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-06 15:09                         ` Jonathan Cameron
@ 2026-02-06 15:53                           ` Gregory Price
  2026-02-06 16:26                             ` Jonathan Cameron
  2026-02-06 15:57                           ` Andrew Morton
  1 sibling, 1 reply; 25+ messages in thread
From: Gregory Price @ 2026-02-06 15:53 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Andrew Morton, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel,
	David Hildenbrand (Arm)

On Fri, Feb 06, 2026 at 03:09:41PM +0000, Jonathan Cameron wrote:
> On Fri, 6 Feb 2026 08:31:09 -0500
> Gregory Price <gourry@gourry.net> wrote:
> 
> Now a fun corner is that a node isn't created unless there is something
> in it - the whole SRAT is the source of truth for what nodes exist
> - so we need 'something' in it - a cpu will do, or a GI, probably a GP.
> Otherwise memory ends up in node0.  However, fallback lists etc happen
> as normal when first mem in a node is added.
> 
...
> For now I 'suspect' we could hack things to provide lots of waiting numa nodes
> and merrily assign HPA into them as we like whatever SRAT provides
> in the way of 'hints' :) 
> 

look at ACPI MSCT - "Maximum Proximity Domain Information Structure" ;]

I don't remember reading anything in the ACPI spec that says something
has to be ON any of these PXMs for it to be accounted for in the MSCT.

Platforms can just say "Reserve that many Nodes".

(Linux does not read this value, and on my existing systems, this number
always reflects the number of actually present PXMs)

---

We probably want to ignore that and just add this:

CONFIG_ACPI_NUMA_NODES_PER_CFMWS
    int
    range 1 4
    help
        This option determines the number of NUMA nodes that will be
        added for each CEDT CFMWS entry.

        By default ACPI reserves 1 per unique PXM entry in the SRAT,
        or 1 for a CXL Fixed Memory Window without SRAT mappings.

        This will reserve up to N nodes per CEDT entry, even if that
        CEDT has one or more SRAT entries.

then in the acpi/numa/srat.c code that parses srat/cedt, just track
the number of nodes over a CEDT range.

for each srat:
   account_unique_pxm(pxm, srat_range)

for each cedt:
   nr_nodes = unique_pxms(cedt_range)
   while (nr_nodes < CONFIG_ACPI_NUMA_NODES_PER_CFMWS)
      node = acpi_map_pxm_to_node(*fake_pxm++);
      if (node == NUMA_NO_NODE):
         err("Unable to reserve additional nodes for CXL windows")
         break;
      node_set(node, numa_nodes_parsed);
      nr_nodes++

This should fall out cleanly.

The additional nodes won't be associated with anything, but could be
used for hotplug - I imagine.

~Gregory


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-06 15:09                         ` Jonathan Cameron
  2026-02-06 15:53                           ` Gregory Price
@ 2026-02-06 15:57                           ` Andrew Morton
  2026-02-06 16:23                             ` Jonathan Cameron
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2026-02-06 15:57 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Gregory Price, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel,
	David Hildenbrand (Arm)

On Fri, 6 Feb 2026 15:09:41 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

> > Andrew if Jonathan is good with it then with changelog updates this can
> > go in, otherwise I don't think this warrants a backport or anything.
> 
> Wait and see if anyone hits it on a real machine (or even non creative QEMU
> setup!)  So for now no need to backport.

Thanks, all.

Below is the current state of this patch.  Is the changelog suitable?


From: Cui Chao <cuichao1753@phytium.com.cn>
Subject: mm: numa_memblks: identify the accurate NUMA ID of CFMW
Date: Tue, 6 Jan 2026 11:10:42 +0800

In some physical memory layout designs, the address space of CFMW (CXL
Fixed Memory Window) resides between multiple segments of system memory
belonging to the same NUMA node.  In numa_cleanup_meminfo, these multiple
segments of system memory are merged into a larger numa_memblk.  When
identifying which NUMA node the CFMW belongs to, it may be incorrectly
assigned to the NUMA node of the merged system memory.

When a CXL RAM region is created in userspace, the memory capacity of
the newly created region is not added to the CFMW-dedicated NUMA node. 
Instead, it is accumulated into an existing NUMA node (e.g., NUMA0
containing RAM).  This makes it impossible to clearly distinguish
between the two types of memory, which may affect memory-tiering
applications.

Example memory layout:

Physical address space:
    0x00000000 - 0x1FFFFFFF  System RAM (node0)
    0x20000000 - 0x2FFFFFFF  CXL CFMW (node2)
    0x40000000 - 0x5FFFFFFF  System RAM (node0)
    0x60000000 - 0x7FFFFFFF  System RAM (node1)

After numa_cleanup_meminfo, the two node0 segments are merged into one:
    0x00000000 - 0x5FFFFFFF  System RAM (node0) // CFMW is inside the range
    0x60000000 - 0x7FFFFFFF  System RAM (node1)

So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.

To address this scenario, accurately identifying the correct NUMA node
can be achieved by checking whether the region belongs to both
numa_meminfo and numa_reserved_meminfo.
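
For illustration, a minimal standalone C model (not kernel code) of the
lookup order before and after this patch, using the example layout
above; struct blk and to_nid() are simplified stand-ins for struct
numa_memblk and meminfo_to_nid():

#include <stdio.h>
#include <stdint.h>

#define NUMA_NO_NODE (-1)

struct blk { uint64_t start, end; int nid; };	/* [start, end) */

/* numa_meminfo after numa_cleanup_meminfo merged the two node0 segments */
static const struct blk meminfo[] = {
	{ 0x00000000, 0x60000000, 0 },	/* merged System RAM, CFMW sits inside */
	{ 0x60000000, 0x80000000, 1 },	/* System RAM, node1 */
};

/* numa_reserved_meminfo still carries the CFMW range on its own node */
static const struct blk reserved[] = {
	{ 0x20000000, 0x30000000, 2 },	/* CXL CFMW, node2 */
};

static int to_nid(const struct blk *mi, int n, uint64_t start)
{
	for (int i = 0; i < n; i++)
		if (mi[i].start <= start && start < mi[i].end)
			return mi[i].nid;
	return NUMA_NO_NODE;
}

int main(void)
{
	uint64_t cfmw = 0x20000000;
	int nid = to_nid(meminfo, 2, cfmw);
	int reserved_nid = to_nid(reserved, 1, cfmw);

	/* old order: the merged numa_meminfo entry wins -> node0 */
	printf("old: node %d\n", nid != NUMA_NO_NODE ? nid : reserved_nid);

	/* patched order: a hit in the reserved ranges takes precedence -> node2 */
	printf("new: node %d\n",
	       (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE) ?
	       nid : reserved_nid);
	return 0;
}

With the old order the CFMW address resolves to node0 via the merged
numa_meminfo entry; with the patched order a hit in the reserved ranges
takes precedence and it resolves to node2.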


1. Issue Impact and Backport Recommendation:

This patch fixes an issue on hardware platforms (not QEMU emulation)
where, during the dynamic creation of a CXL RAM region, the memory
capacity is not assigned to the correct CFMW-dedicated NUMA node.  This
issue leads to:

    Failure of the memory tiering mechanism: The system is designed to
    treat System RAM as fast memory and CXL memory as slow memory. For
    performance optimization, hot pages may be migrated to fast memory
    while cold pages are migrated to slow memory. The system uses NUMA
    IDs as an index to identify different tiers of memory. If the NUMA
    ID for CXL memory is calculated incorrectly and its capacity is
    aggregated into the NUMA node containing System RAM (i.e., the node
    for fast memory), the CXL memory cannot be correctly identified. It
    may be misjudged as fast memory, thereby affecting performance
    optimization strategies.

    Inability to distinguish between System RAM and CXL memory even for
    simple manual binding: Tools like numactl and other NUMA policy
    utilities cannot differentiate between System RAM and CXL memory,
    making it impossible to perform reasonable memory binding.

    Inaccurate system reporting: Tools like numactl -H would display
    memory capacities that do not match the actual physical hardware
    layout, impacting operations and monitoring.

This issue affects all users utilizing the CXL RAM functionality who
rely on memory tiering or NUMA-aware scheduling.  Such configurations
are becoming increasingly common in data centers, cloud computing, and
high-performance computing scenarios.

Therefore, I recommend backporting this patch to all stable kernel 
series that support dynamic CXL region creation.

2. Why a Kernel Update is Recommended Over a Firmware Update:

In the scenario of dynamic CXL region creation, the association between
the memory's HPA range and its corresponding NUMA node is established
when the kernel driver performs the commit operation.  This is a
runtime, OS-managed operation where the platform firmware cannot
intervene to provide a fix.

Considering factors like hardware platform architecture, memory
resources, and others, such a physical address layout can indeed occur.
This patch does not introduce risk; it simply correctly handles the
NUMA node assignment for CXL RAM regions within such a physical address
layout.

Thus, I believe a kernel fix is necessary.

Link: https://lkml.kernel.org/r/20260106031042.1606729-2-cuichao1753@phytium.com.cn
Fixes: 779dd20cfb56 ("cxl/region: Add region creation support")
Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Wang Yinfeng <wangyinfeng@phytium.com.cn>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Wang Yinfeng <wangyinfeng@phytium.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/numa_memblks.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/mm/numa_memblks.c~mm-numa_memblks-identify-the-accurate-numa-id-of-cfmw
+++ a/mm/numa_memblks.c
@@ -570,15 +570,16 @@ static int meminfo_to_nid(struct numa_me
 int phys_to_target_node(u64 start)
 {
 	int nid = meminfo_to_nid(&numa_meminfo, start);
+	int reserved_nid = meminfo_to_nid(&numa_reserved_meminfo, start);
 
 	/*
 	 * Prefer online nodes, but if reserved memory might be
 	 * hot-added continue the search with reserved ranges.
 	 */
-	if (nid != NUMA_NO_NODE)
+	if (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE)
 		return nid;
 
-	return meminfo_to_nid(&numa_reserved_meminfo, start);
+	return reserved_nid;
 }
 EXPORT_SYMBOL_GPL(phys_to_target_node);
 
_



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-06 15:57                           ` Andrew Morton
@ 2026-02-06 16:23                             ` Jonathan Cameron
  0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2026-02-06 16:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gregory Price, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel,
	David Hildenbrand (Arm)

On Fri, 6 Feb 2026 07:57:09 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 6 Feb 2026 15:09:41 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> > > Andrew if Jonathan is good with it then with changelog updates this can
> > > go in, otherwise I don't think this warrants a backport or anything.  
> > 
> > Wait and see if anyone hits it on a real machine (or even non creative QEMU
> > setup!)  So for now no need to backport.  
> 
> Thanks, all.
> 
> Below is the current state of this patch.  Is the changelog suitable?
Hi Andrew

Not quite..

> 
> 
> From: Cui Chao <cuichao1753@phytium.com.cn>
> Subject: mm: numa_memblks: identify the accurate NUMA ID of CFMW
> Date: Tue, 6 Jan 2026 11:10:42 +0800
> 
> In some physical memory layout designs, the address space of CFMW (CXL
> Fixed Memory Window) resides between multiple segments of system memory
> belonging to the same NUMA node.  In numa_cleanup_meminfo, these multiple
> segments of system memory are merged into a larger numa_memblk.  When
> identifying which NUMA node the CFMW belongs to, it may be incorrectly
> assigned to the NUMA node of the merged system memory.
> 
> When a CXL RAM region is created in userspace, the memory capacity of
> the newly created region is not added to the CFMW-dedicated NUMA node. 
> Instead, it is accumulated into an existing NUMA node (e.g., NUMA0
> containing RAM).  This makes it impossible to clearly distinguish
> between the two types of memory, which may affect memory-tiering
> applications.
> 
> Example memory layout:
> 
> Physical address space:
>     0x00000000 - 0x1FFFFFFF  System RAM (node0)
>     0x20000000 - 0x2FFFFFFF  CXL CFMW (node2)
>     0x40000000 - 0x5FFFFFFF  System RAM (node0)
>     0x60000000 - 0x7FFFFFFF  System RAM (node1)
> 
> After numa_cleanup_meminfo, the two node0 segments are merged into one:
>     0x00000000 - 0x5FFFFFFF  System RAM (node0) // CFMW is inside the range
>     0x60000000 - 0x7FFFFFFF  System RAM (node1)
> 
> So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.
> 
> To address this scenario, accurately identifying the correct NUMA node
> can be achieved by checking whether the region belongs to both
> numa_meminfo and numa_reserved_meminfo.
> 
> 
> 1. Issue Impact and Backport Recommendation:
> 
> This patch fixes an issue on hardware platforms (not QEMU emulation)

I think this bit turned out to be a bit misleading.  Cui Chao
clarified in: 
https://lore.kernel.org/all/a90bc6f2-105c-4ffc-99d9-4fa5eaa79c45@phytium.com.cn/


"This issue was discovered on the QEMU platform. I need to apologize for 
my earlier imprecise statement (claiming it was hardware instead of 
QEMU). My core point at the time was to emphasize that this is a problem 
in the general code path when facing this scenario, not a QEMU-specific 
emulation issue, and therefore it could theoretically affect real 
hardware as well. I apologize for any confusion this may have caused."

So, whilst this could happen on a real hardware platform, for now we aren't
aware of a suitable configuration actually happening. I'm not sure we can
even create it in QEMU without some tweaks.

Other than relaxing this to perhaps say that a hardware platform 'might'
have a configuration like this, the description here looks good to me.

Thanks!

Jonathan


> where, during the dynamic creation of a CXL RAM region, the memory
> capacity is not assigned to the correct CFMW-dedicated NUMA node.  This
> issue leads to:
> 
>     Failure of the memory tiering mechanism: The system is designed to
>     treat System RAM as fast memory and CXL memory as slow memory. For
>     performance optimization, hot pages may be migrated to fast memory
>     while cold pages are migrated to slow memory. The system uses NUMA
>     IDs as an index to identify different tiers of memory. If the NUMA
>     ID for CXL memory is calculated incorrectly and its capacity is
>     aggregated into the NUMA node containing System RAM (i.e., the node
>     for fast memory), the CXL memory cannot be correctly identified. It
>     may be misjudged as fast memory, thereby affecting performance
>     optimization strategies.
> 
>     Inability to distinguish between System RAM and CXL memory even for
>     simple manual binding: Tools like numactl and other NUMA policy
>     utilities cannot differentiate between System RAM and CXL memory,
>     making it impossible to perform reasonable memory binding.
> 
>     Inaccurate system reporting: Tools like numactl -H would display
>     memory capacities that do not match the actual physical hardware
>     layout, impacting operations and monitoring.
> 
> This issue affects all users utilizing the CXL RAM functionality who
> rely on memory tiering or NUMA-aware scheduling.  Such configurations
> are becoming increasingly common in data centers, cloud computing, and
> high-performance computing scenarios.
> 
> Therefore, I recommend backporting this patch to all stable kernel 
> series that support dynamic CXL region creation.
> 
> 2. Why a Kernel Update is Recommended Over a Firmware Update:
> 
> In the scenario of dynamic CXL region creation, the association between
> the memory's HPA range and its corresponding NUMA node is established
> when the kernel driver performs the commit operation.  This is a
> runtime, OS-managed operation where the platform firmware cannot
> intervene to provide a fix.
> 
> Considering factors like hardware platform architecture, memory
> resources, and others, such a physical address layout can indeed occur.
> This patch does not introduce risk; it simply correctly handles the
> NUMA node assignment for CXL RAM regions within such a physical address
> layout.
> 
> Thus, I believe a kernel fix is necessary.
> 
> Link: https://lkml.kernel.org/r/20260106031042.1606729-2-cuichao1753@phytium.com.cn
> Fixes: 779dd20cfb56 ("cxl/region: Add region creation support")
> Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Wang Yinfeng <wangyinfeng@phytium.com.cn>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Gregory Price <gourry@gourry.net>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Wang Yinfeng <wangyinfeng@phytium.com.cn>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/numa_memblks.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> --- a/mm/numa_memblks.c~mm-numa_memblks-identify-the-accurate-numa-id-of-cfmw
> +++ a/mm/numa_memblks.c
> @@ -570,15 +570,16 @@ static int meminfo_to_nid(struct numa_me
>  int phys_to_target_node(u64 start)
>  {
>  	int nid = meminfo_to_nid(&numa_meminfo, start);
> +	int reserved_nid = meminfo_to_nid(&numa_reserved_meminfo, start);
>  
>  	/*
>  	 * Prefer online nodes, but if reserved memory might be
>  	 * hot-added continue the search with reserved ranges.
>  	 */
> -	if (nid != NUMA_NO_NODE)
> +	if (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE)
>  		return nid;
>  
> -	return meminfo_to_nid(&numa_reserved_meminfo, start);
> +	return reserved_nid;
>  }
>  EXPORT_SYMBOL_GPL(phys_to_target_node);
>  
> _
> 
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-06 15:53                           ` Gregory Price
@ 2026-02-06 16:26                             ` Jonathan Cameron
  2026-02-06 16:32                               ` Gregory Price
  0 siblings, 1 reply; 25+ messages in thread
From: Jonathan Cameron @ 2026-02-06 16:26 UTC (permalink / raw)
  To: Gregory Price
  Cc: Andrew Morton, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel,
	David Hildenbrand (Arm)

On Fri, 6 Feb 2026 10:53:11 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Fri, Feb 06, 2026 at 03:09:41PM +0000, Jonathan Cameron wrote:
> > On Fri, 6 Feb 2026 08:31:09 -0500
> > Gregory Price <gourry@gourry.net> wrote:
> > 
> > Now a fun corner is that a node isn't created unless there is something
> > in it - the whole SRAT is the source of truth for what nodes exist
> > - so we need 'something' in it - a cpu will do, or a GI, probably a GP.
> > Otherwise memory ends up in node0.  However, fallback lists etc happen
> > as normal when first mem in a node is added.
> >   
> ...
> > For now I 'suspect' we could hack things to provide lots of waiting numa nodes
> > and merrily assign HPA into them as we like whatever SRAT provides
> > in the way of 'hints' :) 
> >   
> 
> look at ACPI MSCT - "Maximum Proximity Domain Information Structure" ;]
> 
> I don't remember reading anything in the ACPI spec that says something
> has to be ON any of these PXMs for it to be accounted for in the MSCT.
> 
> Platforms can just say "Reserve that many Nodes".
> 
> (Linux does not read this value, and on my existing systems, this number
> always reflects the number of actually present PXMs)
> 
> ---
> 
> We probably want to ignore that and just add this:
> 
> CONFIG_ACPI_NUMA_NODES_PER_CFMWS
>     int
>     range 1 4
>     help
>         This option determines the number of NUMA nodes that will be
> 	added for each CEDT CFMWS entry.
> 
> 	By default ACPI reserves 1 per unique PXM entry in the SRAT,
> 	or 1 for a CXL Fixed Memory Window without SRAT mappings.
> 
> 	This will reserve up to N nodes per CEDT entry, even if that
> 	CEDT has one or more SRAT entries.
> 
> then in the acpi/numa/srat.c code that parses srat/cedt, just track
> the number of nodes over a CEDT range.
> 
> for each srat:
>    account_unique_pxm(pxm, srat_range)
> 
> for each cedt:
>    nr_nodes = unique_pxms(cedt_range)
>    while (nr_nodes < CONFIG_ACPI_NUMA_NODES_PER_CFMWS)
>       node = acpi_map_pxm_to_node(*fake_pxm++);
>       if (node == NUMA_NO_NODE):
>       	err("Unable to reserve additional nodes for CXL windows")
> 	break;
>       node_set(node, numa_nodes_parsed);
>       nr_nodes++
> 
> This should fall out cleanly.
> 
> The additional nodes won't be associated with anything, but could be
> used for hotplug - I imagine.
> 

That aligns with what I was thinking as a first solution to allowing this
to be more dynamic.   We can get clever later if this doesn't prove sufficient.

Jonathan

> ~Gregory



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-06 16:26                             ` Jonathan Cameron
@ 2026-02-06 16:32                               ` Gregory Price
  2026-02-19 14:19                                 ` Jonathan Cameron
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Price @ 2026-02-06 16:32 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Andrew Morton, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel,
	David Hildenbrand (Arm)

On Fri, Feb 06, 2026 at 04:26:44PM +0000, Jonathan Cameron wrote:
> > 
> > This should fall out cleanly.
> > 
> > The additional nodes won't be associated with anything, but could be
> > used for hotplug - I imagine.
> > 
> 
> That aligns with what I was thinking as a first solution to allowing this
> to be more dynamic.   We can get clever later if this doesn't prove sufficient.
> 

I can get this out pretty quickly, hopefully sometime next week.

I had a long talk with Dan about this topic previously, and I'm not sure
how we get more dynamic than this to be honest.  nr_possible_nodes is
*definitely* expected to be immutable after __init all over the kernel,
it's used to allocate memory.

Surface area is very large here.

~Gregory


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID of CFMW
  2026-02-06 16:32                               ` Gregory Price
@ 2026-02-19 14:19                                 ` Jonathan Cameron
  0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2026-02-19 14:19 UTC (permalink / raw)
  To: Gregory Price
  Cc: Andrew Morton, Cui Chao, dan.j.williams, Mike Rapoport,
	Wang Yinfeng, linux-cxl, linux-kernel, linux-mm, qemu-devel,
	David Hildenbrand (Arm)

On Fri, 6 Feb 2026 11:32:09 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Fri, Feb 06, 2026 at 04:26:44PM +0000, Jonathan Cameron wrote:
> > > 
> > > This should fall out cleanly.
> > > 
> > > The additional nodes won't be associated with anything, but could be
> > > used for hotplug - I imagine.
> > >   
> > 
> > That aligns with what I was thinking as a first solution to allowing this
> > to be more dynamic.   We can get clever later if this doesn't prove sufficient.
> >   
> 
> I can get this out pretty quickly, hopefully sometime next week.
> 
> I had a long talk with Dan about this topic previously, and I'm not sure
> how we get more dynamic than this to be honest.  nr_possible_nodes is
> *definitely* expected to be immutable after __init all over the kernel,
> it's used to allocate memory.

Indeed. But that doesn't mean to say they are all in use after __init.

You end up allocating a bunch of space that is not used until there
is some memory there.  Not a problem.

So dynamic nr_possible_nodes is tricky.  Dynamic allocation of stuff into
those nodes is fine. That happens with memory hotplug today.

J
> 
> Surface area is very large here.
> 
> ~Gregory



^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2026-02-19 14:19 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-06  3:10 [PATCH v2 0/1] Identify the accurate NUMA ID of CFMW Cui Chao
2026-01-06  3:10 ` [PATCH v2 1/1] mm: numa_memblks: " Cui Chao
2026-01-08 16:19   ` Jonathan Cameron
2026-01-08 17:48   ` Andrew Morton
2026-01-15  9:43     ` Cui Chao
2026-01-15 18:18       ` Andrew Morton
2026-01-15 19:50         ` dan.j.williams
2026-01-22  8:03           ` Cui Chao
2026-01-22 21:28             ` Andrew Morton
2026-01-23  8:59               ` Cui Chao
2026-01-23 16:46             ` Gregory Price
2026-01-26  9:06               ` Cui Chao
2026-02-05 22:58                 ` Andrew Morton
2026-02-05 23:10                   ` Gregory Price
2026-02-06 11:03                     ` Jonathan Cameron
2026-02-06 13:31                       ` Gregory Price
2026-02-06 15:09                         ` Jonathan Cameron
2026-02-06 15:53                           ` Gregory Price
2026-02-06 16:26                             ` Jonathan Cameron
2026-02-06 16:32                               ` Gregory Price
2026-02-19 14:19                                 ` Jonathan Cameron
2026-02-06 15:57                           ` Andrew Morton
2026-02-06 16:23                             ` Jonathan Cameron
2026-01-09  9:35   ` Pratyush Brahma
2026-01-15 10:06     ` Cui Chao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox