* [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
@ 2025-05-07 9:35 rakie.kim
2025-05-07 16:38 ` Gregory Price
From: rakie.kim @ 2025-05-07 9:35 UTC (permalink / raw)
To: gourry, joshua.hahnjy
Cc: akpm, linux-mm, linux-kernel, linux-cxl, dan.j.williams,
ying.huang, kernel_team, honggyu.kim, yunjeong.mun, rakie.kim
Hi Gregory, Joshua,
I hope this message finds you well. I'm writing to discuss a feature I
believe would enhance the flexibility of the weighted interleave policy:
support for per-socket weighting in multi-socket systems.
---
<Background and prior design context>
While reviewing the early versions of the weighted interleave patches,
I noticed that a source-aware weighting structure was included in v1:
https://lore.kernel.org/all/20231207002759.51418-1-gregory.price@memverge.com/
However, this structure was removed in a later version:
https://lore.kernel.org/all/20231209065931.3458-1-gregory.price@memverge.com/
Unfortunately, I was unable to participate in the discussion at that
time, and I sincerely apologize for missing it.
From what I understand, there may have been valid reasons for removing
the source-relative design, including:
1. Increased complexity in mempolicy internals. Adding source awareness
introduces challenges around dynamic nodemask changes, task policy
sharing during fork(), mbind(), rebind(), etc.
2. A lack of concrete, motivating use cases. At that stage, it might
have been more pragmatic to focus on a 1D flat weight array.
If there were additional reasons, I would be grateful to learn them.
That said, I would like to revisit this idea now, as I believe some
real-world NUMA configurations would benefit significantly from
reintroducing this capability.
---
<Motivation: realistic multi-socket memory topologies>
The system I am testing includes multiple CPU sockets, each with local
DRAM and directly attached CXL memory. Here's a simplified diagram:
  node0             node1
+-------+   UPI   +-------+
| CPU 0 |-+-----+-| CPU 1 |
+-------+         +-------+
| DRAM0 |         | DRAM1 |
+---+---+         +---+---+
    |                 |
+---+---+         +---+---+
| CXL 0 |         | CXL 1 |
+-------+         +-------+
  node2             node3
This type of system is becoming more common, and in my tests, I
encountered two scenarios where per-socket weighting would be highly
beneficial.
Let's assume the following NUMA bandwidth matrix (GB/s):
       0    1    2    3
  0  300  150  100   50
  1  150  300   50  100
And flat weights:
node0 = 3
node1 = 3
node2 = 1
node3 = 1
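For reference, these flat weights map onto the existing global sysfs
interface and would be set roughly as follows (a sketch; node numbering
matches the diagram above):
    # Existing flat (global) weights, one value per node:
    echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
    echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
    echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
    echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node3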
---
Scenario 1: Adapt weighting based on the task's execution node
Many applications can achieve reasonable performance just by using the
CXL memory on their local socket. However, most workloads do not pin
tasks to a specific CPU node, and the current implementation does not
adjust weights based on where the task is running.
If per-source-node weighting were available, the following matrix could
be used:
       0   1   2   3
  0    3   0   1   0
  1    0   3   0   1
Which means:
1. A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
2. A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
3. A large, multithreaded task using both sockets should get both sets
This flexibility is currently not possible with a single flat weight
array.
---
Scenario 2: Reflect relative memory access performance
Remote memory access (e.g., from node0 to node3) incurs a real bandwidth
penalty. Ideally, weights should reflect this. For example:
Bandwidth-based matrix:
       0   1   2   3
  0    6   3   2   1
  1    3   6   1   2
Or DRAM + local CXL only:
       0   1   2   3
  0    6   0   2   1
  1    0   6   1   2
While scenario 1 is probably more common in practice, both can be
expressed within the same design if per-socket weights are supported.
---
<Proposed approach>
Instead of removing the current sysfs interface or flat weight logic, I
propose introducing an optional "multi" mode for per-socket weights.
This would allow users to opt into source-aware behavior.
(The name 'multi' is just an example and should be changed to a more
appropriate name in the future.)
Draft sysfs layout:
/sys/kernel/mm/mempolicy/weighted_interleave/
+-- multi              (bool: enable per-socket mode)
+-- node0              (flat weight for legacy/default mode)
+-- node_groups/
    +-- node0_group/
    |   +-- node0      (weight of node0 when running on node0)
    |   +-- node1
    +-- node1_group/
        +-- node0
        +-- node1
- When `multi` is false (default), existing behavior applies
- When `multi` is true, the system will use per-task `task_numa_node()`
to select a row in a 2D weight table
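For illustration, a usage sketch of the proposed (entirely hypothetical)
interface, programming the Scenario 1 row used by tasks executing on
node0, might look like this; the node_groups/ paths are placeholders:
    # Hypothetical: enable per-socket mode and set the node0 row
    echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/multi
    echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node_groups/node0_group/node0
    echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node_groups/node0_group/node2
    # node1/node3 entries would stay at 0 (see the zero-weight note below)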
---
<Additional implementation considerations>
1. Compatibility: The proposal avoids breaking the current interface or
behavior and remains backward-compatible.
2. Auto-tuning: Scenario 1 (local CXL + DRAM) likely works with minimal
change. Scenario 2 (bandwidth-aware tuning) would require more
development, and I would welcome Joshua's input on this.
3. Zero weights: Currently the minimum weight is 1. We may want to allow
zero to fully support asymmetric exclusion.
---
<Next steps>
Before beginning an implementation, I would like to validate this
direction with both of you:
- Does this approach fit with your current design intentions?
- Do you foresee problems with complexity, policy sharing, or interface?
- Is there a better alternative to express this idea?
If there's interest, I would be happy to send an RFC patch or prototype.
Thank you for your time and consideration.
Sincerely,
Rakie
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-07 9:35 [RFC] Add per-socket weight support for multi-socket systems in weighted interleave rakie.kim
@ 2025-05-07 16:38 ` Gregory Price
2025-05-08 6:30 ` Rakie Kim
From: Gregory Price @ 2025-05-07 16:38 UTC (permalink / raw)
To: rakie.kim
Cc: joshua.hahnjy, akpm, linux-mm, linux-kernel, linux-cxl,
dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun
On Wed, May 07, 2025 at 06:35:16PM +0900, rakie.kim@sk.com wrote:
> Hi Gregory, Joshua,
>
> I hope this message finds you well. I'm writing to discuss a feature I
> believe would enhance the flexibility of the weighted interleave policy:
> support for per-socket weighting in multi-socket systems.
>
> ---
>
> <Background and prior design context>
>
> While reviewing the early versions of the weighted interleave patches,
> I noticed that a source-aware weighting structure was included in v1:
>
> https://lore.kernel.org/all/20231207002759.51418-1-gregory.price@memverge.com/
>
> However, this structure was removed in a later version:
>
> https://lore.kernel.org/all/20231209065931.3458-1-gregory.price@memverge.com/
>
> Unfortunately, I was unable to participate in the discussion at that
> time, and I sincerely apologize for missing it.
>
> From what I understand, there may have been valid reasons for removing
> the source-relative design, including:
>
> 1. Increased complexity in mempolicy internals. Adding source awareness
> introduces challenges around dynamic nodemask changes, task policy
> sharing during fork(), mbind(), rebind(), etc.
>
> 2. A lack of concrete, motivating use cases. At that stage, it might
> have been more pragmatic to focus on a 1D flat weight array.
>
> If there were additional reasons, I would be grateful to learn them.
>
x. task-local weights would have required additional syscalls, and there
were too few active users to warrant the extra complexity.
y. numa interfaces don't capture cross-socket interconnect information,
and as a result actually hide "True" bandwidth values from the
perspective of a given socket.
As a result, mempolicy just isn't well positioned to deal with this
as-designed, and introducing the per-task weights w/ the additional
extensions just was a bridge too far. Global weights are sufficient
if you combine cpusets/core-pinning and a nodemask that excludes
cross-socket nodes (i.e.: Don't use cross-socket memory).
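Roughly, that combination looks like this today (a sketch for the
2-socket example; core ranges are made up, and a numactl build with
--weighted-interleave support is assumed):
    # cgroup v2: pin the job to socket 0's cores, restrict memory to node0+node2
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    mkdir /sys/fs/cgroup/socket0
    echo 0-31 > /sys/fs/cgroup/socket0/cpuset.cpus   # socket-0 cores (example)
    echo 0,2  > /sys/fs/cgroup/socket0/cpuset.mems   # DRAM0 + CXL0 only
    echo $$   > /sys/fs/cgroup/socket0/cgroup.procs
    # global weights then apply within the restricted nodemask
    numactl --weighted-interleave=0,2 ./app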
For workloads that do scale up to use both sockets and both devices,
you either want to spread it out according to global weights or use
region-specific (mbind) weighted interleave anyway.
> ---
>
> Scenario 1: Adapt weighting based on the task's execution node
>
> Many applications can achieve reasonable performance just by using the
> CXL memory on their local socket. However, most workloads do not pin
> tasks to a specific CPU node, and the current implementation does not
> adjust weights based on where the task is running.
>
"Most workloads don't..." - but they can, and fairly cleanly via
cgroups/cpusets.
> If per-source-node weighting were available, the following matrix could
> be used:
>
> 0 1 2 3
> 0 3 0 1 0
> 1 0 3 0 1
>
> This flexibility is currently not possible with a single flat weight
> array.
This can be done with a mempolicy that omits undesired nodes from the
nodemask - without requiring any changes.
>
> Scenario 2: Reflect relative memory access performance
>
> Remote memory access (e.g., from node0 to node3) incurs a real bandwidth
> penalty. Ideally, weights should reflect this. For example:
>
> Bandwidth-based matrix:
>
> 0 1 2 3
> 0 6 3 2 1
> 1 3 6 1 2
>
> Or DRAM + local CXL only:
>
> 0 1 2 3
> 0 6 0 2 1
> 1 0 6 1 2
>
> While scenario 1 is probably more common in practice, both can be
> expressed within the same design if per-socket weights are supported.
>
The core issue here is actually that NUMA doesn't have a good way to
represent the cross-socket interconnect bandwidth - and the fact that it
abstracts all devices behind it (both DRAM and CXL).
So reasoning about this problem in terms of NUMA is trying to fit a
square peg in a round hole. I think it's the wrong tool - maybe we need
a new one. I don't know what this looks like.
> ---
>
> <Proposed approach>
>
> Instead of removing the current sysfs interface or flat weight logic, I
> propose introducing an optional "multi" mode for per-socket weights.
> This would allow users to opt into source-aware behavior.
> (The name 'multi' is just an example and should be changed to a more
> appropriate name in the future.)
>
> Draft sysfs layout:
>
> /sys/kernel/mm/mempolicy/weighted_interleave/
> +-- multi              (bool: enable per-socket mode)
> +-- node0              (flat weight for legacy/default mode)
> +-- node_groups/
>     +-- node0_group/
>     |   +-- node0      (weight of node0 when running on node0)
>     |   +-- node1
>     +-- node1_group/
>         +-- node0
>         +-- node1
>
This is starting to look like memory-tiers.c, which is largely useless
at the moment. Maybe we implement such logic in memory-tiers, and then
extend mempolicy to have a MPOL_MEMORY_TIER or MPOL_F_MEMORY_TIER?
That would give us better flexibility to design the mempolicy interface
without having to be bound by the NUMA infrastructure it presently
depends on. We can figure out how to collect cross-socket interconnect
information in memory-tiers, and see what issues we'll have with
engaging that information from the mempolicy/page allocator path.
You'll see in very very early versions of weighted interleave I
originally implemented it via memory-tiers. You might look there for
inspiration.
> <Additional implementation considerations>
>
> 1. Compatibility: The proposal avoids breaking the current interface or
> behavior and remains backward-compatible.
>
> 2. Auto-tuning: Scenario 1 (local CXL + DRAM) likely works with minimal
> change. Scenario 2 (bandwidth-aware tuning) would require more
> development, and I would welcome Joshua's input on this.
>
> 3. Zero weights: Currently the minimum weight is 1. We may want to allow
> zero to fully support asymmetric exclusion.
>
I think we need to explore different changes here - it's become fairly
clear when discussing tiering at LSFMM that NUMA is a dated abstraction
that is showing its limits here. Let's ask what information we want and
how to structure/interact with it first, before designing the sysfs
interface for it.
~Gregory
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-07 16:38 ` Gregory Price
@ 2025-05-08 6:30 ` Rakie Kim
2025-05-08 15:12 ` Gregory Price
From: Rakie Kim @ 2025-05-08 6:30 UTC (permalink / raw)
To: Gregory Price
Cc: joshua.hahnjy, akpm, linux-mm, linux-kernel, linux-cxl,
dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun, rakie.kim
On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@gourry.net> wrote:
Thank you very much for your detailed response. I fully agree with your points
regarding the limitations of the NUMA abstraction and the potential direction
with memory-tiers. These are indeed important and forward-looking suggestions.
That said, I would still like to emphasize the importance of supporting a
source-aware weight mechanism within the current weighted interleave policy,
even as an optional extension.
The proposed design is completely optional and isolated: it retains the
existing flat weight model as-is and activates the source-aware behavior only
when 'multi' mode is enabled. The complexity is scoped entirely to users who
opt into this mode.
> On Wed, May 07, 2025 at 06:35:16PM +0900, rakie.kim@sk.com wrote:
> > Hi Gregory, Joshua,
> >
> > I hope this message finds you well. I'm writing to discuss a feature I
> > believe would enhance the flexibility of the weighted interleave policy:
> > support for per-socket weighting in multi-socket systems.
> >
> > ---
> >
> > <Background and prior design context>
> >
> > While reviewing the early versions of the weighted interleave patches,
> > I noticed that a source-aware weighting structure was included in v1:
> >
> > https://lore.kernel.org/all/20231207002759.51418-1-gregory.price@memverge.com/
> >
> > However, this structure was removed in a later version:
> >
> > https://lore.kernel.org/all/20231209065931.3458-1-gregory.price@memverge.com/
> >
> > Unfortunately, I was unable to participate in the discussion at that
> > time, and I sincerely apologize for missing it.
> >
> > From what I understand, there may have been valid reasons for removing
> > the source-relative design, including:
> >
> > 1. Increased complexity in mempolicy internals. Adding source awareness
> > introduces challenges around dynamic nodemask changes, task policy
> > sharing during fork(), mbind(), rebind(), etc.
> >
> > 2. A lack of concrete, motivating use cases. At that stage, it might
> > have been more pragmatic to focus on a 1D flat weight array.
> >
> > If there were additional reasons, I would be grateful to learn them.
> >
>
> x. task local weights would have required additional syscalls, and there
> was insufficient active users to warrant the extra complexity.
I agree that additional syscalls are not necessary, and this proposal
does not require any.
>
> y. numa interfaces don't capture cross-socket interconnect information,
> and as a result actually hides "True" bandwidth values from the
> perspective of a given socket.
>
> As a result, mempolicy just isn't well positioned to deal with this
> as-designed, and introducing the per-task weights w/ the additional
> extensions just was a bridge too far. Global weights are sufficient
> if you combine cpusets/core-pinning and a nodemask that excludes
> cross-socket nodes (i.e.: Don't use cross-socket memory).
>
> For workloads that do scale up to use both sockets and both devices,
> you either want to spread it out according to global weights or use
> region-specific (mbind) weighted interleave anyway.
>
Cpuset and cgroups can control task placement, but they cannot dynamically
alter memory node preferences based on the current execution node. For
multi-threaded tasks running across multiple sockets, cpusets alone cannot
represent per-socket locality preferences.
Source-aware weights, on the other hand, enable automatic memory node
selection based on where a task is running, which greatly improves flexibility
in hybrid bandwidth environments.
> > ---
> >
> > Scenario 1: Adapt weighting based on the task's execution node
> >
> > Many applications can achieve reasonable performance just by using the
> > CXL memory on their local socket. However, most workloads do not pin
> > tasks to a specific CPU node, and the current implementation does not
> > adjust weights based on where the task is running.
> >
>
> "Most workloads don't..." - but they can, and fairly cleanly via
> cgroups/cpusets.
>
> > If per-source-node weighting were available, the following matrix could
> > be used:
> >
> > 0 1 2 3
> > 0 3 0 1 0
> > 1 0 3 0 1
> >
> > This flexibility is currently not possible with a single flat weight
> > array.
>
> This can be done with a mempolicy that omits undesired nodes from the
> nodemask - without requiring any changes.
A nodemask can restrict accessible memory nodes, but it cannot implement conditional
preferences based on task execution locality. For example, if a task runs on
node0, it may want to prefer {0,2}; if on node1, prefer {1,3}.
In the current model, implementing this would require runtime updates to the
nodemask per task, which is neither scalable nor practical. Source-aware
weights aim to encode this logic directly into policy behavior.
>
> >
> > Scenario 2: Reflect relative memory access performance
> >
> > Remote memory access (e.g., from node0 to node3) incurs a real bandwidth
> > penalty. Ideally, weights should reflect this. For example:
> >
> > Bandwidth-based matrix:
> >
> > 0 1 2 3
> > 0 6 3 2 1
> > 1 3 6 1 2
> >
> > Or DRAM + local CXL only:
> >
> > 0 1 2 3
> > 0 6 0 2 1
> > 1 0 6 1 2
> >
> > While scenario 1 is probably more common in practice, both can be
> > expressed within the same design if per-socket weights are supported.
> >
>
> The core issue here is actually that NUMA doesn't have a good way to
> represent the cross-socket interconnect bandwidth - and the fact that it
> abstracts all devices behind it (both DRAM and CXL).
>
> So reasoning about this problem in terms of NUMA is trying to fit a
> square peg in a round hole. I think it's the wrong tool - maybe we need
> a new one. I don't know what this looks like.
I agree. NUMA does abstract away cross-socket topology and bandwidth details,
especially for DRAM vs CXL. But I believe weighted interleave was itself an
attempt to make NUMA more topology-aware via per-node weights.
The source-aware extension is a logical next step in that direction,
introducing per-socket decision logic without replacing the NUMA model.
While it's clear that the NUMA abstraction has reached its limits in some
areas, the memory policy, sysfs interface, and page allocator are still built
on NUMA. Rather than discarding NUMA outright, I believe we should iterate on
it by introducing well-scoped enhancements, such as this one, to better
understand our future needs.
>
> > ---
> >
> > <Proposed approach>
> >
> > Instead of removing the current sysfs interface or flat weight logic, I
> > propose introducing an optional "multi" mode for per-socket weights.
> > This would allow users to opt into source-aware behavior.
> > (The name 'multi' is just an example and should be changed to a more
> > appropriate name in the future.)
> >
> > Draft sysfs layout:
> >
> > /sys/kernel/mm/mempolicy/weighted_interleave/
> > +-- multi              (bool: enable per-socket mode)
> > +-- node0              (flat weight for legacy/default mode)
> > +-- node_groups/
> >     +-- node0_group/
> >     |   +-- node0      (weight of node0 when running on node0)
> >     |   +-- node1
> >     +-- node1_group/
> >         +-- node0
> >         +-- node1
> >
>
> This is starting to look like memory-tiers.c, which is largely useless
> at the moment. Maybe we implement such logic in memory-tiers, and then
> extend mempolicy to have a MPOL_MEMORY_TIER or MPOL_F_MEMORY_TIER?
>
> That would give us better flexibility to design the mempolicy interface
> without having to be bound by the NUMA infrastructure it presently
> depends on. We can figure out how to collect cross-socket interconnect
> information in memory-tiers, and see what issues we'll have with
> engaging that information from the mempolicy/page allocator path.
>
> You'll see in very very early versions of weighted interleave I
> originally implemented it via memory-tiers. You might look there for
> inspiration.
This is an excellent idea. I fully agree that memory-tiers has strong potential
as a future foundation for flexible memory classification and topology-aware
policy enforcement.
However, in its current form, memory-tiers lacks integration with mempolicy and
does not yet expose weight-based policy control over allocation decisions.
Weighted interleave, by contrast, already connects to allocator logic and
enables immediate policy experimentation. I view this proposal as a practical
starting point for validating ideas that could later inform a memory-tiers
based design.
As you mentioned, early versions of weighted interleave used memory-tiers.
I will revisit that implementation and analyze how the concepts could align or
transition toward the model you're proposing.
>
> > <Additional implementation considerations>
> >
> > 1. Compatibility: The proposal avoids breaking the current interface or
> > behavior and remains backward-compatible.
> >
> > 2. Auto-tuning: Scenario 1 (local CXL + DRAM) likely works with minimal
> > change. Scenario 2 (bandwidth-aware tuning) would require more
> > development, and I would welcome Joshua's input on this.
> >
> > 3. Zero weights: Currently the minimum weight is 1. We may want to allow
> > zero to fully support asymmetric exclusion.
> >
>
> I think we need to explore different changes here - it's become fairly
> clear when discussing tiering at LSFMM that NUMA is a dated abstraction
> that is showing its limits here. Lets ask what information we want and
> how to structure/interact with it first, before designing the sysfs
> interface for it.
>
> ~Gregory
In conclusion, while I acknowledge the NUMA model's aging limitations,
I believe source-aware weights offer a focused and minimal expansion of the
existing framework. This enables:
- policy control based on execution locality,
- compatibility with current NUMA-based infrastructure, and
- a conceptual bridge toward future tier-aware models.
Thank you again for your valuable feedback and thoughtful suggestions.
I will review your memory-tiers guidance in more detail and follow up with
further analysis.
Rakie
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-08 6:30 ` Rakie Kim
@ 2025-05-08 15:12 ` Gregory Price
2025-05-09 2:30 ` Rakie Kim
2025-05-09 11:31 ` Jonathan Cameron
From: Gregory Price @ 2025-05-08 15:12 UTC (permalink / raw)
To: Rakie Kim
Cc: joshua.hahnjy, akpm, linux-mm, linux-kernel, linux-cxl,
dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun
On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@gourry.net> wrote:
>
> The proposed design is completely optional and isolated: it retains the
> existing flat weight model as-is and activates the source-aware behavior only
> when 'multi' mode is enabled. The complexity is scoped entirely to users who
> opt into this mode.
>
I get what you're going for, just expressing my experience around this
issue specifically.
The lack of enthusiasm for solving the cross-socket case, and thus
reduction from a 2D array to a 1D array, was because reasoning about
interleave w/ cross-socket interconnects is not really feasible with
the NUMA abstraction. Cross-socket interconnects are "Invisible" but
have real performance implications. Unless we have a way to:
1) Represent the topology, AND
2) A way to get performance about that topology
It's not useful. So NUMA is an incomplete (if not wrong) tool for this.
Additionally - reacting to task migration is not a real issue. If
you're deploying an allocation strategy, you probably don't want your
task migrating away from the place where you just spent a bunch of time
allocating based on some existing strategy. So the solution is: don't
migrate, and if you do - don't use cross-socket interleave.
Maybe if we solve the first half of this we can take a look at the task
migration piece again, but I wouldn't try to solve for migration.
At the same time we were discussing this, we were also discussing how to
do external task-mempolicy modifications - which seemed significantly
more useful, but ultimately more complex and without sufficient
interested parties / users.
~Gregory
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-08 15:12 ` Gregory Price
@ 2025-05-09 2:30 ` Rakie Kim
2025-05-09 5:49 ` Gregory Price
2025-05-09 11:31 ` Jonathan Cameron
From: Rakie Kim @ 2025-05-09 2:30 UTC (permalink / raw)
To: Gregory Price
Cc: joshua.hahnjy, akpm, linux-mm, linux-kernel, linux-cxl,
dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun, Rakie Kim
On Thu, 8 May 2025 11:12:35 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@gourry.net> wrote:
> >
> > The proposed design is completely optional and isolated: it retains the
> > existing flat weight model as-is and activates the source-aware behavior only
> > when 'multi' mode is enabled. The complexity is scoped entirely to users who
> > opt into this mode.
> >
>
> I get what you're going for, just expressing my experience around this
> issue specifically.
Thank you very much for your response. Your prior experience and insights
have been extremely helpful in refining how I think about this problem.
>
> The lack of enthusiasm for solving the cross-socket case, and thus
> reduction from a 2D array to a 1D array, was because reasoning about
> interleave w/ cross-socket interconnects is not really feasible with
> the NUMA abstraction. Cross-socket interconnects are "Invisible" but
> have real performance implications. Unless we have a way to:
>
> 1) Represent the topology, AND
> 2) A way to get performance about that topology
>
> It's not useful. So NUMA is an incomplete (if not wrong) tool for this.
Your comment gave me an opportunity to reconsider the purpose of the
feature I originally proposed. In fact, I had two different scenarios
in mind when outlining this direction.
Scenario 1: Adapt weighting based on the task's execution node
A task prefers only the DRAM and locally attached CXL memory of the
socket on which it is running, in order to avoid cross-socket access and
optimize bandwidth.
- A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
- A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
Scenario 2: Reflect relative memory access performance
The system adjusts weights based on expected bandwidth differences for
remote accesses. This relies on having access to interconnect performance
data, which NUMA currently does not expose.
As you rightly pointed out, Scenario 2 depends on being able to measure
or model the cost of cross-socket access, which is not available in the
current abstraction. I now realize that this case is less actionable and
needs further research before being pursued.
However, Scenario 1 does not depend on such information. Rather, it is
a locality-preserving optimization where we isolate memory access to
each socket's DRAM and CXL nodes. I believe this use case is implementable
today and worth considering independently from interconnect performance
awareness.
>
> Additionally - reacting to task migration is not a real issue. If
> you're deploying an allocation strategy, you probably don't want your
> task migrating away from the place where you just spent a bunch of time
> allocating based on some existing strategy. So the solution is: don't
> migrate, and if you do - don't use cross-socket interleave.
That's a fair point. I also agree that handling migration is not critical
at this stage, and I'm not actively focusing on that aspect in this
proposal.
>
> Maybe if we solve the first half of this we can take a look at the task
> migration piece again, but I wouldn't try to solve for migration.
>
> At the same time we were discussing this, we were also discussing how to
> do external task-mempolicy modifications - which seemed significantly
> more useful, but ultimately more complex and without sufficient
> interested parties / users.
I'd like to learn more about that thread. If you happen to have a pointer
to that discussion, it would be really helpful.
>
> ~Gregory
>
Thanks again for sharing your insights. I will follow up with a refined
proposal based on the localized socket-based routing model (Scenario 1)
and will give further consideration to the parts dependent on topology
performance measurement for now.
Rakie
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-09 2:30 ` Rakie Kim
@ 2025-05-09 5:49 ` Gregory Price
2025-05-12 8:22 ` Rakie Kim
From: Gregory Price @ 2025-05-09 5:49 UTC (permalink / raw)
To: Rakie Kim
Cc: joshua.hahnjy, akpm, linux-mm, linux-kernel, linux-cxl,
dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun
On Fri, May 09, 2025 at 11:30:26AM +0900, Rakie Kim wrote:
>
> Scenario 1: Adapt weighting based on the task's execution node
> A task prefers only the DRAM and locally attached CXL memory of the
> socket on which it is running, in order to avoid cross-socket access and
> optimize bandwidth.
> - A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
> - A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
... snip ...
>
> However, Scenario 1 does not depend on such information. Rather, it is
> a locality-preserving optimization where we isolate memory access to
> each socket's DRAM and CXL nodes. I believe this use case is implementable
> today and worth considering independently from interconnect performance
> awareness.
>
There's nothing to implement - all the controls exist:
1) --cpunodebind=0
2) --weighted-interleave=0,2
3) cpuset.mems
4) cpuset.cpus
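e.g. for the 2-socket example above (a sketch, assuming a numactl build
with --weighted-interleave support):
    numactl --cpunodebind=0 --weighted-interleave=0,2 ./app   # socket 0: DRAM0 + CXL0
    numactl --cpunodebind=1 --weighted-interleave=1,3 ./app   # socket 1: DRAM1 + CXL1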
You might consider maybe something like "--local-tier" (akin to
--localalloc) that sets an explicitly fallback set based on the local
node. You'd end up doing something like
current_nid = memtier_next_local_node(socket_nid, current_nid)
Where this interface returns the preferred fallback ordering but doesn't
allow cross-socket fallback.
That might be useful, I suppose, in letting a user do:
--cpunodebind=0 --weighted-interleave --local-tier
without having to know anything about the local memory tier structure.
> > At the same time we were discussing this, we were also discussing how to
> > do external task-mempolicy modifications - which seemed significantly
> > more useful, but ultimately more complex and without sufficient
> > interested parties / users.
>
> I'd like to learn more about that thread. If you happen to have a pointer
> to that discussion, it would be really helpful.
>
https://lore.kernel.org/all/20231122211200.31620-1-gregory.price@memverge.com/
https://lore.kernel.org/all/ZV5zGROLefrsEcHJ@r13-u19.micron.com/
https://lore.kernel.org/linux-mm/ZWYsth2CtC4Ilvoz@memverge.com/
https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/
There are locking issues with these that aren't easy to fix.
I think the bytedance method uses a task_work queueing to defer a
mempolicy update to the task itself the next time it makes a kernel/user
transition. That's probably the best overall approach i've seen.
https://lore.kernel.org/linux-mm/ZWezcQk+BYEq%2FWiI@memverge.com/
More notes gathered prior to implementing weighted interleave.
~Gregory
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-08 15:12 ` Gregory Price
2025-05-09 2:30 ` Rakie Kim
@ 2025-05-09 11:31 ` Jonathan Cameron
2025-05-09 16:29 ` Gregory Price
2025-05-12 8:23 ` Rakie Kim
From: Jonathan Cameron @ 2025-05-09 11:31 UTC (permalink / raw)
To: Gregory Price
Cc: Rakie Kim, joshua.hahnjy, akpm, linux-mm, linux-kernel,
linux-cxl, dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun, Keith Busch, Jerome Glisse
On Thu, 8 May 2025 11:12:35 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@gourry.net> wrote:
> >
> > The proposed design is completely optional and isolated: it retains the
> > existing flat weight model as-is and activates the source-aware behavior only
> > when 'multi' mode is enabled. The complexity is scoped entirely to users who
> > opt into this mode.
> >
>
> I get what you're going for, just expressing my experience around this
> issue specifically.
>
> The lack of enthusiasm for solving the cross-socket case, and thus
> reduction from a 2D array to a 1D array, was because reasoning about
> interleave w/ cross-socket interconnects is not really feasible with
> the NUMA abstraction. Cross-socket interconnects are "Invisible" but
> have real performance implications. Unless we have a way to:
Sort of invisible... Their exact topology is hidden, but we do have some info...
>
> 1) Represent the topology, AND
> 2) A way to get performance about that topology
There was some discussion on this at LSF-MM.
+CC Keith and Jerome who were once interested in this topic
It's not perfect but ACPI HMAT does have what is probably sufficient info
for a simple case like this (2 socket server + Generic Ports and CXL
description of the rest of the path), it's just that today we aren't exposing that
to userspace (instead only the BW / Latency from a single selected nearest initiator
/CPU node to any memory containing node).
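Concretely, that nearest-initiator view is what userspace sees today,
e.g. something like (a sketch; node numbers depend on the platform):
    # HMAT-derived attributes for the nearest initiator, access class 0:
    cat /sys/devices/system/node/node2/access0/initiators/read_bandwidth
    cat /sys/devices/system/node/node2/access0/initiators/read_latency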
That decision was much discussed back when Keith was adding HMAT support.
At that time the question was what workload needed the dense info (2D matrix)
and we didn't have one. With weighted interleave I think we do.
As to the problems...
We come unstuck badly in much more complex situations, because that
information is load-free: if there is heavy contention on one shared link
between islands of nodes, it can give a very misleading picture.
[CXL Node 0]                         [CXL Node 2]
     |                                     |
  [NODE A]---\                    /----[NODE C]
              \___Shared link____/
              /                  \
  [NODE B]---/                    \----[NODE D]
     |                                     |
[CXL Node 1]                         [CXL Node 3]
From ACPI's point of view, the topology above looks much like this
(a fully connected 4-socket system).
[CXL Node 0]                         [CXL Node 2]
     |                                     |
  [NODE A]-----------------------------[NODE C]
     |    \____________________________  / |
     |     ____________________________\/  |
     |    /                             \  |
  [NODE B]-----------------------------[NODE D]
     |                                     |
[CXL Node 1]                         [CXL Node 3]
In the first case we should probably halve the BW of the shared link or
something like that; in the second case, use the full value. In general we
have no way to know which one we have, and it gets way more fun with 8+
sockets :)
SLIT is indeed useless for anything other than "what's nearest" decisions.
Anyhow, short term I'd like us to revisit what info we present from HMAT
(and what we get from CXL topology descriptions which have pretty much everything we
might want).
That should put the info in userspace to tune weighted interleave better anyway
and perhaps provide the info you need here.
So just all the other problems to solve ;)
J
>
> It's not useful. So NUMA is an incomplete (if not wrong) tool for this.
>
> Additionally - reacting to task migration is not a real issue. If
> you're deploying an allocation strategy, you probably don't want your
> task migrating away from the place where you just spent a bunch of time
> allocating based on some existing strategy. So the solution is: don't
> migrate, and if you do - don't use cross-socket interleave.
>
> Maybe if we solve the first half of this we can take a look at the task
> migration piece again, but I wouldn't try to solve for migration.
>
> At the same time we were discussing this, we were also discussing how to
> do external task-mempolicy modifications - which seemed significantly
> more useful, but ultimately more complex and without sufficient
> interested parties / users.
>
> ~Gregory
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-09 11:31 ` Jonathan Cameron
@ 2025-05-09 16:29 ` Gregory Price
2025-05-12 8:23 ` Rakie Kim
2025-05-12 8:23 ` Rakie Kim
From: Gregory Price @ 2025-05-09 16:29 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Rakie Kim, joshua.hahnjy, akpm, linux-mm, linux-kernel,
linux-cxl, dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun, Keith Busch, Jerome Glisse
On Fri, May 09, 2025 at 12:31:31PM +0100, Jonathan Cameron wrote:
> Anyhow, short term I'd like us to revisit what info we present from HMAT
> (and what we get from CXL topology descriptions which have pretty much everything we
> might want).
>
Generally I think if there is new data to enrich the environment, we
should try to collect that first before laying down requirements for new
interfaces / policies. So tl;dr: "This first, please!"
(I know we discussed this at LSFMM, dropped out of my memory banks)
~Gregory
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-09 5:49 ` Gregory Price
@ 2025-05-12 8:22 ` Rakie Kim
From: Rakie Kim @ 2025-05-12 8:22 UTC (permalink / raw)
To: Gregory Price
Cc: joshua.hahnjy, akpm, linux-mm, linux-kernel, linux-cxl,
dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun, Rakie Kim
On Fri, 9 May 2025 01:49:59 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Fri, May 09, 2025 at 11:30:26AM +0900, Rakie Kim wrote:
> >
> > Scenario 1: Adapt weighting based on the task's execution node
> > A task prefers only the DRAM and locally attached CXL memory of the
> > socket on which it is running, in order to avoid cross-socket access and
> > optimize bandwidth.
> > - A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
> > - A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
> ... snip ...
> >
> > However, Scenario 1 does not depend on such information. Rather, it is
> > a locality-preserving optimization where we isolate memory access to
> > each socket's DRAM and CXL nodes. I believe this use case is implementable
> > today and worth considering independently from interconnect performance
> > awareness.
> >
>
> There's nothing to implement - all the controls exist:
>
> 1) --cpunodebind=0
> 2) --weighted-interleave=0,2
> 3) cpuset.mems
> 4) cpuset.cpus
Thank you again for your thoughtful response and the detailed suggestions.
As you pointed out, it is indeed possible to construct node-local memory
allocation behaviors using the existing interfaces such as --cpunodebind,
--weighted-interleave, cpuset.mems, and cpuset.cpus. I appreciate you
highlighting that path.
However, what I am proposing in Scenario 1 (Adapt weighting based on the
task's execution node) is slightly different in intent.
The idea is to allow tasks to dynamically prefer the DRAM and CXL nodes
attached to the socket on which they are executing without requiring a
fixed execution node or manual nodemask configuration. For instance, if
a task is running on node0, it would prefer node0 and node2; if running
on node1, it would prefer node1 and node3.
This differs from the current model, which relies on statically binding
both the CPU and memory nodes. My proposal aims to express this behavior
as a policy-level abstraction that dynamically adapts based on execution
locality.
So rather than being a combination of manual configuration and execution
constraints, the intent is to incorporate locality-awareness into the
memory policy itself.
>
> You might consider maybe something like "--local-tier" (akin to
> --localalloc) that sets an explicitly fallback set based on the local
> node. You'd end up doing something like
>
> current_nid = memtier_next_local_node(socket_nid, current_nid)
>
> Where this interface returns the preferred fallback ordering but doesn't
> allow cross-socket fallback.
>
> That might be useful, i suppose, in letting a user do:
>
> --cpunodebind=0 --weighted-interleave --local-tier
>
> without having to know anything about the local memory tier structure.
That said, I believe your suggestion for a "--local-tier" option is a
very good one. It could provide a concise, user-friendly way to activate
such locality-aware fallback behavior, even if the underlying mechanism
requires some policy extension.
In this regard, I fully agree that such an interface could greatly help
users express their intent without requiring them to understand the
details of the memory tier topology.
>
> > > At the same time we were discussing this, we were also discussing how to
> > > do external task-mempolicy modifications - which seemed significantly
> > > more useful, but ultimately more complex and without sufficient
> > > interested parties / users.
> >
> > I'd like to learn more about that thread. If you happen to have a pointer
> > to that discussion, it would be really helpful.
> >
>
> https://lore.kernel.org/all/20231122211200.31620-1-gregory.price@memverge.com/
> https://lore.kernel.org/all/ZV5zGROLefrsEcHJ@r13-u19.micron.com/
> https://lore.kernel.org/linux-mm/ZWYsth2CtC4Ilvoz@memverge.com/
> https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/
> There are locking issues with these that aren't easy to fix.
>
> I think the bytedance method uses a task_work queueing to defer a
> mempolicy update to the task itself the next time it makes a kernel/user
> transition. That's probably the best overall approach i've seen.
>
> https://lore.kernel.org/linux-mm/ZWezcQk+BYEq%2FWiI@memverge.com/
> More notes gathered prior to implementing weighted interleave.
Thank you for sharing the earlier links to related discussions and
patches. They were very helpful, and I will review them carefully to
gather more ideas and refine my thoughts further.
I look forward to any further feedback you may have on this topic.
Best regards,
Rakie
>
> ~Gregory
>
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-09 11:31 ` Jonathan Cameron
2025-05-09 16:29 ` Gregory Price
@ 2025-05-12 8:23 ` Rakie Kim
From: Rakie Kim @ 2025-05-12 8:23 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Rakie Kim, joshua.hahnjy, akpm, linux-mm, linux-kernel,
linux-cxl, dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun, Keith Busch, Jerome Glisse, Gregory Price
On Fri, 9 May 2025 12:31:31 +0100 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> On Thu, 8 May 2025 11:12:35 -0400
> Gregory Price <gourry@gourry.net> wrote:
>
> > On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@gourry.net> wrote:
> > >
> > > The proposed design is completely optional and isolated: it retains the
> > > existing flat weight model as-is and activates the source-aware behavior only
> > > when 'multi' mode is enabled. The complexity is scoped entirely to users who
> > > opt into this mode.
> > >
> >
> > I get what you're going for, just expressing my experience around this
> > issue specifically.
> >
> > The lack of enthusiasm for solving the cross-socket case, and thus
> > reduction from a 2D array to a 1D array, was because reasoning about
> > interleave w/ cross-socket interconnects is not really feasible with
> > the NUMA abstraction. Cross-socket interconnects are "Invisible" but
> > have real performance implications. Unless we have a way to:
>
> Sort of invisible... What their topology is, but we have some info...
>
> >
> > 1) Represent the topology, AND
> > 2) A way to get performance about that topology
>
> There was some discussion on this at LSF-MM.
>
> +CC Keith and Jerome who were once interested in this topic
>
> It's not perfect but ACPI HMAT does have what is probably sufficient info
> for a simple case like this (2 socket server + Generic Ports and CXL
> description of the rest of the path), it's just that today we aren't exposing that
> to userspace (instead only the BW / Latency from a single selected nearest initiator
> /CPU node to any memory containing node).
>
> That decision was much discussed back when Keith was adding HMAT support.
> At that time the question was what workload needed the dense info (2D matrix)
> and we didn't have one. With weighted interleave I think we do.
>
> As to the problems...
>
> We come unstuck badly in much more complex situations as that information
> is load free so if we have heavy contention due to one shared link between
> islands of nodes it can give a very misleading idea.
>
> [CXL Node 0]                         [CXL Node 2]
>      |                                     |
>   [NODE A]---\                    /----[NODE C]
>               \___Shared link____/
>               /                  \
>   [NODE B]---/                    \----[NODE D]
>      |                                     |
> [CXL Node 1]                         [CXL Node 3]
>
> In this from ACPI this looks much like this (fully connected
> 4 socket system).
>
> [CXL Node 0]                         [CXL Node 2]
>      |                                     |
>   [NODE A]-----------------------------[NODE C]
>      |    \____________________________  / |
>      |     ____________________________\/  |
>      |    /                             \  |
>   [NODE B]-----------------------------[NODE D]
>      |                                     |
> [CXL Node 1]                         [CXL Node 3]
>
> In the first case we should probably halve the BW of shared link or something
> like that. In the second case use the full version. In general we have no way
> to know which one we have and it gets way more fun with 8 + sockets :)
>
> SLIT is indeed useless for anything other than what's nearest decisions
>
> Anyhow, short term I'd like us to revisit what info we present from HMAT
> (and what we get from CXL topology descriptions which have pretty much everything we
> might want).
>
> That should put the info in userspace to tune weighted interleave better anyway
> and perhaps provide the info you need here.
>
> So just all the other problems to solve ;)
>
> J
Jonathan, thank you very much for your thoughtful response.
As you pointed out, ACPI HMAT and CXL topology descriptions do contain
meaningful information for simple systems such as two-socket platforms.
If that information were made more accessible to userspace, I believe
existing memory policies could be tuned with much greater precision.
I fully understand that such detailed topology data was not widely
exposed in the past, largely because there was little demand for it.
However, with the growing complexity of memory hierarchies in modern
systems, I believe its relevance and utility are increasing rapidly.
I also appreciate your point about the risks of misrepresentation in
more complex systems, especially where shared interconnect links can
cause bandwidth bottlenecks. That nuance is critical to consider when
designing or interpreting any policy relying on topology data.
In the short term, I fully agree that revisiting what information is
presented from HMAT and CXL topology, and how we surface it to userspace,
is a realistic and meaningful direction.
Thank you again for your insights, and I look forward to continuing the
discussion.
Rakie
>
> >
> > It's not useful. So NUMA is an incomplete (if not wrong) tool for this.
> >
> > Additionally - reacting to task migration is not a real issue. If
> > you're deploying an allocation strategy, you probably don't want your
> > task migrating away from the place where you just spent a bunch of time
> > allocating based on some existing strategy. So the solution is: don't
> > migrate, and if you do - don't use cross-socket interleave.
> >
> > Maybe if we solve the first half of this we can take a look at the task
> > migration piece again, but I wouldn't try to solve for migration.
> >
> > At the same time we were discussing this, we were also discussing how to
> > do external task-mempolicy modifications - which seemed significantly
> > more useful, but ultimately more complex and without sufficient
> > interested parties / users.
> >
> > ~Gregory
> >
>
>
* Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
2025-05-09 16:29 ` Gregory Price
@ 2025-05-12 8:23 ` Rakie Kim
From: Rakie Kim @ 2025-05-12 8:23 UTC (permalink / raw)
To: Gregory Price
Cc: Rakie Kim, joshua.hahnjy, akpm, linux-mm, linux-kernel,
linux-cxl, dan.j.williams, ying.huang, kernel_team, honggyu.kim,
yunjeong.mun, Keith Busch, Jerome Glisse, Jonathan Cameron
On Fri, 9 May 2025 12:29:53 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Fri, May 09, 2025 at 12:31:31PM +0100, Jonathan Cameron wrote:
> > Anyhow, short term I'd like us to revisit what info we present from HMAT
> > (and what we get from CXL topology descriptions which have pretty much everything we
> > might want).
> >
>
> Generally I think if there is new data to enrich the environment, we
> should try to collect that first before laying down requirements for new
> interfaces / policies. So tl;dr: "This first, please!"
>
> (I know we discussed this at LSFMM, dropped out of my memory banks)
>
> ~Gregory
>
Thank you for your response and for providing clear direction.
I fully agree with your suggestion that we should first focus on gathering
and exposing the relevant data before moving forward with new policies or
interfaces.
In practice, I believe many of the proposed enhancements can only function
meaningfully if we have a solid understanding of the memory topology and
interconnect structure, and if that information is reliably accessible in
userspace.
Without such data, there is a risk that even well-intentioned policies may
end up diverging from real hardware behavior, or possibly degrading system
performance.
Thank you again for pointing us in the right direction. I'll continue to
revisit my ideas along this path.
Best regards,
Rakie
Thread overview: 11 messages
2025-05-07 9:35 [RFC] Add per-socket weight support for multi-socket systems in weighted interleave rakie.kim
2025-05-07 16:38 ` Gregory Price
2025-05-08 6:30 ` Rakie Kim
2025-05-08 15:12 ` Gregory Price
2025-05-09 2:30 ` Rakie Kim
2025-05-09 5:49 ` Gregory Price
2025-05-12 8:22 ` Rakie Kim
2025-05-09 11:31 ` Jonathan Cameron
2025-05-09 16:29 ` Gregory Price
2025-05-12 8:23 ` Rakie Kim
2025-05-12 8:23 ` Rakie Kim