* [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning @ 2025-01-09 18:50 Joshua Hahn 2025-03-13 15:57 ` Joshua Hahn 2025-03-27 11:11 ` Oscar Salvador 0 siblings, 2 replies; 10+ messages in thread From: Joshua Hahn @ 2025-01-09 18:50 UTC (permalink / raw) To: lsf-pc Cc: Joshua Hahn, linux-mm, linux-kernel, gourry, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team Hello everyone, I hope everyone has had a great start to 2025! Recently, I have been working on a patch series [1] with Gregory Price <gourry@gourry.net> that provides new default interleave weights, along with dynamic re-weighting on hotplug events and a series of UAPIs that allow users to configure how they want the defaults to behave. In introducing these new defaults, discussions have opened up in the community regarding how best to create a UAPI that can provide coherent and transparent interactions for the user. In particular, consider this scenario: when a hotplug event happens and a node comes online with new bandwidth information (thereby changing the bandwidth distributions across the system), should user-set weights be overwritten to reflect the new distributions? If so, how can we justify overwriting user-set values in a sysfs interface? If not, how will users manually adjust the node weights to the optimal weight? I would like to revisit some of the design choices made for this patch, including how the defaults were derived, and open the conversation to hear what the community believes is a reasonable way to allow users to tune weighted interleave weights. More broadly, I hope to gather community insight on how they use weighted interleave, and do my best to reflect those workflows in the patch. Of course, I would also love to hear your thoughts about this topic in this thread, or in the RFC thread (attached) as well. Have a great day! Joshua [1] https://lore.kernel.org/all/20241219191845.3506370-1-joshua.hahnjy@gmail.com/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-01-09 18:50 [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning Joshua Hahn @ 2025-03-13 15:57 ` Joshua Hahn 2025-03-14 10:08 ` Huang, Ying 2025-03-27 11:11 ` Oscar Salvador 1 sibling, 1 reply; 10+ messages in thread From: Joshua Hahn @ 2025-03-13 15:57 UTC (permalink / raw) To: Joshua Hahn Cc: lsf-pc, linux-mm, linux-kernel, gourry, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > Hello everyone, I hope everyone has had a great start to 2025! > > Recently, I have been working on a patch series [1] with > Gregory Price <gourry@gourry.net> that provides new default interleave > weights, along with dynamic re-weighting on hotplug events and a series > of UAPIs that allow users to configure how they want the defaults to behave. > > In introducing these new defaults, discussions have opened up in the > community regarding how best to create a UAPI that can provide > coherent and transparent interactions for the user. In particular, consider > this scenario: when a hotplug event happens and a node comes online > with new bandwidth information (and therefore changing the bandwidth > distributions across the system), should user-set weights be overwritten > to reflect the new distributions? If so, how can we justify overwriting > user-set values in a sysfs interface? If not, how will users manually > adjust the node weights to the optimal weight? > > I would like to revisit some of the design choices made for this patch, > including how the defaults were derived, and open the conversation to > hear what the community believes is a reasonable way to allow users to > tune weighted interleave weights. More broadly, I hope to get gather > community insight on how they use weighted interleave, and do my best to > reflect those workflows in the patch. Weighted interleave has since moved onto v7 [1], and a v8 is currently being drafted. Through feedback from reviewers, we have landed on a coherent UAPI that gives users two options: auto mode, which leaves all weight calculation decisions to the system, and manual mode, which leaves weighted interleave the same as it is without the patch. Given that the patch's functionality is mostly concrete and that the questions I hoped to raise during this slot were answered via patch feedback, I hope to ask another question during the talk: Should the system dynamically change what metrics it uses to weight the nodes, based on what bottlenecks the system is currently facing? In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic to determine what a node's weight should be. However, what if the system is not bottlenecked by bandwidth, but by latency? A system could also be bottlenecked by read bandwidth, but not by write bandwidth. Consider a scenario where a system has many memory nodes with varying latencies and bandwidths. When the system is not bottlenecked by bandwidth, it might prefer to allocate memory from nodes with lower latency. Once the system starts feeling pressured by bandwidth, the weights for high bandwidth (but also high latency) nodes would slowly increase to alleviate pressure from the system. Once the system is back in a manageable state, weights for low latency nodes would start increasing again. Users would not have to be aware of any of this -- they would just see the system take control of the weight changes as the system's needs continue to change. 
This proposal also has some concerns that need to be addressed: - How reactive should the system be, and how aggressively should it tune the weights? We don't want the system to overreact to short spikes in pressure. - Does dynamic weight adjusting lead to pages being "misplaced"? Should those "misplaced" pages be migrated? (probably not) - Does this need to be in the kernel? A userspace daemon that monitors kernel metrics has the ability to make the changes (via the nodeN interfaces). Thoughts & comments are appreciated! Thank you, and have a great day! Joshua [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 10+ messages in thread
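[Editor's note: as an aside for readers following the heuristic mentioned above, here is a minimal C sketch of deriving per-node weights from min(read_bandwidth, write_bandwidth). The bandwidth numbers are made-up examples and the GCD reduction is only illustrative; the actual auto-mode calculation in the patch series may normalize and clamp values differently.]

#include <stdio.h>

#define NR_NODES 2

/* Hypothetical per-node bandwidths in MB/s: {read_bw, write_bw}. */
static const unsigned int node_bw[NR_NODES][2] = {
	{ 120000, 100000 },	/* node 0: local DRAM (example numbers) */
	{  60000,  30000 },	/* node 1: CXL expander (example numbers) */
};

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	unsigned int eff[NR_NODES];
	unsigned int g = 0;

	for (int i = 0; i < NR_NODES; i++) {
		/* Weight each node by its more constrained direction. */
		eff[i] = node_bw[i][0] < node_bw[i][1] ? node_bw[i][0]
						       : node_bw[i][1];
		g = gcd(g, eff[i]);
	}

	/* Reduce to small integers suitable for sysfs nodeN weights. */
	for (int i = 0; i < NR_NODES; i++)
		printf("node%d: weight ~ %u\n", i, eff[i] / g);

	return 0;
}

With the example numbers above this prints a 10:3 DRAM-to-CXL ratio; the point is only the shape of the calculation, not the specific values.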
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-13 15:57 ` Joshua Hahn @ 2025-03-14 10:08 ` Huang, Ying 2025-03-14 14:15 ` Jonathan Cameron 2025-03-14 15:02 ` Joshua Hahn 0 siblings, 2 replies; 10+ messages in thread From: Huang, Ying @ 2025-03-14 10:08 UTC (permalink / raw) To: Joshua Hahn Cc: lsf-pc, linux-mm, linux-kernel, gourry, hyeonggon.yoo, honggyu.kim, kernel-team Joshua Hahn <joshua.hahnjy@gmail.com> writes: > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > >> Hello everyone, I hope everyone has had a great start to 2025! >> >> Recently, I have been working on a patch series [1] with >> Gregory Price <gourry@gourry.net> that provides new default interleave >> weights, along with dynamic re-weighting on hotplug events and a series >> of UAPIs that allow users to configure how they want the defaults to behave. >> >> In introducing these new defaults, discussions have opened up in the >> community regarding how best to create a UAPI that can provide >> coherent and transparent interactions for the user. In particular, consider >> this scenario: when a hotplug event happens and a node comes online >> with new bandwidth information (and therefore changing the bandwidth >> distributions across the system), should user-set weights be overwritten >> to reflect the new distributions? If so, how can we justify overwriting >> user-set values in a sysfs interface? If not, how will users manually >> adjust the node weights to the optimal weight? >> >> I would like to revisit some of the design choices made for this patch, >> including how the defaults were derived, and open the conversation to >> hear what the community believes is a reasonable way to allow users to >> tune weighted interleave weights. More broadly, I hope to get gather >> community insight on how they use weighted interleave, and do my best to >> reflect those workflows in the patch. > > Weighted interleave has since moved onto v7 [1], and a v8 is currently being > drafted. Through feedback from reviewers, we have landed on a coherent UAPI > that gives users two options: auto mode, which leaves all weight calculation > decisions to the system, and manual mode, which leaves weighted interleave > the same as it is without the patch. > > Given that the patch's functionality is mostly concrete and that the questions > I hoped to raise during this slot were answered via patch feedback, I hope to > ask another question during the talk: > > Should the system dynamically change what metrics it uses to weight the nodes, > based on what bottlenecks the system is currently facing? > > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic > to determine what a node's weight should be. However, what if the system is > not bottlenecked by bandwidth, but by latency? A system could also be > bottlenecked by read bandwidth, but not by write bandwidth. > > Consider a scenario where a system has many memory nodes with varying > latencies and bandwidths. When the system is not bottlenecked by bandwidth, > it might prefer to allocate memory from nodes with lower latency. Once the > system starts feeling pressured by bandwidth, the weights for high bandwidth > (but also high latency) nodes would slowly increase to alleviate pressure > from the system. Once the system is back in a manageable state, weights for > low latency nodes would start increasing again. 
Users would not have to be > aware of any of this -- they would just see the system take control of the > weight changes as the system's needs continue to change. IIUC, this assumes the capacity of all kinds of memory is large enough. However, this may not be true in some cases. So, another possibility is that, for a system with DRAM and CXL memory nodes: - There is free space on DRAM node, the bandwidth of DRAM node isn't saturated, memory is allocated on DRAM node. - There is no free space on DRAM node, the bandwidth of DRAM node isn't saturated, cold pages are migrated to CXL memory nodes, while hot pages are migrated to DRAM memory nodes. - The bandwidth of DRAM node is saturated, hot pages are migrated to CXL memory nodes. In general, I think that the real situation is complex and this makes it hard to implement a good policy in the kernel. So, I suspect that it's better to start with experiments in user space. > This proposal also has some concerns that need to be addressed: > - How reactive should the system be, and how aggressively should it tune the > weights? We don't want the system to overreact to short spikes in pressure. > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those > "misplaced" pages be migrated? (probably not) > - Does this need to be in the kernel? A userspace daemon that monitors kernel > metrics has the ability to make the changes (via the nodeN interfaces). > > Thoughts & comments are appreciated! Thank you, and have a great day! > Joshua > > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ > > Sent using hkml (https://github.com/sjp38/hackermail) --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 10+ messages in thread
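[Editor's note: to make the userspace experiment Ying suggests more concrete, below is a rough daemon skeleton for the three states described above. The two metric helpers are hypothetical placeholders, not existing kernel interfaces; a real experiment might read node free space from /sys/devices/system/node/nodeN/meminfo and bandwidth saturation from perf or resctrl counters, and would trigger real demotion/promotion instead of printing.]

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder metrics (hypothetical); see the note above for where a real
 * daemon might actually read these from. */
static bool dram_has_free_space(void)      { return true; }
static bool dram_bandwidth_saturated(void) { return false; }

int main(void)
{
	for (;;) {
		if (dram_bandwidth_saturated()) {
			/* 3. DRAM bandwidth saturated: push some hot pages to CXL. */
			puts("migrate hot pages to CXL nodes");
		} else if (!dram_has_free_space()) {
			/* 2. DRAM full but not saturated: demote cold, promote hot. */
			puts("demote cold pages to CXL, promote hot pages to DRAM");
		} else {
			/* 1. DRAM has space and headroom: keep allocating on DRAM. */
			puts("prefer the DRAM node for new allocations");
		}
		sleep(5);	/* poll interval: reactivity vs. overhead trade-off */
	}
}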
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-14 10:08 ` Huang, Ying @ 2025-03-14 14:15 ` Jonathan Cameron 2025-03-14 14:53 ` Gregory Price 2025-03-14 15:11 ` Joshua Hahn 2025-03-14 15:02 ` Joshua Hahn 1 sibling, 2 replies; 10+ messages in thread From: Jonathan Cameron @ 2025-03-14 14:15 UTC (permalink / raw) To: Huang, Ying Cc: Joshua Hahn, lsf-pc, linux-mm, linux-kernel, gourry, hyeonggon.yoo, honggyu.kim, kernel-team On Fri, 14 Mar 2025 18:08:35 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote: > Joshua Hahn <joshua.hahnjy@gmail.com> writes: > > > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > > > >> Hello everyone, I hope everyone has had a great start to 2025! > >> > >> Recently, I have been working on a patch series [1] with > >> Gregory Price <gourry@gourry.net> that provides new default interleave > >> weights, along with dynamic re-weighting on hotplug events and a series > >> of UAPIs that allow users to configure how they want the defaults to behave. > >> > >> In introducing these new defaults, discussions have opened up in the > >> community regarding how best to create a UAPI that can provide > >> coherent and transparent interactions for the user. In particular, consider > >> this scenario: when a hotplug event happens and a node comes online > >> with new bandwidth information (and therefore changing the bandwidth > >> distributions across the system), should user-set weights be overwritten > >> to reflect the new distributions? If so, how can we justify overwriting > >> user-set values in a sysfs interface? If not, how will users manually > >> adjust the node weights to the optimal weight? > >> > >> I would like to revisit some of the design choices made for this patch, > >> including how the defaults were derived, and open the conversation to > >> hear what the community believes is a reasonable way to allow users to > >> tune weighted interleave weights. More broadly, I hope to get gather > >> community insight on how they use weighted interleave, and do my best to > >> reflect those workflows in the patch. > > > > Weighted interleave has since moved onto v7 [1], and a v8 is currently being > > drafted. Through feedback from reviewers, we have landed on a coherent UAPI > > that gives users two options: auto mode, which leaves all weight calculation > > decisions to the system, and manual mode, which leaves weighted interleave > > the same as it is without the patch. > > > > Given that the patch's functionality is mostly concrete and that the questions > > I hoped to raise during this slot were answered via patch feedback, I hope to > > ask another question during the talk: > > > > Should the system dynamically change what metrics it uses to weight the nodes, > > based on what bottlenecks the system is currently facing? > > > > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic > > to determine what a node's weight should be. However, what if the system is > > not bottlenecked by bandwidth, but by latency? A system could also be > > bottlenecked by read bandwidth, but not by write bandwidth. > > > > Consider a scenario where a system has many memory nodes with varying > > latencies and bandwidths. When the system is not bottlenecked by bandwidth, > > it might prefer to allocate memory from nodes with lower latency. 
Once the > > system starts feeling pressured by bandwidth, the weights for high bandwidth > > (but also high latency) nodes would slowly increase to alleviate pressure > > from the system. Once the system is back in a manageable state, weights for > > low latency nodes would start increasing again. Users would not have to be > > aware of any of this -- they would just see the system take control of the > > weight changes as the system's needs continue to change. > > IIUC, this assumes the capacity of all kinds of memory is large enough. > However, this may be not true in some cases. So, another possibility is > that, for a system with DRAM and CXL memory nodes. > > - There is free space on DRAM node, the bandwidth of DRAM node isn't > saturated, memory is allocated on DRAM node. > > - There is no free space on DRAM node, the bandwidth of DRAM node isn't > saturated, cold pages are migrated to CXL memory nodes, while hot > pages are migrated to DRAM memory nodes. > > - The bandwidth of DRAM node is saturated, hot pages are migrated to CXL > memory nodes. > > In general, I think that the real situation is complex and this makes it > hard to implement a good policy in kernel. So, I suspect that it's > better to start with the experiments in user space. > > > This proposal also has some concerns that need to be addressed: > > - How reactive should the system be, and how aggressively should it tune the > > weights? We don't want the system to overreact to short spikes in pressure. > > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those > > "misplaced" pages be migrated? (probably not) > > - Does this need to be in the kernel? A userspace daemon that monitors kernel > > metrics has the ability to make the changes (via the nodeN interfaces). If this was done in kernel, what metrics would make sense to drive this? Similar to hot page tracking we may run into contention with PMUs or similar and their other use cases. > > > > Thoughts & comments are appreciated! Thank you, and have a great day! > > Joshua > > > > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ > > > > Sent using hkml (https://github.com/sjp38/hackermail) > > --- > Best Regards, > Huang, Ying > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-14 14:15 ` Jonathan Cameron @ 2025-03-14 14:53 ` Gregory Price 2025-03-14 15:11 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Gregory Price @ 2025-03-14 14:53 UTC (permalink / raw) To: Jonathan Cameron Cc: Huang, Ying, Joshua Hahn, lsf-pc, linux-mm, linux-kernel, hyeonggon.yoo, honggyu.kim, kernel-team On Fri, Mar 14, 2025 at 02:15:41PM +0000, Jonathan Cameron wrote: > > > - Does this need to be in the kernel? A userspace daemon that monitors kernel > > > metrics has the ability to make the changes (via the nodeN interfaces). > > If this was done in kernel, what metrics would make sense to drive this? > Similar to hot page tracking we may run into contention with PMUs or similar and > their other use cases. > Rather than directly affecting weighted interleave, I think this stemmed from the idea of a "smart policy" that adjusted allocations based on bandwidth pressure and VMA permissions (code should be local, stack should be local, heap could be interleaved - etc). An example would be if DRAM bandwidth became pressured but CXL wasn't, then maybe tossing some extra allocations directly to CXL would actually decrease average latencies. I'm not sure how we'd actually implement this in userland, and I think this is ultimately MPOL_PONIES, but it's an interesting exploration. Some of this context was lost as we worked on weighted interleave auto-tuning. ~Gregory ^ permalink raw reply [flat|nested] 10+ messages in thread
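[Editor's note: a toy illustration of the "smart policy" idea Gregory sketches, assuming a made-up mapping_kind enum rather than real VMA flags. It only shows the decision shape (code and stack stay local, heap spills to interleaving under bandwidth pressure) and is not an existing mempolicy.]

#include <stdbool.h>
#include <stdio.h>

enum mapping_kind { MAP_CODE, MAP_STACK, MAP_HEAP };	/* illustrative only */
enum placement    { PLACE_LOCAL, PLACE_INTERLEAVED };

static enum placement pick_placement(enum mapping_kind kind, bool local_bw_pressured)
{
	switch (kind) {
	case MAP_CODE:
	case MAP_STACK:
		return PLACE_LOCAL;	/* latency-sensitive: keep local */
	case MAP_HEAP:
	default:
		/* Spill capacity/bandwidth-heavy data only under pressure. */
		return local_bw_pressured ? PLACE_INTERLEAVED : PLACE_LOCAL;
	}
}

int main(void)
{
	printf("heap, local bandwidth pressured -> %s\n",
	       pick_placement(MAP_HEAP, true) == PLACE_INTERLEAVED ?
	       "interleave across nodes" : "allocate locally");
	return 0;
}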
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-14 14:15 ` Jonathan Cameron 2025-03-14 14:53 ` Gregory Price @ 2025-03-14 15:11 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Joshua Hahn @ 2025-03-14 15:11 UTC (permalink / raw) To: Jonathan Cameron Cc: Huang, Ying, lsf-pc, linux-mm, linux-kernel, gourry, hyeonggon.yoo, honggyu.kim, kernel-team On Fri, 14 Mar 2025 14:15:41 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > On Fri, 14 Mar 2025 18:08:35 +0800 > "Huang, Ying" <ying.huang@linux.alibaba.com> wrote: > > > Joshua Hahn <joshua.hahnjy@gmail.com> writes: > > > > > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > > > > > >> Hello everyone, I hope everyone has had a great start to 2025! > > >> > > >> Recently, I have been working on a patch series [1] with > > >> Gregory Price <gourry@gourry.net> that provides new default interleave > > >> weights, along with dynamic re-weighting on hotplug events and a series > > >> of UAPIs that allow users to configure how they want the defaults to behave. > > >> > > >> In introducing these new defaults, discussions have opened up in the > > >> community regarding how best to create a UAPI that can provide > > >> coherent and transparent interactions for the user. In particular, consider > > >> this scenario: when a hotplug event happens and a node comes online > > >> with new bandwidth information (and therefore changing the bandwidth > > >> distributions across the system), should user-set weights be overwritten > > >> to reflect the new distributions? If so, how can we justify overwriting > > >> user-set values in a sysfs interface? If not, how will users manually > > >> adjust the node weights to the optimal weight? > > >> > > >> I would like to revisit some of the design choices made for this patch, > > >> including how the defaults were derived, and open the conversation to > > >> hear what the community believes is a reasonable way to allow users to > > >> tune weighted interleave weights. More broadly, I hope to get gather > > >> community insight on how they use weighted interleave, and do my best to > > >> reflect those workflows in the patch. > > > > > > Weighted interleave has since moved onto v7 [1], and a v8 is currently being > > > drafted. Through feedback from reviewers, we have landed on a coherent UAPI > > > that gives users two options: auto mode, which leaves all weight calculation > > > decisions to the system, and manual mode, which leaves weighted interleave > > > the same as it is without the patch. > > > > > > Given that the patch's functionality is mostly concrete and that the questions > > > I hoped to raise during this slot were answered via patch feedback, I hope to > > > ask another question during the talk: > > > > > > Should the system dynamically change what metrics it uses to weight the nodes, > > > based on what bottlenecks the system is currently facing? > > > > > > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic > > > to determine what a node's weight should be. However, what if the system is > > > not bottlenecked by bandwidth, but by latency? A system could also be > > > bottlenecked by read bandwidth, but not by write bandwidth. > > > > > > Consider a scenario where a system has many memory nodes with varying > > > latencies and bandwidths. When the system is not bottlenecked by bandwidth, > > > it might prefer to allocate memory from nodes with lower latency. 
Once the > > > system starts feeling pressured by bandwidth, the weights for high bandwidth > > > (but also high latency) nodes would slowly increase to alleviate pressure > > > from the system. Once the system is back in a manageable state, weights for > > > low latency nodes would start increasing again. Users would not have to be > > > aware of any of this -- they would just see the system take control of the > > > weight changes as the system's needs continue to change. > > > > IIUC, this assumes the capacity of all kinds of memory is large enough. > > However, this may be not true in some cases. So, another possibility is > > that, for a system with DRAM and CXL memory nodes. > > > > - There is free space on DRAM node, the bandwidth of DRAM node isn't > > saturated, memory is allocated on DRAM node. > > > > - There is no free space on DRAM node, the bandwidth of DRAM node isn't > > saturated, cold pages are migrated to CXL memory nodes, while hot > > pages are migrated to DRAM memory nodes. > > > > - The bandwidth of DRAM node is saturated, hot pages are migrated to CXL > > memory nodes. > > > > In general, I think that the real situation is complex and this makes it > > hard to implement a good policy in kernel. So, I suspect that it's > > better to start with the experiments in user space. > > > > > This proposal also has some concerns that need to be addressed: > > > - How reactive should the system be, and how aggressively should it tune the > > > weights? We don't want the system to overreact to short spikes in pressure. > > > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those > > > "misplaced" pages be migrated? (probably not) > > > - Does this need to be in the kernel? A userspace daemon that monitors kernel > > > metrics has the ability to make the changes (via the nodeN interfaces). > > If this was done in kernel, what metrics would make sense to drive this? > Similar to hot page tracking we may run into contention with PMUs or similar and > their other use cases. Hello Jonathan, thank you for your interest in this proposal! Yes, I think you and Ying both bring up great points about how this is probably something more suitable for a userspace program. Userspace probably has more information about the characteristics of the workload, and I agree with your point about contention. If the kernel thread doesn't probe frequently, then it would be making poor allocation decisions based on stale data, but if it does probe frequently, it would incur lots of overhead from the contention (and make other contending threads slower as well). Not to mention, there is also the overhead of probing itself : -) I will keep thinking about these questions, and see if I can come up with any interesting ideas to discuss during LSFMMBPF. Thank you again for your interest, I hope you have a great day! Joshua > > > > > > Thoughts & comments are appreciated! Thank you, and have a great day! > > > Joshua > > > > > > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ > > > > > > Sent using hkml (https://github.com/sjp38/hackermail) > > > > --- > > Best Regards, > > Huang, Ying > > Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 10+ messages in thread
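[Editor's note: for completeness, a small sketch of the userspace-daemon route discussed here, writing weights through the existing weighted_interleave nodeN sysfs files. The node numbers and the 4:1 ratio are arbitrary examples; a real daemon would derive them from whatever pressure metric it monitors.]

#include <stdio.h>

/* Write a weight to the existing weighted_interleave sysfs interface. */
static int set_node_weight(int node, unsigned int weight)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", node);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%u\n", weight);
	return fclose(f);
}

int main(void)
{
	/* Example only: bias node 0 (DRAM) 4:1 over node 1 (CXL). */
	if (set_node_weight(0, 4) || set_node_weight(1, 1))
		perror("set_node_weight");
	return 0;
}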
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-14 10:08 ` Huang, Ying 2025-03-14 14:15 ` Jonathan Cameron @ 2025-03-14 15:02 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Joshua Hahn @ 2025-03-14 15:02 UTC (permalink / raw) To: Huang, Ying Cc: lsf-pc, linux-mm, linux-kernel, gourry, hyeonggon.yoo, honggyu.kim, kernel-team On Fri, 14 Mar 2025 18:08:35 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote: > Joshua Hahn <joshua.hahnjy@gmail.com> writes: > > > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > > > >> Hello everyone, I hope everyone has had a great start to 2025! > >> > >> Recently, I have been working on a patch series [1] with > >> Gregory Price <gourry@gourry.net> that provides new default interleave > >> weights, along with dynamic re-weighting on hotplug events and a series > >> of UAPIs that allow users to configure how they want the defaults to behave. > >> > >> In introducing these new defaults, discussions have opened up in the > >> community regarding how best to create a UAPI that can provide > >> coherent and transparent interactions for the user. In particular, consider > >> this scenario: when a hotplug event happens and a node comes online > >> with new bandwidth information (and therefore changing the bandwidth > >> distributions across the system), should user-set weights be overwritten > >> to reflect the new distributions? If so, how can we justify overwriting > >> user-set values in a sysfs interface? If not, how will users manually > >> adjust the node weights to the optimal weight? > >> > >> I would like to revisit some of the design choices made for this patch, > >> including how the defaults were derived, and open the conversation to > >> hear what the community believes is a reasonable way to allow users to > >> tune weighted interleave weights. More broadly, I hope to get gather > >> community insight on how they use weighted interleave, and do my best to > >> reflect those workflows in the patch. > > > > Weighted interleave has since moved onto v7 [1], and a v8 is currently being > > drafted. Through feedback from reviewers, we have landed on a coherent UAPI > > that gives users two options: auto mode, which leaves all weight calculation > > decisions to the system, and manual mode, which leaves weighted interleave > > the same as it is without the patch. > > > > Given that the patch's functionality is mostly concrete and that the questions > > I hoped to raise during this slot were answered via patch feedback, I hope to > > ask another question during the talk: > > > > Should the system dynamically change what metrics it uses to weight the nodes, > > based on what bottlenecks the system is currently facing? > > > > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic > > to determine what a node's weight should be. However, what if the system is > > not bottlenecked by bandwidth, but by latency? A system could also be > > bottlenecked by read bandwidth, but not by write bandwidth. > > > > Consider a scenario where a system has many memory nodes with varying > > latencies and bandwidths. When the system is not bottlenecked by bandwidth, > > it might prefer to allocate memory from nodes with lower latency. Once the > > system starts feeling pressured by bandwidth, the weights for high bandwidth > > (but also high latency) nodes would slowly increase to alleviate pressure > > from the system. 
Once the system is back in a manageable state, weights for > > low latency nodes would start increasing again. Users would not have to be > > aware of any of this -- they would just see the system take control of the > > weight changes as the system's needs continue to change. > > IIUC, this assumes the capacity of all kinds of memory is large enough. > However, this may be not true in some cases. So, another possibility is > that, for a system with DRAM and CXL memory nodes. > > - There is free space on DRAM node, the bandwidth of DRAM node isn't > saturated, memory is allocated on DRAM node. > > - There is no free space on DRAM node, the bandwidth of DRAM node isn't > saturated, cold pages are migrated to CXL memory nodes, while hot > pages are migrated to DRAM memory nodes. > > - The bandwidth of DRAM node is saturated, hot pages are migrated to CXL > memory nodes. > > In general, I think that the real situation is complex and this makes it > hard to implement a good policy in kernel. So, I suspect that it's > better to start with the experiments in user space. Hi Ying, thank you so much for your feedback, as always! Yes, I agree. I brought up this idea out of curiosity, since I thought that there might be room to experiment with different configurations for weighted interleave auto-tuning. As you know, we use min(read_bw, write_bw), which I think is a good heuristic that works for the intent of the weighted interleave auto-tuning patch-- I wanted to know what a system might look like, that might use different heuristics given the system's state. But I think you are right that it is difficult to implement in kernel. Thanks again, Ying! Will you be attending LSFMMBPF this year? I would love to say hello in person : -) Have a great day! Joshua > > This proposal also has some concerns that need to be addressed: > > - How reactive should the system be, and how aggressively should it tune the > > weights? We don't want the system to overreact to short spikes in pressure. > > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those > > "misplaced" pages be migrated? (probably not) > > - Does this need to be in the kernel? A userspace daemon that monitors kernel > > metrics has the ability to make the changes (via the nodeN interfaces). > > > > Thoughts & comments are appreciated! Thank you, and have a great day! > > Joshua > > > > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ > > > > Sent using hkml (https://github.com/sjp38/hackermail) > > --- > Best Regards, > Huang, Ying Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-01-09 18:50 [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning Joshua Hahn 2025-03-13 15:57 ` Joshua Hahn @ 2025-03-27 11:11 ` Oscar Salvador 2025-03-27 12:39 ` Gregory Price 2025-03-27 15:46 ` Joshua Hahn 1 sibling, 2 replies; 10+ messages in thread From: Oscar Salvador @ 2025-03-27 11:11 UTC (permalink / raw) To: Joshua Hahn Cc: lsf-pc, linux-mm, linux-kernel, gourry, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team On Thu, Jan 09, 2025 at 01:50:48PM -0500, Joshua Hahn wrote: > Hello everyone, I hope everyone has had a great start to 2025! Hi Joshua, as discussed at LSFMM about how you can react to nodes becoming memory{aware,less}, you can register a hotplug memory notifier, as memory-tiering currently does. The current use of the hotplug memory notifier by some consumers (e.g.: memory-tiering, slub, etc) is a bit suboptimal, as they only care about nodes changing their memory state, yet they get notified for every {online,offline}_pages operation. I came up with [1]. I did not publish it yet upstream because I wanted to discuss it a bit with David, but you can give it a try to see if it works for you. But till it is upstream, you will have to use the hotplug memory notifier. [1] https://github.com/leberus/linux.git numa-node-notifier -- Oscar Salvador SUSE Labs ^ permalink raw reply [flat|nested] 10+ messages in thread
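[Editor's note: for reference, a hedged kernel-side sketch of the notifier pattern Oscar describes, modelled on how memory-tiering consumes hotplug events today: it filters on status_change_nid so the policy code only reacts when a node actually gains or loses memory, ignoring other {online,offline}_pages operations. mempolicy_recompute_weights() is a hypothetical placeholder for the weighted-interleave rebalancing, not an existing function.]

#include <linux/memory.h>
#include <linux/module.h>
#include <linux/notifier.h>

static int wi_memory_callback(struct notifier_block *nb,
			      unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;

	/* No node gained or lost memory; nothing to re-weight. */
	if (mn->status_change_nid < 0)
		return NOTIFY_OK;

	switch (action) {
	case MEM_ONLINE:
	case MEM_OFFLINE:
		/* Hypothetical hook: recompute defaults from new bandwidths. */
		/* mempolicy_recompute_weights(mn->status_change_nid); */
		break;
	default:
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block wi_memory_nb = {
	.notifier_call = wi_memory_callback,
};

static int __init wi_notifier_init(void)
{
	return register_memory_notifier(&wi_memory_nb);
}
module_init(wi_notifier_init);

MODULE_LICENSE("GPL");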
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-27 11:11 ` Oscar Salvador @ 2025-03-27 12:39 ` Gregory Price 2025-03-27 15:46 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Gregory Price @ 2025-03-27 12:39 UTC (permalink / raw) To: Oscar Salvador Cc: Joshua Hahn, lsf-pc, linux-mm, linux-kernel, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team, yunjeong.mun, rakie.kim On Thu, Mar 27, 2025 at 12:11:55PM +0100, Oscar Salvador wrote: > Hi Joshua, > > as discussed in the LSFMM about how you can react to nodes becoming > memory{aware,less}, you can register a hotplug memory notifier, as > memory-tiering currently does. > > The current use of the hotplug memory notifier by some consumers (e.g: > memory-tiering, slub, etc) is a bit suboptimal, as they only care about > nodes changing its memory state, yet they get notified for every > {online,offline}_pages operation. > > I came up with [1] > > I did not publish it yet upstream because I wanted to discuss it a bit > with David, but you can give it a try to see if it works for you. > But till it is upstream, you will have to use the hotplug memory > notifier. > > [1] https://github.com/leberus/linux.git numa-node-notifier > +CC: Yunjeong Mun and Rakie Kim Something to consider as a follow up to your series. Thanks Oscar, we were just discussing this. Seems there's multiple users doing the same thing, so it seems reasonable to discuss. This would probably deal with my race condition concerns here as well: https://lore.kernel.org/linux-mm/20250325102804.1020-1-rakie.kim@sk.com/ > -- > Oscar Salvador > SUSE Labs ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-27 11:11 ` Oscar Salvador 2025-03-27 12:39 ` Gregory Price @ 2025-03-27 15:46 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Joshua Hahn @ 2025-03-27 15:46 UTC (permalink / raw) To: Oscar Salvador Cc: lsf-pc, linux-mm, linux-kernel, gourry, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team On Thu, 27 Mar 2025 12:11:55 +0100 Oscar Salvador <osalvador@suse.de> wrote: > On Thu, Jan 09, 2025 at 01:50:48PM -0500, Joshua Hahn wrote: > > Hello everyone, I hope everyone has had a great start to 2025! > > Hi Joshua, > > as discussed in the LSFMM about how you can react to nodes becoming > memory{aware,less}, you can register a hotplug memory notifier, as > memory-tiering currently does. > > The current use of the hotplug memory notifier by some consumers (e.g: > memory-tiering, slub, etc) is a bit suboptimal, as they only care about > nodes changing its memory state, yet they get notified for every > {online,offline}_pages operation. > > I came up with [1] > > I did not publish it yet upstream because I wanted to discuss it a bit > with David, but you can give it a try to see if it works for you. > But till it is upstream, you will have to use the hotplug memory > notifier. Hi Oscar, This is great, thank you for taking a look at this. I'll also do some experimenting on my end and let you know with any results that I see from my end, using weighted interleave. In the meantime yes -- I think the hotplug memory notifier should do the trick. Thank you again, have a great day! Joshua > [1] https://github.com/leberus/linux.git numa-node-notifier > > -- > Oscar Salvador > SUSE Labs Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 10+ messages in thread