* [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning @ 2025-01-09 18:50 Joshua Hahn 2025-03-13 15:57 ` Joshua Hahn 2025-03-27 11:11 ` Oscar Salvador 0 siblings, 2 replies; 10+ messages in thread From: Joshua Hahn @ 2025-01-09 18:50 UTC (permalink / raw) To: lsf-pc Cc: Joshua Hahn, linux-mm, linux-kernel, gourry, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team Hello everyone, I hope everyone has had a great start to 2025! Recently, I have been working on a patch series [1] with Gregory Price <gourry@gourry.net> that provides new default interleave weights, along with dynamic re-weighting on hotplug events and a series of UAPIs that allow users to configure how they want the defaults to behave. In introducing these new defaults, discussions have opened up in the community regarding how best to create a UAPI that can provide coherent and transparent interactions for the user. In particular, consider this scenario: when a hotplug event happens and a node comes online with new bandwidth information (thereby changing the bandwidth distributions across the system), should user-set weights be overwritten to reflect the new distributions? If so, how can we justify overwriting user-set values in a sysfs interface? If not, how will users manually adjust the node weights to the optimal weight? I would like to revisit some of the design choices made for this patch, including how the defaults were derived, and open the conversation to hear what the community believes is a reasonable way to allow users to tune weighted interleave weights. More broadly, I hope to gather community insight on how they use weighted interleave, and do my best to reflect those workflows in the patch. Of course, I would also love to hear your thoughts about this topic in this thread, or in the RFC thread (attached) as well. Have a great day! Joshua [1] https://lore.kernel.org/all/20241219191845.3506370-1-joshua.hahnjy@gmail.com/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-01-09 18:50 [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning Joshua Hahn @ 2025-03-13 15:57 ` Joshua Hahn 2025-03-14 10:08 ` Huang, Ying 2025-03-27 11:11 ` Oscar Salvador 1 sibling, 1 reply; 10+ messages in thread From: Joshua Hahn @ 2025-03-13 15:57 UTC (permalink / raw) To: Joshua Hahn Cc: lsf-pc, linux-mm, linux-kernel, gourry, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > Hello everyone, I hope everyone has had a great start to 2025! > > Recently, I have been working on a patch series [1] with > Gregory Price <gourry@gourry.net> that provides new default interleave > weights, along with dynamic re-weighting on hotplug events and a series > of UAPIs that allow users to configure how they want the defaults to behave. > > In introducing these new defaults, discussions have opened up in the > community regarding how best to create a UAPI that can provide > coherent and transparent interactions for the user. In particular, consider > this scenario: when a hotplug event happens and a node comes online > with new bandwidth information (and therefore changing the bandwidth > distributions across the system), should user-set weights be overwritten > to reflect the new distributions? If so, how can we justify overwriting > user-set values in a sysfs interface? If not, how will users manually > adjust the node weights to the optimal weight? > > I would like to revisit some of the design choices made for this patch, > including how the defaults were derived, and open the conversation to > hear what the community believes is a reasonable way to allow users to > tune weighted interleave weights. More broadly, I hope to get gather > community insight on how they use weighted interleave, and do my best to > reflect those workflows in the patch. Weighted interleave has since moved onto v7 [1], and a v8 is currently being drafted. Through feedback from reviewers, we have landed on a coherent UAPI that gives users two options: auto mode, which leaves all weight calculation decisions to the system, and manual mode, which leaves weighted interleave the same as it is without the patch. Given that the patch's functionality is mostly concrete and that the questions I hoped to raise during this slot were answered via patch feedback, I hope to ask another question during the talk: Should the system dynamically change what metrics it uses to weight the nodes, based on what bottlenecks the system is currently facing? In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic to determine what a node's weight should be. However, what if the system is not bottlenecked by bandwidth, but by latency? A system could also be bottlenecked by read bandwidth, but not by write bandwidth. Consider a scenario where a system has many memory nodes with varying latencies and bandwidths. When the system is not bottlenecked by bandwidth, it might prefer to allocate memory from nodes with lower latency. Once the system starts feeling pressured by bandwidth, the weights for high bandwidth (but also high latency) nodes would slowly increase to alleviate pressure from the system. Once the system is back in a manageable state, weights for low latency nodes would start increasing again. Users would not have to be aware of any of this -- they would just see the system take control of the weight changes as the system's needs continue to change. 
This proposal also has some concerns that need to be addressed: - How reactive should the system be, and how aggressively should it tune the weights? We don't want the system to overreact to short spikes in pressure. - Does dynamic weight adjusting lead to pages being "misplaced"? Should those "misplaced" pages be migrated? (probably not) - Does this need to be in the kernel? A userspace daemon that monitors kernel metrics has the ability to make the changes (via the nodeN interfaces). Thoughts & comments are appreciated! Thank you, and have a great day! Joshua [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 10+ messages in thread
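[Editor's note: as an aside for readers following the heuristic mentioned above, here is a minimal C sketch of deriving per-node weights from min(read_bandwidth, write_bandwidth). The bandwidth numbers are made-up examples and the GCD reduction is only illustrative; the actual auto-mode calculation in the patch series may normalize and clamp values differently.]

#include <stdio.h>

#define NR_NODES 2

/* Hypothetical per-node bandwidths in MB/s: {read_bw, write_bw}. */
static const unsigned int node_bw[NR_NODES][2] = {
	{ 120000, 100000 },	/* node 0: local DRAM (example numbers) */
	{  60000,  30000 },	/* node 1: CXL expander (example numbers) */
};

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	unsigned int eff[NR_NODES];
	unsigned int g = 0;

	for (int i = 0; i < NR_NODES; i++) {
		/* Weight each node by its more constrained direction. */
		eff[i] = node_bw[i][0] < node_bw[i][1] ? node_bw[i][0]
						       : node_bw[i][1];
		g = gcd(g, eff[i]);
	}

	/* Reduce to small integers suitable for sysfs nodeN weights. */
	for (int i = 0; i < NR_NODES; i++)
		printf("node%d: weight ~ %u\n", i, eff[i] / g);

	return 0;
}

With the example numbers above this prints a 10:3 DRAM-to-CXL ratio; the point is only the shape of the calculation, not the specific values.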
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-13 15:57 ` Joshua Hahn @ 2025-03-14 10:08 ` Huang, Ying 2025-03-14 14:15 ` Jonathan Cameron 2025-03-14 15:02 ` Joshua Hahn 0 siblings, 2 replies; 10+ messages in thread From: Huang, Ying @ 2025-03-14 10:08 UTC (permalink / raw) To: Joshua Hahn Cc: lsf-pc, linux-mm, linux-kernel, gourry, hyeonggon.yoo, honggyu.kim, kernel-team Joshua Hahn <joshua.hahnjy@gmail.com> writes: > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > >> Hello everyone, I hope everyone has had a great start to 2025! >> >> Recently, I have been working on a patch series [1] with >> Gregory Price <gourry@gourry.net> that provides new default interleave >> weights, along with dynamic re-weighting on hotplug events and a series >> of UAPIs that allow users to configure how they want the defaults to behave. >> >> In introducing these new defaults, discussions have opened up in the >> community regarding how best to create a UAPI that can provide >> coherent and transparent interactions for the user. In particular, consider >> this scenario: when a hotplug event happens and a node comes online >> with new bandwidth information (and therefore changing the bandwidth >> distributions across the system), should user-set weights be overwritten >> to reflect the new distributions? If so, how can we justify overwriting >> user-set values in a sysfs interface? If not, how will users manually >> adjust the node weights to the optimal weight? >> >> I would like to revisit some of the design choices made for this patch, >> including how the defaults were derived, and open the conversation to >> hear what the community believes is a reasonable way to allow users to >> tune weighted interleave weights. More broadly, I hope to get gather >> community insight on how they use weighted interleave, and do my best to >> reflect those workflows in the patch. > > Weighted interleave has since moved onto v7 [1], and a v8 is currently being > drafted. Through feedback from reviewers, we have landed on a coherent UAPI > that gives users two options: auto mode, which leaves all weight calculation > decisions to the system, and manual mode, which leaves weighted interleave > the same as it is without the patch. > > Given that the patch's functionality is mostly concrete and that the questions > I hoped to raise during this slot were answered via patch feedback, I hope to > ask another question during the talk: > > Should the system dynamically change what metrics it uses to weight the nodes, > based on what bottlenecks the system is currently facing? > > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic > to determine what a node's weight should be. However, what if the system is > not bottlenecked by bandwidth, but by latency? A system could also be > bottlenecked by read bandwidth, but not by write bandwidth. > > Consider a scenario where a system has many memory nodes with varying > latencies and bandwidths. When the system is not bottlenecked by bandwidth, > it might prefer to allocate memory from nodes with lower latency. Once the > system starts feeling pressured by bandwidth, the weights for high bandwidth > (but also high latency) nodes would slowly increase to alleviate pressure > from the system. Once the system is back in a manageable state, weights for > low latency nodes would start increasing again. 
Users would not have to be > aware of any of this -- they would just see the system take control of the > weight changes as the system's needs continue to change. IIUC, this assumes the capacity of all kinds of memory is large enough. However, this may not be true in some cases. So, another possibility is that, for a system with DRAM and CXL memory nodes: - There is free space on DRAM node, the bandwidth of DRAM node isn't saturated, memory is allocated on DRAM node. - There is no free space on DRAM node, the bandwidth of DRAM node isn't saturated, cold pages are migrated to CXL memory nodes, while hot pages are migrated to DRAM memory nodes. - The bandwidth of DRAM node is saturated, hot pages are migrated to CXL memory nodes. In general, I think that the real situation is complex and this makes it hard to implement a good policy in the kernel. So, I suspect that it's better to start with experiments in user space. > This proposal also has some concerns that need to be addressed: > - How reactive should the system be, and how aggressively should it tune the > weights? We don't want the system to overreact to short spikes in pressure. > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those > "misplaced" pages be migrated? (probably not) > - Does this need to be in the kernel? A userspace daemon that monitors kernel > metrics has the ability to make the changes (via the nodeN interfaces). > > Thoughts & comments are appreciated! Thank you, and have a great day! > Joshua > > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ > > Sent using hkml (https://github.com/sjp38/hackermail) --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 10+ messages in thread
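[Editor's note: to make the userspace experiment Ying suggests more concrete, below is a rough daemon skeleton for the three states described above. The two metric helpers are hypothetical placeholders, not existing kernel interfaces; a real experiment might read node free space from /sys/devices/system/node/nodeN/meminfo and bandwidth saturation from perf or resctrl counters, and would trigger real demotion/promotion instead of printing.]

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder metrics (hypothetical); see the note above for where a real
 * daemon might actually read these from. */
static bool dram_has_free_space(void)      { return true; }
static bool dram_bandwidth_saturated(void) { return false; }

int main(void)
{
	for (;;) {
		if (dram_bandwidth_saturated()) {
			/* 3. DRAM bandwidth saturated: push some hot pages to CXL. */
			puts("migrate hot pages to CXL nodes");
		} else if (!dram_has_free_space()) {
			/* 2. DRAM full but not saturated: demote cold, promote hot. */
			puts("demote cold pages to CXL, promote hot pages to DRAM");
		} else {
			/* 1. DRAM has space and headroom: keep allocating on DRAM. */
			puts("prefer the DRAM node for new allocations");
		}
		sleep(5);	/* poll interval: reactivity vs. overhead trade-off */
	}
}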
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-14 10:08 ` Huang, Ying @ 2025-03-14 14:15 ` Jonathan Cameron 2025-03-14 14:53 ` Gregory Price 2025-03-14 15:11 ` Joshua Hahn 2025-03-14 15:02 ` Joshua Hahn 1 sibling, 2 replies; 10+ messages in thread From: Jonathan Cameron @ 2025-03-14 14:15 UTC (permalink / raw) To: Huang, Ying Cc: Joshua Hahn, lsf-pc, linux-mm, linux-kernel, gourry, hyeonggon.yoo, honggyu.kim, kernel-team On Fri, 14 Mar 2025 18:08:35 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote: > Joshua Hahn <joshua.hahnjy@gmail.com> writes: > > > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > > > >> Hello everyone, I hope everyone has had a great start to 2025! > >> > >> Recently, I have been working on a patch series [1] with > >> Gregory Price <gourry@gourry.net> that provides new default interleave > >> weights, along with dynamic re-weighting on hotplug events and a series > >> of UAPIs that allow users to configure how they want the defaults to behave. > >> > >> In introducing these new defaults, discussions have opened up in the > >> community regarding how best to create a UAPI that can provide > >> coherent and transparent interactions for the user. In particular, consider > >> this scenario: when a hotplug event happens and a node comes online > >> with new bandwidth information (and therefore changing the bandwidth > >> distributions across the system), should user-set weights be overwritten > >> to reflect the new distributions? If so, how can we justify overwriting > >> user-set values in a sysfs interface? If not, how will users manually > >> adjust the node weights to the optimal weight? > >> > >> I would like to revisit some of the design choices made for this patch, > >> including how the defaults were derived, and open the conversation to > >> hear what the community believes is a reasonable way to allow users to > >> tune weighted interleave weights. More broadly, I hope to get gather > >> community insight on how they use weighted interleave, and do my best to > >> reflect those workflows in the patch. > > > > Weighted interleave has since moved onto v7 [1], and a v8 is currently being > > drafted. Through feedback from reviewers, we have landed on a coherent UAPI > > that gives users two options: auto mode, which leaves all weight calculation > > decisions to the system, and manual mode, which leaves weighted interleave > > the same as it is without the patch. > > > > Given that the patch's functionality is mostly concrete and that the questions > > I hoped to raise during this slot were answered via patch feedback, I hope to > > ask another question during the talk: > > > > Should the system dynamically change what metrics it uses to weight the nodes, > > based on what bottlenecks the system is currently facing? > > > > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic > > to determine what a node's weight should be. However, what if the system is > > not bottlenecked by bandwidth, but by latency? A system could also be > > bottlenecked by read bandwidth, but not by write bandwidth. > > > > Consider a scenario where a system has many memory nodes with varying > > latencies and bandwidths. When the system is not bottlenecked by bandwidth, > > it might prefer to allocate memory from nodes with lower latency. 
Once the > > system starts feeling pressured by bandwidth, the weights for high bandwidth > > (but also high latency) nodes would slowly increase to alleviate pressure > > from the system. Once the system is back in a manageable state, weights for > > low latency nodes would start increasing again. Users would not have to be > > aware of any of this -- they would just see the system take control of the > > weight changes as the system's needs continue to change. > > IIUC, this assumes the capacity of all kinds of memory is large enough. > However, this may be not true in some cases. So, another possibility is > that, for a system with DRAM and CXL memory nodes. > > - There is free space on DRAM node, the bandwidth of DRAM node isn't > saturated, memory is allocated on DRAM node. > > - There is no free space on DRAM node, the bandwidth of DRAM node isn't > saturated, cold pages are migrated to CXL memory nodes, while hot > pages are migrated to DRAM memory nodes. > > - The bandwidth of DRAM node is saturated, hot pages are migrated to CXL > memory nodes. > > In general, I think that the real situation is complex and this makes it > hard to implement a good policy in kernel. So, I suspect that it's > better to start with the experiments in user space. > > > This proposal also has some concerns that need to be addressed: > > - How reactive should the system be, and how aggressively should it tune the > > weights? We don't want the system to overreact to short spikes in pressure. > > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those > > "misplaced" pages be migrated? (probably not) > > - Does this need to be in the kernel? A userspace daemon that monitors kernel > > metrics has the ability to make the changes (via the nodeN interfaces). If this was done in kernel, what metrics would make sense to drive this? Similar to hot page tracking we may run into contention with PMUs or similar and their other use cases. > > > > Thoughts & comments are appreciated! Thank you, and have a great day! > > Joshua > > > > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ > > > > Sent using hkml (https://github.com/sjp38/hackermail) > > --- > Best Regards, > Huang, Ying > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-14 14:15 ` Jonathan Cameron @ 2025-03-14 14:53 ` Gregory Price 2025-03-14 15:11 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Gregory Price @ 2025-03-14 14:53 UTC (permalink / raw) To: Jonathan Cameron Cc: Huang, Ying, Joshua Hahn, lsf-pc, linux-mm, linux-kernel, hyeonggon.yoo, honggyu.kim, kernel-team On Fri, Mar 14, 2025 at 02:15:41PM +0000, Jonathan Cameron wrote: > > > - Does this need to be in the kernel? A userspace daemon that monitors kernel > > > metrics has the ability to make the changes (via the nodeN interfaces). > > If this was done in kernel, what metrics would make sense to drive this? > Similar to hot page tracking we may run into contention with PMUs or similar and > their other use cases. > Rather than directly affecting weighted interleave, I think this stemmed from the idea of a "smart policy" that adjusted allocations based on bandwidth pressure and VMA permissions (code should be local, stack should be local, heap could be interleaved - etc). An example would be if DRAM bandwidth became pressured but CXL wasn't, then maybe tossing some extra allocations directly to CXL would actually decrease average latencies. I'm not sure how we'd actually implement this in userland, and I think this is ultimately MPOL_PONIES, but it's an interesting exploration. Some of this context was lost as we worked on weighted interleave auto-tuning. ~Gregory ^ permalink raw reply [flat|nested] 10+ messages in thread
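[Editor's note: a toy illustration of the "smart policy" idea Gregory sketches, assuming a made-up mapping_kind enum rather than real VMA flags. It only shows the decision shape (code and stack stay local, heap spills to interleaving under bandwidth pressure) and is not an existing mempolicy.]

#include <stdbool.h>
#include <stdio.h>

enum mapping_kind { MAP_CODE, MAP_STACK, MAP_HEAP };	/* illustrative only */
enum placement    { PLACE_LOCAL, PLACE_INTERLEAVED };

static enum placement pick_placement(enum mapping_kind kind, bool local_bw_pressured)
{
	switch (kind) {
	case MAP_CODE:
	case MAP_STACK:
		return PLACE_LOCAL;	/* latency-sensitive: keep local */
	case MAP_HEAP:
	default:
		/* Spill capacity/bandwidth-heavy data only under pressure. */
		return local_bw_pressured ? PLACE_INTERLEAVED : PLACE_LOCAL;
	}
}

int main(void)
{
	printf("heap, local bandwidth pressured -> %s\n",
	       pick_placement(MAP_HEAP, true) == PLACE_INTERLEAVED ?
	       "interleave across nodes" : "allocate locally");
	return 0;
}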
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-14 14:15 ` Jonathan Cameron 2025-03-14 14:53 ` Gregory Price @ 2025-03-14 15:11 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Joshua Hahn @ 2025-03-14 15:11 UTC (permalink / raw) To: Jonathan Cameron Cc: Huang, Ying, lsf-pc, linux-mm, linux-kernel, gourry, hyeonggon.yoo, honggyu.kim, kernel-team On Fri, 14 Mar 2025 14:15:41 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > On Fri, 14 Mar 2025 18:08:35 +0800 > "Huang, Ying" <ying.huang@linux.alibaba.com> wrote: > > > Joshua Hahn <joshua.hahnjy@gmail.com> writes: > > > > > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > > > > > >> Hello everyone, I hope everyone has had a great start to 2025! > > >> > > >> Recently, I have been working on a patch series [1] with > > >> Gregory Price <gourry@gourry.net> that provides new default interleave > > >> weights, along with dynamic re-weighting on hotplug events and a series > > >> of UAPIs that allow users to configure how they want the defaults to behave. > > >> > > >> In introducing these new defaults, discussions have opened up in the > > >> community regarding how best to create a UAPI that can provide > > >> coherent and transparent interactions for the user. In particular, consider > > >> this scenario: when a hotplug event happens and a node comes online > > >> with new bandwidth information (and therefore changing the bandwidth > > >> distributions across the system), should user-set weights be overwritten > > >> to reflect the new distributions? If so, how can we justify overwriting > > >> user-set values in a sysfs interface? If not, how will users manually > > >> adjust the node weights to the optimal weight? > > >> > > >> I would like to revisit some of the design choices made for this patch, > > >> including how the defaults were derived, and open the conversation to > > >> hear what the community believes is a reasonable way to allow users to > > >> tune weighted interleave weights. More broadly, I hope to get gather > > >> community insight on how they use weighted interleave, and do my best to > > >> reflect those workflows in the patch. > > > > > > Weighted interleave has since moved onto v7 [1], and a v8 is currently being > > > drafted. Through feedback from reviewers, we have landed on a coherent UAPI > > > that gives users two options: auto mode, which leaves all weight calculation > > > decisions to the system, and manual mode, which leaves weighted interleave > > > the same as it is without the patch. > > > > > > Given that the patch's functionality is mostly concrete and that the questions > > > I hoped to raise during this slot were answered via patch feedback, I hope to > > > ask another question during the talk: > > > > > > Should the system dynamically change what metrics it uses to weight the nodes, > > > based on what bottlenecks the system is currently facing? > > > > > > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic > > > to determine what a node's weight should be. However, what if the system is > > > not bottlenecked by bandwidth, but by latency? A system could also be > > > bottlenecked by read bandwidth, but not by write bandwidth. > > > > > > Consider a scenario where a system has many memory nodes with varying > > > latencies and bandwidths. When the system is not bottlenecked by bandwidth, > > > it might prefer to allocate memory from nodes with lower latency. 
Once the > > > system starts feeling pressured by bandwidth, the weights for high bandwidth > > > (but also high latency) nodes would slowly increase to alleviate pressure > > > from the system. Once the system is back in a manageable state, weights for > > > low latency nodes would start increasing again. Users would not have to be > > > aware of any of this -- they would just see the system take control of the > > > weight changes as the system's needs continue to change. > > > > IIUC, this assumes the capacity of all kinds of memory is large enough. > > However, this may be not true in some cases. So, another possibility is > > that, for a system with DRAM and CXL memory nodes. > > > > - There is free space on DRAM node, the bandwidth of DRAM node isn't > > saturated, memory is allocated on DRAM node. > > > > - There is no free space on DRAM node, the bandwidth of DRAM node isn't > > saturated, cold pages are migrated to CXL memory nodes, while hot > > pages are migrated to DRAM memory nodes. > > > > - The bandwidth of DRAM node is saturated, hot pages are migrated to CXL > > memory nodes. > > > > In general, I think that the real situation is complex and this makes it > > hard to implement a good policy in kernel. So, I suspect that it's > > better to start with the experiments in user space. > > > > > This proposal also has some concerns that need to be addressed: > > > - How reactive should the system be, and how aggressively should it tune the > > > weights? We don't want the system to overreact to short spikes in pressure. > > > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those > > > "misplaced" pages be migrated? (probably not) > > > - Does this need to be in the kernel? A userspace daemon that monitors kernel > > > metrics has the ability to make the changes (via the nodeN interfaces). > > If this was done in kernel, what metrics would make sense to drive this? > Similar to hot page tracking we may run into contention with PMUs or similar and > their other use cases. Hello Jonathan, thank you for your interest in this proposal! Yes, I think you and Ying both bring up great points about how this is probably something more suitable for a userspace program. Userspace probably has more information about the characteristics of the workload, and I agree with your point about contention. If the kernel thread doesn't probe frequently, then it would be making poor allocation decisions based on stale data, but if it does probe frequently, it would incur lots of overhead from the contention (and make other contending threads slower as well). Not to mention, there is also the overhead of probing itself : -) I will keep thinking about these questions, and see if I can come up with any interesting ideas to discuss during LSFMMBPF. Thank you again for your interest, I hope you have a great day! Joshua > > > > > > Thoughts & comments are appreciated! Thank you, and have a great day! > > > Joshua > > > > > > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ > > > > > > Sent using hkml (https://github.com/sjp38/hackermail) > > > > --- > > Best Regards, > > Huang, Ying > > Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 10+ messages in thread
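[Editor's note: for completeness, a small sketch of the userspace-daemon route discussed here, writing weights through the existing weighted_interleave nodeN sysfs files. The node numbers and the 4:1 ratio are arbitrary examples; a real daemon would derive them from whatever pressure metric it monitors.]

#include <stdio.h>

/* Write a weight to the existing weighted_interleave sysfs interface. */
static int set_node_weight(int node, unsigned int weight)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", node);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%u\n", weight);
	return fclose(f);
}

int main(void)
{
	/* Example only: bias node 0 (DRAM) 4:1 over node 1 (CXL). */
	if (set_node_weight(0, 4) || set_node_weight(1, 1))
		perror("set_node_weight");
	return 0;
}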
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-14 10:08 ` Huang, Ying 2025-03-14 14:15 ` Jonathan Cameron @ 2025-03-14 15:02 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Joshua Hahn @ 2025-03-14 15:02 UTC (permalink / raw) To: Huang, Ying Cc: lsf-pc, linux-mm, linux-kernel, gourry, hyeonggon.yoo, honggyu.kim, kernel-team On Fri, 14 Mar 2025 18:08:35 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote: > Joshua Hahn <joshua.hahnjy@gmail.com> writes: > > > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > > > >> Hello everyone, I hope everyone has had a great start to 2025! > >> > >> Recently, I have been working on a patch series [1] with > >> Gregory Price <gourry@gourry.net> that provides new default interleave > >> weights, along with dynamic re-weighting on hotplug events and a series > >> of UAPIs that allow users to configure how they want the defaults to behave. > >> > >> In introducing these new defaults, discussions have opened up in the > >> community regarding how best to create a UAPI that can provide > >> coherent and transparent interactions for the user. In particular, consider > >> this scenario: when a hotplug event happens and a node comes online > >> with new bandwidth information (and therefore changing the bandwidth > >> distributions across the system), should user-set weights be overwritten > >> to reflect the new distributions? If so, how can we justify overwriting > >> user-set values in a sysfs interface? If not, how will users manually > >> adjust the node weights to the optimal weight? > >> > >> I would like to revisit some of the design choices made for this patch, > >> including how the defaults were derived, and open the conversation to > >> hear what the community believes is a reasonable way to allow users to > >> tune weighted interleave weights. More broadly, I hope to get gather > >> community insight on how they use weighted interleave, and do my best to > >> reflect those workflows in the patch. > > > > Weighted interleave has since moved onto v7 [1], and a v8 is currently being > > drafted. Through feedback from reviewers, we have landed on a coherent UAPI > > that gives users two options: auto mode, which leaves all weight calculation > > decisions to the system, and manual mode, which leaves weighted interleave > > the same as it is without the patch. > > > > Given that the patch's functionality is mostly concrete and that the questions > > I hoped to raise during this slot were answered via patch feedback, I hope to > > ask another question during the talk: > > > > Should the system dynamically change what metrics it uses to weight the nodes, > > based on what bottlenecks the system is currently facing? > > > > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic > > to determine what a node's weight should be. However, what if the system is > > not bottlenecked by bandwidth, but by latency? A system could also be > > bottlenecked by read bandwidth, but not by write bandwidth. > > > > Consider a scenario where a system has many memory nodes with varying > > latencies and bandwidths. When the system is not bottlenecked by bandwidth, > > it might prefer to allocate memory from nodes with lower latency. Once the > > system starts feeling pressured by bandwidth, the weights for high bandwidth > > (but also high latency) nodes would slowly increase to alleviate pressure > > from the system. 
Once the system is back in a manageable state, weights for > > low latency nodes would start increasing again. Users would not have to be > > aware of any of this -- they would just see the system take control of the > > weight changes as the system's needs continue to change. > > IIUC, this assumes the capacity of all kinds of memory is large enough. > However, this may be not true in some cases. So, another possibility is > that, for a system with DRAM and CXL memory nodes. > > - There is free space on DRAM node, the bandwidth of DRAM node isn't > saturated, memory is allocated on DRAM node. > > - There is no free space on DRAM node, the bandwidth of DRAM node isn't > saturated, cold pages are migrated to CXL memory nodes, while hot > pages are migrated to DRAM memory nodes. > > - The bandwidth of DRAM node is saturated, hot pages are migrated to CXL > memory nodes. > > In general, I think that the real situation is complex and this makes it > hard to implement a good policy in kernel. So, I suspect that it's > better to start with the experiments in user space. Hi Ying, thank you so much for your feedback, as always! Yes, I agree. I brought up this idea out of curiosity, since I thought that there might be room to experiment with different configurations for weighted interleave auto-tuning. As you know, we use min(read_bw, write_bw), which I think is a good heuristic that works for the intent of the weighted interleave auto-tuning patch-- I wanted to know what a system might look like, that might use different heuristics given the system's state. But I think you are right that it is difficult to implement in kernel. Thanks again, Ying! Will you be attending LSFMMBPF this year? I would love to say hello in person : -) Have a great day! Joshua > > This proposal also has some concerns that need to be addressed: > > - How reactive should the system be, and how aggressively should it tune the > > weights? We don't want the system to overreact to short spikes in pressure. > > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those > > "misplaced" pages be migrated? (probably not) > > - Does this need to be in the kernel? A userspace daemon that monitors kernel > > metrics has the ability to make the changes (via the nodeN interfaces). > > > > Thoughts & comments are appreciated! Thank you, and have a great day! > > Joshua > > > > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/ > > > > Sent using hkml (https://github.com/sjp38/hackermail) > > --- > Best Regards, > Huang, Ying Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-01-09 18:50 [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning Joshua Hahn 2025-03-13 15:57 ` Joshua Hahn @ 2025-03-27 11:11 ` Oscar Salvador 2025-03-27 12:39 ` Gregory Price 2025-03-27 15:46 ` Joshua Hahn 1 sibling, 2 replies; 10+ messages in thread From: Oscar Salvador @ 2025-03-27 11:11 UTC (permalink / raw) To: Joshua Hahn Cc: lsf-pc, linux-mm, linux-kernel, gourry, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team On Thu, Jan 09, 2025 at 01:50:48PM -0500, Joshua Hahn wrote: > Hello everyone, I hope everyone has had a great start to 2025! Hi Joshua, as discussed at LSFMM about how you can react to nodes becoming memory{aware,less}, you can register a hotplug memory notifier, as memory-tiering currently does. The current use of the hotplug memory notifier by some consumers (e.g.: memory-tiering, slub, etc) is a bit suboptimal, as they only care about nodes changing their memory state, yet they get notified for every {online,offline}_pages operation. I came up with [1]. I did not publish it yet upstream because I wanted to discuss it a bit with David, but you can give it a try to see if it works for you. But till it is upstream, you will have to use the hotplug memory notifier. [1] https://github.com/leberus/linux.git numa-node-notifier -- Oscar Salvador SUSE Labs ^ permalink raw reply [flat|nested] 10+ messages in thread
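[Editor's note: for reference, a hedged kernel-side sketch of the notifier pattern Oscar describes, modelled on how memory-tiering consumes hotplug events today: it filters on status_change_nid so the policy code only reacts when a node actually gains or loses memory, ignoring other {online,offline}_pages operations. mempolicy_recompute_weights() is a hypothetical placeholder for the weighted-interleave rebalancing, not an existing function.]

#include <linux/memory.h>
#include <linux/module.h>
#include <linux/notifier.h>

static int wi_memory_callback(struct notifier_block *nb,
			      unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;

	/* No node gained or lost memory; nothing to re-weight. */
	if (mn->status_change_nid < 0)
		return NOTIFY_OK;

	switch (action) {
	case MEM_ONLINE:
	case MEM_OFFLINE:
		/* Hypothetical hook: recompute defaults from new bandwidths. */
		/* mempolicy_recompute_weights(mn->status_change_nid); */
		break;
	default:
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block wi_memory_nb = {
	.notifier_call = wi_memory_callback,
};

static int __init wi_notifier_init(void)
{
	return register_memory_notifier(&wi_memory_nb);
}
module_init(wi_notifier_init);

MODULE_LICENSE("GPL");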
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-27 11:11 ` Oscar Salvador @ 2025-03-27 12:39 ` Gregory Price 2025-03-27 15:46 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Gregory Price @ 2025-03-27 12:39 UTC (permalink / raw) To: Oscar Salvador Cc: Joshua Hahn, lsf-pc, linux-mm, linux-kernel, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team, yunjeong.mun, rakie.kim On Thu, Mar 27, 2025 at 12:11:55PM +0100, Oscar Salvador wrote: > Hi Joshua, > > as discussed in the LSFMM about how you can react to nodes becoming > memory{aware,less}, you can register a hotplug memory notifier, as > memory-tiering currently does. > > The current use of the hotplug memory notifier by some consumers (e.g: > memory-tiering, slub, etc) is a bit suboptimal, as they only care about > nodes changing its memory state, yet they get notified for every > {online,offline}_pages operation. > > I came up with [1] > > I did not publish it yet upstream because I wanted to discuss it a bit > with David, but you can give it a try to see if it works for you. > But till it is upstream, you will have to use the hotplug memory > notifier. > > [1] https://github.com/leberus/linux.git numa-node-notifier > +CC: Yunjeong Mun and Rakie Kim Something to consider as a follow up to your series. Thanks Oscar, we were just discussing this. Seems there's multiple users doing the same thing, so it seems reasonable to discuss. This would probably deal with my race condition concerns here as well: https://lore.kernel.org/linux-mm/20250325102804.1020-1-rakie.kim@sk.com/ > -- > Oscar Salvador > SUSE Labs ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning 2025-03-27 11:11 ` Oscar Salvador 2025-03-27 12:39 ` Gregory Price @ 2025-03-27 15:46 ` Joshua Hahn 1 sibling, 0 replies; 10+ messages in thread From: Joshua Hahn @ 2025-03-27 15:46 UTC (permalink / raw) To: Oscar Salvador Cc: lsf-pc, linux-mm, linux-kernel, gourry, ying.huang, hyeonggon.yoo, honggyu.kim, kernel-team On Thu, 27 Mar 2025 12:11:55 +0100 Oscar Salvador <osalvador@suse.de> wrote: > On Thu, Jan 09, 2025 at 01:50:48PM -0500, Joshua Hahn wrote: > > Hello everyone, I hope everyone has had a great start to 2025! > > Hi Joshua, > > as discussed in the LSFMM about how you can react to nodes becoming > memory{aware,less}, you can register a hotplug memory notifier, as > memory-tiering currently does. > > The current use of the hotplug memory notifier by some consumers (e.g: > memory-tiering, slub, etc) is a bit suboptimal, as they only care about > nodes changing its memory state, yet they get notified for every > {online,offline}_pages operation. > > I came up with [1] > > I did not publish it yet upstream because I wanted to discuss it a bit > with David, but you can give it a try to see if it works for you. > But till it is upstream, you will have to use the hotplug memory > notifier. Hi Oscar, This is great, thank you for taking a look at this. I'll also do some experimenting on my end and let you know with any results that I see from my end, using weighted interleave. In the meantime yes -- I think the hotplug memory notifier should do the trick. Thank you again, have a great day! Joshua > [1] https://github.com/leberus/linux.git numa-node-notifier > > -- > Oscar Salvador > SUSE Labs Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 10+ messages in thread