From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, gourry@gourry.net,
hyeonggon.yoo@sk.com, honggyu.kim@sk.com, kernel-team@meta.com
Subject: Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning
Date: Fri, 14 Mar 2025 08:02:46 -0700
Message-ID: <20250314150248.774524-1-joshua.hahnjy@gmail.com>
In-Reply-To: <87frjfx6u4.fsf@DESKTOP-5N7EMDA>

On Fri, 14 Mar 2025 18:08:35 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>
> > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> >
> >> Hello everyone, I hope you have all had a great start to 2025!
> >>
> >> Recently, I have been working on a patch series [1] with
> >> Gregory Price <gourry@gourry.net> that provides new default interleave
> >> weights, along with dynamic re-weighting on hotplug events and a series
> >> of UAPIs that allow users to configure how they want the defaults to behave.
> >>
> >> The introduction of these new defaults has opened up discussions in the
> >> community regarding how best to create a UAPI that provides coherent and
> >> transparent interactions for the user. In particular, consider
> >> this scenario: when a hotplug event happens and a node comes online
> >> with new bandwidth information (and therefore changing the bandwidth
> >> distributions across the system), should user-set weights be overwritten
> >> to reflect the new distributions? If so, how can we justify overwriting
> >> user-set values in a sysfs interface? If not, how will users manually
> >> adjust the node weights to their optimal values?
> >>
> >> I would like to revisit some of the design choices made for this patch,
> >> including how the defaults were derived, and open the conversation to
> >> hear what the community believes is a reasonable way to allow users to
> >> tune weighted interleave weights. More broadly, I hope to gather
> >> community insight into how people use weighted interleave, and do my best
> >> to reflect those workflows in the patch.
> >
> > Weighted interleave has since moved on to v7 [1], and a v8 is currently
> > being drafted. Through feedback from reviewers, we have landed on a coherent
> > UAPI that gives users two options: auto mode, which leaves all weight
> > calculation decisions to the system, and manual mode, which leaves weighted
> > interleave behaving exactly as it does without the patch.
> >
> > Given that the patch's functionality is now mostly settled, and that the
> > questions I hoped to raise during this slot were answered via patch
> > feedback, I hope to ask another question during the talk:
> >
> > Should the system dynamically change what metrics it uses to weight the nodes,
> > based on what bottlenecks the system is currently facing?
> >
> > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic
> > to determine what a node's weight should be. However, what if the system is
> > not bottlenecked by bandwidth, but by latency? A system could also be
> > bottlenecked by read bandwidth, but not by write bandwidth.
> >
> > Consider a scenario where a system has many memory nodes with varying
> > latencies and bandwidths. When the system is not bottlenecked by bandwidth,
> > it might prefer to allocate memory from nodes with lower latency. Once the
> > system starts to feel bandwidth pressure, the weights for high-bandwidth
> > (but also high-latency) nodes would slowly increase to relieve that
> > pressure. Once the system is back in a manageable state, weights for
> > low-latency nodes would start increasing again. Users would not have to be
> > aware of any of this -- they would just see the system take control of the
> > weight changes as the system's needs continue to change.
>
> IIUC, this assumes that the capacity of every kind of memory is large
> enough. However, this may not be true in some cases. So, for a system
> with DRAM and CXL memory nodes, another possibility is:
>
> - While there is free space on the DRAM node and its bandwidth isn't
> saturated, memory is allocated on the DRAM node.
>
> - When there is no free space on the DRAM node but its bandwidth isn't
> saturated, cold pages are migrated to the CXL memory nodes, while hot
> pages are migrated to the DRAM node.
>
> - Once the bandwidth of the DRAM node is saturated, hot pages are
> migrated to the CXL memory nodes as well.
>
> In general, I think that the real situation is complex and this makes it
> hard to implement a good policy in the kernel. So, I suspect that it's
> better to start with experiments in user space.

Hi Ying, thank you so much for your feedback, as always!

Yes, I agree. I brought up this idea out of curiosity, since I thought that
there might be room to experiment with different configurations for weighted
interleave auto-tuning. As you know, we use min(read_bw, write_bw), which I
think is a good heuristic for the intent of the weighted interleave
auto-tuning patch -- what I wanted to explore is what a system that picks
different heuristics based on its current state might look like. But I think
you are right that this would be difficult to implement in the kernel.
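
Just to make the current heuristic concrete, here is a rough userspace sketch
of the idea. The bandwidth numbers are made up, and the GCD reduction is only
one plausible way to keep the weights small -- the in-kernel implementation
may normalize differently:

#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
        while (b) {
                unsigned int t = b;

                b = a % b;
                a = t;
        }
        return a;
}

int main(void)
{
        /* Hypothetical per-node bandwidths in MB/s: { read, write }. */
        unsigned int bw[][2] = { { 120000, 100000 }, { 30000, 25000 } };
        unsigned int w[2], g = 0;
        int i;

        for (i = 0; i < 2; i++) {
                /* Weight each node by min(read_bw, write_bw)... */
                w[i] = bw[i][0] < bw[i][1] ? bw[i][0] : bw[i][1];
                g = gcd(g, w[i]);
        }

        /* ...then reduce by the GCD so the weights stay small. */
        for (i = 0; i < 2; i++)
                printf("node%d weight: %u\n", i, w[i] / g);

        return 0;
}

With these made-up numbers, node0 gets weight 4 and node1 gets weight 1, so
allocations interleave 4:1 across the two nodes. The userspace side of your
suggestion is sketched in the P.S. below.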

Thanks again, Ying! Will you be attending LSF/MM/BPF this year? I would love
to say hello in person :-)

Have a great day!
Joshua

> > This proposal also has some concerns that need to be addressed:
> > - How reactive should the system be, and how aggressively should it tune
> > the weights? We don't want the system to overreact to short spikes in
> > pressure.
> > - Does dynamic weight adjustment lead to pages being "misplaced"? Should
> > those "misplaced" pages be migrated? (Probably not.)
> > - Does this need to be in the kernel? A userspace daemon that monitors
> > kernel metrics could make the same changes (via the nodeN interfaces).
> >
> > Thoughts & comments are appreciated! Thank you, and have a great day!
> > Joshua
> >
> > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/
> >
> > Sent using hkml (https://github.com/sjp38/hackermail)
>
> ---
> Best Regards,
> Huang, Ying
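
P.S. On the question of whether this needs to be in the kernel: for anyone
who wants to try the userspace experiments suggested above, the mechanism
already exists. Below is a minimal, hypothetical sketch that only shows how a
daemon would push new weights through the per-node sysfs interface; the
interesting parts -- monitoring (e.g. memory pressure or bandwidth counters)
and choosing the new weights -- are left entirely open:

#include <stdio.h>

/* Write a new interleave weight for one node (needs root). */
static int set_node_weight(int nid, unsigned int weight)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", nid);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%u\n", weight);
        return fclose(f);
}

int main(void)
{
        /* Hypothetical policy decision: bias node0 4:1 over node1. */
        if (set_node_weight(0, 4) || set_node_weight(1, 1))
                perror("set_node_weight");
        return 0;
}
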
Sent using hkml (https://github.com/sjp38/hackermail)