From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, gourry@gourry.net,
hyeonggon.yoo@sk.com, honggyu.kim@sk.com, kernel-team@meta.com
Subject: Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning
Date: Fri, 14 Mar 2025 08:02:46 -0700
Message-ID: <20250314150248.774524-1-joshua.hahnjy@gmail.com>
In-Reply-To: <87frjfx6u4.fsf@DESKTOP-5N7EMDA>

On Fri, 14 Mar 2025 18:08:35 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>
> > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> >
> >> Hello everyone, I hope you have all had a great start to 2025!
> >>
> >> Recently, I have been working on a patch series [1] with
> >> Gregory Price <gourry@gourry.net> that provides new default interleave
> >> weights, along with dynamic re-weighting on hotplug events and a series
> >> of UAPIs that allow users to configure how they want the defaults to behave.
> >>
> >> The introduction of these new defaults has opened up discussions in the
> >> community regarding how best to create a UAPI that provides coherent and
> >> transparent interactions for the user. In particular, consider
> >> this scenario: when a hotplug event happens and a node comes online
> >> with new bandwidth information (and therefore changing the bandwidth
> >> distributions across the system), should user-set weights be overwritten
> >> to reflect the new distributions? If so, how can we justify overwriting
> >> user-set values in a sysfs interface? If not, how will users manually
> >> adjust the node weights to their optimal values?
> >>
> >> I would like to revisit some of the design choices made for this patch,
> >> including how the defaults were derived, and open the conversation to
> >> hear what the community believes is a reasonable way to allow users to
> >> tune weighted interleave weights. More broadly, I hope to gather
> >> community insight into how people use weighted interleave, and do my best
> >> to reflect those workflows in the patch.
> >
> > Weighted interleave has since moved on to v7 [1], and a v8 is currently
> > being drafted. Through feedback from reviewers, we have landed on a coherent
> > UAPI that gives users two options: auto mode, which leaves all weight
> > calculation decisions to the system, and manual mode, which leaves weighted
> > interleave behaving exactly as it does without the patch.
> >
> > Given that the patch's functionality is now mostly settled, and that the
> > questions I hoped to raise during this slot were answered via patch
> > feedback, I hope to ask another question during the talk:
> >
> > Should the system dynamically change what metrics it uses to weight the nodes,
> > based on what bottlenecks the system is currently facing?
> >
> > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic
> > to determine what a node's weight should be. However, what if the system is
> > not bottlenecked by bandwidth, but by latency? A system could also be
> > bottlenecked by read bandwidth, but not by write bandwidth.
> >
> > Consider a scenario where a system has many memory nodes with varying
> > latencies and bandwidths. When the system is not bottlenecked by bandwidth,
> > it might prefer to allocate memory from nodes with lower latency. Once the
> > system starts to feel bandwidth pressure, the weights for high-bandwidth
> > (but also high-latency) nodes would slowly increase to relieve that
> > pressure. Once the system is back in a manageable state, weights for
> > low-latency nodes would start increasing again. Users would not have to be
> > aware of any of this -- they would just see the system take control of the
> > weight changes as the system's needs continue to change.
>
> IIUC, this assumes that the capacity of every kind of memory is large
> enough. However, this may not be true in some cases. So, for a system
> with DRAM and CXL memory nodes, another possibility is:
>
> - While there is free space on the DRAM node and its bandwidth isn't
> saturated, memory is allocated on the DRAM node.
>
> - When there is no free space on the DRAM node but its bandwidth isn't
> saturated, cold pages are migrated to the CXL memory nodes, while hot
> pages are migrated to the DRAM node.
>
> - Once the bandwidth of the DRAM node is saturated, hot pages are
> migrated to the CXL memory nodes as well.
>
> In general, I think that the real situation is complex and this makes it
> hard to implement a good policy in the kernel. So, I suspect that it's
> better to start with experiments in user space.

Hi Ying, thank you so much for your feedback, as always!

Yes, I agree. I brought up this idea out of curiosity, since I thought that
there might be room to experiment with different configurations for weighted
interleave auto-tuning. As you know, we use min(read_bw, write_bw), which I
think is a good heuristic for the intent of the weighted interleave
auto-tuning patch -- what I wanted to explore is what a system that picks
different heuristics based on its current state might look like. But I think
you are right that this would be difficult to implement in the kernel.
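
Just to make the current heuristic concrete, here is a rough userspace sketch
of the idea. The bandwidth numbers are made up, and the GCD reduction is only
one plausible way to keep the weights small -- the in-kernel implementation
may normalize differently:

#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
        while (b) {
                unsigned int t = b;

                b = a % b;
                a = t;
        }
        return a;
}

int main(void)
{
        /* Hypothetical per-node bandwidths in MB/s: { read, write }. */
        unsigned int bw[][2] = { { 120000, 100000 }, { 30000, 25000 } };
        unsigned int w[2], g = 0;
        int i;

        for (i = 0; i < 2; i++) {
                /* Weight each node by min(read_bw, write_bw)... */
                w[i] = bw[i][0] < bw[i][1] ? bw[i][0] : bw[i][1];
                g = gcd(g, w[i]);
        }

        /* ...then reduce by the GCD so the weights stay small. */
        for (i = 0; i < 2; i++)
                printf("node%d weight: %u\n", i, w[i] / g);

        return 0;
}

With these made-up numbers, node0 gets weight 4 and node1 gets weight 1, so
allocations interleave 4:1 across the two nodes. The userspace side of your
suggestion is sketched in the P.S. below.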

Thanks again, Ying! Will you be attending LSF/MM/BPF this year? I would love
to say hello in person :-)

Have a great day!
Joshua

> > This proposal also has some concerns that need to be addressed:
> > - How reactive should the system be, and how aggressively should it tune
> > the weights? We don't want the system to overreact to short spikes in
> > pressure.
> > - Does dynamic weight adjustment lead to pages being "misplaced"? Should
> > those "misplaced" pages be migrated? (Probably not.)
> > - Does this need to be in the kernel? A userspace daemon that monitors
> > kernel metrics could make the same changes (via the nodeN interfaces).
> >
> > Thoughts & comments are appreciated! Thank you, and have a great day!
> > Joshua
> >
> > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/
> >
> > Sent using hkml (https://github.com/sjp38/hackermail)
>
> ---
> Best Regards,
> Huang, Ying
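
P.S. On the question of whether this needs to be in the kernel: for anyone
who wants to try the userspace experiments suggested above, the mechanism
already exists. Below is a minimal, hypothetical sketch that only shows how a
daemon would push new weights through the per-node sysfs interface; the
interesting parts -- monitoring (e.g. memory pressure or bandwidth counters)
and choosing the new weights -- are left entirely open:

#include <stdio.h>

/* Write a new interleave weight for one node (needs root). */
static int set_node_weight(int nid, unsigned int weight)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", nid);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%u\n", weight);
        return fclose(f);
}

int main(void)
{
        /* Hypothetical policy decision: bias node0 4:1 over node1. */
        if (set_node_weight(0, 4) || set_node_weight(1, 1))
                perror("set_node_weight");
        return 0;
}
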
Sent using hkml (https://github.com/sjp38/hackermail)