From: SeongJae Park <sj@kernel.org>
To: Honggyu Kim <honggyu.kim@sk.com>
Cc: SeongJae Park <sj@kernel.org>,
	David Rientjes <rientjes@google.com>,
	kernel_team@skhynix.com, Davidlohr Bueso <dave@stgolabs.net>,
	Fan Ni <nifan.cxl@gmail.com>, Gregory Price <gourry@gourry.net>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Joshua Hahn <joshua.hahnjy@gmail.com>,
	Raghavendra K T <rkodsara@amd.com>,
	"Rao, Bharata Bhasker" <bharata@amd.com>,
	Wei Xu <weixugc@google.com>,
	Xuezheng Chu <xuezhengchu@huawei.com>,
	Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org, damon@lists.linux.dev,
	Yunjeong Mun <yunjeong.mun@sk.com>
Subject: Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
Date: Thu, 20 Nov 2025 18:27:02 -0800
Message-ID: <20251121022703.134685-1-sj@kernel.org>
In-Reply-To: <98c0907c-0435-45d2-bd68-e97598b79d0e@sk.com>

On Mon, 17 Nov 2025 20:36:59 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:

> Hi SJ, David, Ravi and all,
> 
> On 11/14/2025 10:42 AM, SeongJae Park wrote:
[...]
> > The memory capacity extension solution of HMSDK [1], developed by SK Hynix,
> > is one good example.  To my understanding (please correct me if I'm wrong),
> > HMSDK provides separate solutions for bandwidth and capacity expansion.  The
> > user should first understand whether their workload is bandwidth-hungry or
> > capacity-hungry, and select the proper solution.  I suspect this was one of
> > the reasons for Ravi's concern.
> 
> Yeah, your understanding is correct in HMSDK's case.

Thank you for confirming!

> 
> > I also recently developed a DAMON-based memory tiering approach [2] that
> > implements the main idea of TPP [3]: promoting hot pages and demoting cold
> > pages, aiming at a target level of the faster node's space utilization.  I
> > didn't see the bandwidth issue in my simple tests of it, but I think the
> > very same problem could apply to both the DAMON-based approach and the
> > original TPP implementation.
> > 
> >>
> >> Ravi suggested adaptive interleaving of memory to optimize both bandwidth
> >> and capacity utilization.  He suggested an approach with a migrator in
> >> kernel space and a calibrator in userspace.  The calibrator would monitor
> >> system bandwidth utilization and, by trying different weights, determine
> >> the optimal weights for interleaving the hot pages for the highest bandwidth.
> 
> I also think that monitoring bandwidth makes sense.  We recently released a
> tool called bwprof for bandwidth recording and monitoring, based on Intel PCM.
> https://github.com/skhynix/hmsdk/blob/hmsdk-v4.0/tools/bwprof/bwprof.cc
> 
> This tool can be slightly changed to monitor bandwidth and write it to some
> sysfs interface knobs for this purpose.

Thank you for introducing the tool.  I think it can be useful not only for
this case, but also for general investigation and optimization of this kind
of memory system.
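
For example, a calibrator on top of such a tool might look like the sketch
below (Python).  To be clear, the bandwidth reader and the sysfs knob path
here are assumptions made up for illustration, not existing interfaces, and
the weight policy is deliberately naive:

    import time

    # Assumed peak bandwidth per node (MB/s); purely illustrative numbers.
    FAST_MAX_MBPS, SLOW_MAX_MBPS = 300_000, 60_000

    def read_node_bandwidth_mbps(nid):
        # Placeholder returning fixed values; a real calibrator would parse
        # the output of a tool like bwprof or Intel PCM here.
        return 150_000 if nid == 0 else 20_000

    # Hypothetical sysfs knob an in-kernel migrator could read weights from.
    WEIGHTS_KNOB = "/sys/kernel/mm/example_migrator/interleave_weights"

    def calibrate(fast_nid=0, slow_nid=1, period_s=1):
        while True:
            # Naive policy: weight each node by its remaining headroom.
            fast_head = max(FAST_MAX_MBPS - read_node_bandwidth_mbps(fast_nid), 0)
            slow_head = max(SLOW_MAX_MBPS - read_node_bandwidth_mbps(slow_nid), 0)
            total = (fast_head + slow_head) or 1
            fast_w = max(1, round(10 * fast_head / total))
            slow_w = max(1, round(10 * slow_head / total))
            with open(WEIGHTS_KNOB, "w") as f:
                f.write("%d:%d %d:%d" % (fast_nid, fast_w, slow_nid, slow_w))
            time.sleep(period_s)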

> 
> >> If bandwidth saturation is not hit, only cold pages get demoted.  The
> >> migrator reads the target interleave ratio from the calibrator, rearranges
> >> the hot pages accordingly, and demotes cold pages to the target node.
> >> Currently this uses the DAMOS actions migrate_hot and migrate_cold.
> > 
> > This implementation makes sense to me, especially if the intended use case
> > is specific virtual address spaces.
> 
> I think the current issues with adaptive weighted interleaving are as follows.
> 1. Adaptive interleaving only works in virtual address mode.

This is true.  But I don't really think this is an issue, since I have found
no clear use case for physical address mode interleaving.  We have a clear
use case for virtual mode DAMOS-based interleaving, and I have heard of no
problem from that use case, so I think all is well.

By the way, "interleaving" in this context is somewhat confusing to me.
Technically speaking, it is DAMOS_MIGRATE_{HOT,COLD} towards multiple
destination nodes with different weights.  How it should be implemented on
the physical address space (whether to decide the migration target node of
each page based on its physical address or its virtual address) was discussed
on the patch series for multiple migration destination nodes, but we haven't
found a good answer so far.  That's one of the reasons why physical mode
DAMOS migration to multiple destination nodes is not yet supported.
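
To make the open question concrete, below is a toy sketch of the two
candidate policies.  This is not DAMON code; all names are made up for
illustration:

    # Given weighted destination nodes, e.g., [(0, 3), (1, 1)], pick a
    # migration target node deterministically from a page address.
    def pick_node(addr, weighted_nodes, page_shift=12):
        total = sum(w for _, w in weighted_nodes)
        slot = (addr >> page_shift) % total
        for nid, weight in weighted_nodes:
            if slot < weight:
                return nid
            slot -= weight

    # Candidate A: decide from the physical address, e.g.,
    # pick_node(pfn << 12, weights).  Stable across processes, but unrelated
    # to how any single workload lays out its accesses.
    # Candidate B: decide from the virtual address, e.g.,
    # pick_node(vaddr, weights).  Follows one mapping's access layout, but
    # the same page can be mapped at different virtual addresses.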

I understand you are saying it would be nice if Ravi's general idea
(optimizing both bandwidth and capacity) could be implemented not only for
virtual address spaces but also for the physical address space, since that
would be easier for sysadmins.  If so, I agree.  Please correct me if I'm
getting you wrong, though.

> 2. It scans all the pages and redistributes them based on the given weight
>     ratios, which limits its general usage as of now.

I think it depends on the detailed usage.  In this specific use case, to my
understanding (correct me if I'm wrong, Ravi), the user-space tool applies
interleaving (or, DAMOS_MIGRATE_HOT to multiple destination nodes) only to
hot pages.  Hence the scanning for interleaving is executed only on
DAMON-found hot pages.  Also, users may use DAMOS quotas or similar features
to further tune the overhead.
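
For example, assuming a DAMOS scheme is already set up as scheme 0 of the
first kdamond, its quota knobs could be tightened via the DAMON sysfs
interface like below.  This is just a sketch; the paths follow the sysfs
layout documented in the kernel, and the values are arbitrary:

    import os

    QUOTAS = "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas"

    def set_quota(time_ms, size_bytes, reset_interval_ms):
        # Limit the scheme to 'time_ms' milliseconds of work and
        # 'size_bytes' bytes of migration per 'reset_interval_ms' window.
        for knob, val in (("ms", time_ms), ("bytes", size_bytes),
                          ("reset_interval_ms", reset_interval_ms)):
            with open(os.path.join(QUOTAS, knob), "w") as f:
                f.write("%d" % val)

    # E.g., at most 10 ms of work and 256 MiB of migration per second.
    set_quota(10, 256 * 1024 * 1024, 1000)

    # Apply the updated parameters to the running kdamond.
    with open("/sys/kernel/mm/damon/admin/kdamonds/0/state", "w") as f:
        f.write("commit")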

Maybe my humble edit of the original mail confused you about Ravi's
implementation details?  Sorry if that's the case.

> 
> > Nevertheless, if a physical address space
> > based version is also an option, I think there could be yet another way to
> > achieve the goal (optimizing both bandwidth and capacity).
> 
> 3. As mentioned above, having a physical address mode is needed, but it means
>     scanning the entire physical address space and redistributing pages, which
>     might require too much overhead in practice.

Same as my reply to your second point above: I think the overhead could be
controlled by adjusting the target page hotness and/or DAMOS quotas.  Or I
might be misreading your opinion.  Please feel free to correct me in that
case.

Anyway, my idea is not very different from Ravi's.  It is just a more simply
rephrased version of it.  In essence, I only changed the word 'interleave',
whose behavior on the physical address space is not very clear to me, to
'migrate_hot', and gave a more concrete example using DAMON user-space tool
commands.

> 
> > My idea is to tweak the TPP idea a little bit: migrate pages among NUMA
> > nodes aiming at a target level of both space and bandwidth utilization of
> > the faster (e.g., DRAM) node.  In more detail, do hot page promotion and
> > cold page demotion for the target level of faster node space utilization,
> > same as the original TPP idea.  But stop hot page promotion if the memory
> > bandwidth consumption of the faster node exceeds a level.  In that case,
> > instead, start demoting _hot_ pages until the memory bandwidth consumption
> > on the faster node decreases below the limit level.
[...]
> As I mentioned at the top of this mail, I think this work makes sense in theory

Glad to get public confirmation that I'm not the only one who sees what I
see :D

> but would like to find some practical workloads that can benefit from this
> work.  I would be grateful if someone could share practical use cases on
> large scale memory systems.

Fully agreed.  Buildable code is much better than words, and test results are
even better than such code.

Nevertheless, I have no good answer on practical use cases for my idea, for
now.  I even have no plan to look for one myself at the moment, mainly
because I don't have CXL memory to test with.  So please don't be blocked by
me.  I will be more than happy to help at any chance, though, as always :)
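
In case it helps whoever picks this up, here is a minimal sketch of the
control loop I have in mind.  All the helpers below are hypothetical
placeholders (the promotion/demotion ones would map to, e.g., DAMOS
migrate_hot/migrate_cold schemes, the measurement ones to a PCM/bwprof-style
reader), and the thresholds are arbitrary:

    import time

    def fast_node_free_ratio():
        return 0.10        # placeholder: free/total memory of the fast node
    def fast_node_bandwidth_mbps():
        return 200_000     # placeholder: measured fast node bandwidth
    def promote_hot_pages():
        pass               # e.g., DAMOS migrate_hot towards the fast node
    def demote_cold_pages():
        pass               # e.g., DAMOS migrate_cold towards the slow node
    def demote_hot_pages():
        pass               # the unusual part: shed load off the fast node

    FREE_LOW, FREE_HIGH = 0.05, 0.20   # space watermarks (arbitrary)
    BW_LIMIT_MBPS = 250_000            # fast node bandwidth limit (arbitrary)

    def tier(period_s=1):
        while True:
            if fast_node_bandwidth_mbps() > BW_LIMIT_MBPS:
                # The fast node is bandwidth-saturated: stop promotion and
                # demote even hot pages until it goes below the limit.
                demote_hot_pages()
            else:
                # Original TPP-like behavior, driven by space watermarks.
                if fast_node_free_ratio() < FREE_LOW:
                    demote_cold_pages()
                elif fast_node_free_ratio() > FREE_HIGH:
                    promote_hot_pages()
            time.sleep(period_s)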


Thanks,
SJ

[...]


