linux-mm.kvack.org archive mirror
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: SeongJae Park <sj@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>,
	Bharata B Rao <bharata@amd.com>,
	<lsf-pc@lists.linux-foundation.org>, <linux-mm@kvack.org>,
	Michal Hocko <mhocko@suse.com>,
	Dan Williams <dan.j.williams@intel.com>, <linuxarm@huawei.com>,
	Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Gregory Price <gourry@gourry.net>
Subject: Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
Date: Fri, 21 Mar 2025 15:30:44 +0000	[thread overview]
Message-ID: <20250321153044.000017aa@huawei.com> (raw)
In-Reply-To: <20250319235029.54378-1-sj@kernel.org>


> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==========================================================================================
> > Source			Subsystem	Consumption		Activation/Frequency
> > ==========================================================================================
> > PROT_NONE faults	NUMAB		NUMAB=1 locality based	While task is running,
> > via process pgtable			balancing		rate varies on observed
> > walk					NUMAB=2 hot page	locality and sysctl knobs.
> > 					promotion
> > ==========================================================================================
> > folio_mark_accessed()	FS/filemap/GUP	LRU list activation	On cache access and unmap
> > ==========================================================================================
> > PTE A bit via		Reclaim:LRU	LRU list activation,	During memory pressure
> > rmap walk				deactivation/demotion
> > ==========================================================================================
> > PTE A bit via		Reclaim:MGLRU	LRU list activation,	- During memory pressure
> > rmap walk and process			deactivation/demotion	- Continuous sampling (configurable)
> > pgtable walk							  for workingset reporting
> > ==========================================================================================
> > PTE A bit via		DAMON		LRU activation,
> > rmap walk				hot page promotion,
> > 					demotion etc  
> 
> For virtual address spaces monitoring mode, DAMON uses PTE A bit via pgtable
> walk.
> 
> Its activation and frequency are basically set as the user requests.  Activation
> can be set to be reactive to memory pressure like events (using watermarks).
> Frequency can be auto-tuned for pursuing access events per snapshot ratio.

Thanks.  I've added that (in very brief form) to the table in my slides.


> > SJ has proposed perhaps extending Damon as a possible interface layer. I am
> > yet to understand how that works in cases where regions do not provide
> > a compact representation due to lack of contiguity in the hotness.
> > An example usecase is hypervisor wanting to migrate data under unaware,
> > cheap VMs.  After a system has been running for a while (particularly with hot
> > pages being migrated, swap etc) the hotness map looks much like noise.  
> 
> Similar concerns for DAMON's region abstraction were raised for physical
> address space monitoring, because there is no cautious effort for making hot
> pages gathered together (or, locality).
> 
> I'd argue there is no cautious effort to make temperature be spread, though.
> As a result, we can expect a level of uncautious bias, and that matches with my
> experience from DAMON use cases in production environments so far.

Whilst I'm not in a position to share the data, as it's not mine :( I've
seen graphs that show that for at least some use cases, even if we have some
contiguity of hotness in the VA space, it looks like noise in PA.  So
I think this is a case of 'mileage may vary'. DAMON works great sometimes, but
sometimes the spread of access statistics happens to be wrong.

> 
> Also, in practice, DAMON regions are used in combination with other
> information.  For example, DAMON-based reclaim checks the PTE A bit of each page
> in a DAMON-suggested cold memory region to make the final decision about whether
> to reclaim it or not, like MADV_PAGEOUT does.

Makes sense.  The MADV_PAGEOUT case was one of the motivators for the
mixing-methods suggestion.  Here it's kind of DAMON + dense A bit checking (on
candidate pages).
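
To make the "coarse monitor + dense A bit check on candidates" combination
concrete, here is a minimal sketch. It is not DAMON or DAMON-based reclaim
code; the region tuples, the pages_to_reclaim() helper and the accessed()
callback are all hypothetical stand-ins for "regions suggested as cold" and
"test the A bit of this page".

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical representation of a candidate region: [start_pfn, end_pfn)
Region = Tuple[int, int]


def pages_to_reclaim(cold_regions: Iterable[Region],
                     accessed: Callable[[int], bool]) -> List[int]:
    """Return pages inside candidate cold regions whose access bit is clear."""
    victims = []
    for start, end in cold_regions:
        for pfn in range(start, end):
            # Dense per-page check, but only over the candidate regions,
            # so the cost scales with the candidates and not with all memory.
            if not accessed(pfn):
                victims.append(pfn)
    return victims


# Toy data: the coarse monitor reports two cold regions, but pages 3 and 12
# have been touched since those regions were classified as cold.
recently_accessed = {3, 12}
cold = [(0, 5), (10, 14)]
print(pages_to_reclaim(cold, lambda pfn: pfn in recently_accessed))
```

The point of the split is that the cheap, approximate source narrows the
search and the expensive, precise source only runs on what is left.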

> 
> That is, yes, I agree DAMON's region abstraction is maybe not a good way to
> find a perfect answer to some questions, such as finding the N-th hottest single
> page.  And it has much room to improve.  Nevertheless, even the DAMON of today
> can give good enough best-effort answers to questions that are practical for
> some cases, such as finding regions that may contain the N hottest/coldest
> pages, while keeping the monitoring overhead fixed as users ask.
> 
> Also, please note that there is no reason to restrict DAMON to always use
> regions abstraction.  For different use-cases and situation, DAMON will be open
> to be extended to use new abstractions.  DAMON aims not to be a subsystem for
> DAMON regions concept but data access monitoring for practical efficiency, and
> continue random evolution for given environments.

Absolutely understood. In my current thinking DAMON sits at a particular layer
in the stack and there may be one more abstraction on top of it (e.g. a list
of hot/cold pages). It's equally possible that the layers may fuse and it
becomes an aspect of DAMON.

> 
> > 
> > Now for the "there be monsters bit"...
> > ---------------------------------------
> > 
> > - Stability of hotness matters and is hard to establish.
> >   Predict a page will remain hot - various heuristics.
> > 	a) It is hot, probably stays so? (super hot!)
> > 	   Sometimes enough to be detected as hot once,
> > 	   often not.
> > 	b) It has been hot a while, probably stays so.
> > 	   Check this hot list against previous hot list,
> > 	   entries in both needed to promote.
> > 	   This has a problem if hotlist is small compared to
> > 	   total count of hot pages.  Say list is 1%, 20% actually
> > 	   hot, low chance of repeats even in hot pages.
> > 	c) It is hot, let's monitor a while before doing anything.
> > 	   Measurement technique may change. Maybe cheaper
> > 	   to monitor 'candidate' pages than all pages
> > 	   e.g. CXL HMU gives 1000 pages, then we use access bit
> > 	        sampling to check they are at least accessed N times
> > 		in next second.
> > 	d) It was hot, We moved it. Did it stay hot?
> > 	   More useful to identify when we are thrashing and should
> > 	   just stop doing anything.  Too late to fix this one!  
> 
> DAMON is providing a sort of b) approach, aka DAMON regions' age, for finding
> both hot and cold regions.
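
To put rough numbers on the "1% list, 20% actually hot" concern in b): if
each hot list is assumed to be a uniform random sample of the truly hot pages
(a simplification, real trackers are biased towards the hottest pages), then
the chance that an entry of one list shows up again in the next list is about
list size / hot set size = 1% / 20% = 5%. A minimal simulation under that
assumption, with repeat_fraction() and its parameters being hypothetical
names for illustration only:

```python
import random


def repeat_fraction(total_pages: int, hot_frac: float, list_frac: float,
                    seed: int = 0) -> float:
    """Fraction of one hot list that also appears in the next hot list.

    Assumes both lists are uniform random samples of the truly hot pages.
    """
    rng = random.Random(seed)
    hot_pages = range(round(total_pages * hot_frac))
    list_len = round(total_pages * list_frac)
    previous = set(rng.sample(hot_pages, list_len))
    current = set(rng.sample(hot_pages, list_len))
    return len(previous & current) / list_len


# 100,000 pages, 20% of them hot, hot list holds 1% of all pages.
frac = repeat_fraction(100_000, hot_frac=0.20, list_frac=0.01)
print(f"repeats between consecutive lists: {frac:.1%}")
```

So requiring a page to be in two consecutive lists would promote only a small
slice of the list per interval unless the list is sized closer to the hot set.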
> 
> > - Some data should be considered hot even when not in use (e.g. stack)  
> 
> DAMOS filters are for this kind of exception, and the DAMON kernel API is
> flexible enough to let callers directly manipulate the regions information
> based on their special knowledge.  We can further optimize the interface for
> easier use, of course.

Nice.

> 
> > - Usecases interfere. So it can't just be a broadcast mode
> >   where hotness information is sent to all users.
> > - When to stop, start migration / tracking?
> > 	a) Detecting bad decisions. Enough bad decisions, better to
> > 	   do nothing?
> >  	b) Metadata beyond the counts is useful
> > 	   https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> > 	   Promotion algorithms can need aggregate statistics for a memory 
> > 	   device to decide how much to move.  
> 
> DAMOS quotas goal feature is a sort of a feature for this question.  It allows
> users to set target metric and value, and tune the aggressiveness.  For
> promotions and demotions, I suggested using upper tier utilization and free
> ratio as such possible goal metric, and gonna post an implementation for that
> soon.

Those are certainly good metrics to consider, but I think we definitely also
need a metric around how beneficial the moves being made are.

That matters more on the promotion path, because promotion interrupts access to
hot data and so will cause a temporary drop in performance / latency spike.
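
One possible shape for such a metric, purely as a sketch: after promoting a
batch, check in a follow-up window whether the moved pages were still
accessed enough to count as hot, and back off when too few were. The
PromotionRecord type, the access counter, and both thresholds below are
assumptions for illustration, not an existing DAMOS quota goal metric.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PromotionRecord:
    pfn: int
    # Accesses seen in a follow-up window after the move (hypothetical counter).
    accesses_after_move: int


def promotion_benefit(records: Iterable[PromotionRecord],
                      hot_threshold: int) -> float:
    """Fraction of promoted pages that were still hot after being moved."""
    records = list(records)
    if not records:
        return 0.0
    still_hot = sum(1 for r in records if r.accesses_after_move >= hot_threshold)
    return still_hot / len(records)


def should_keep_promoting(benefit: float, min_benefit: float = 0.5) -> bool:
    """Back off when too few moves paid off (likely thrashing)."""
    return benefit >= min_benefit


batch = [PromotionRecord(pfn, n) for pfn, n in [(1, 10), (2, 0), (3, 7), (4, 1)]]
benefit = promotion_benefit(batch, hot_threshold=5)
print(benefit, should_keep_promoting(benefit))
```

This only captures "did it stay hot"; it says nothing about the latency cost
of the move itself, which would need its own measurement.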

> 
> > 
> > As noted above, this may well overlap with other sessions.
> > One outcome of the discussion so far is to highlight what I think many
> > already knew.  This is hard!  
> 
> Indeed.  Keeping more people on the same page is important and difficult.
> Thank you for your effort again, and looking forward to discussing in more depth!
>

I'm not sure we'll succeed.  This may well be a wild west situation for a while
yet, but hopefully we can slowly converge or at least build some common
parts.

Jonathan

p.s. Heathrow disruption means I'm crossing my fingers on actually getting to
Montreal.
 
> 
> Thanks,
> SJ
> 
> > 
> > Jonathan  



