From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: SeongJae Park <sj@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>,
Bharata B Rao <bharata@amd.com>,
<lsf-pc@lists.linux-foundation.org>, <linux-mm@kvack.org>,
Michal Hocko <mhocko@suse.com>,
Dan Williams <dan.j.williams@intel.com>, <linuxarm@huawei.com>,
Matthew Wilcox <willy@infradead.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Gregory Price <gourry@gourry.net>
Subject: Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
Date: Fri, 21 Mar 2025 15:30:44 +0000
Message-ID: <20250321153044.000017aa@huawei.com>
In-Reply-To: <20250319235029.54378-1-sj@kernel.org>
> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ======================================================================================================
> > Source                  Subsystem       Consumption               Activation/Frequency
> > ======================================================================================================
> > PROT_NONE faults        NUMAB           NUMAB=1 locality based    While task is running,
> > via process pgtable                     balancing                 rate varies on observed
> > walk                                    NUMAB=2 hot page          locality and sysctl knobs.
> >                                         promotion
> > ======================================================================================================
> > folio_mark_accessed()   FS/filemap/GUP  LRU list activation       On cache access and unmap
> > ======================================================================================================
> > PTE A bit via           Reclaim:LRU     LRU list activation,      During memory pressure
> > rmap walk                               deactivation/demotion
> > ======================================================================================================
> > PTE A bit via           Reclaim:MGLRU   LRU list activation,      - During memory pressure
> > rmap walk and process                   deactivation/demotion     - Continuous sampling (configurable)
> > pgtable walk                                                        for workingset reporting
> > ======================================================================================================
> > PTE A bit via           DAMON           LRU activation,
> > rmap walk                               hot page promotion,
> >                                         demotion etc
>
> For virtual address space monitoring mode, DAMON uses the PTE A bit via a
> pgtable walk.
>
> Its activation and frequency are basically set as the user requests. Activation
> can be set to react to memory-pressure-like events (using watermarks).
> Frequency can be auto-tuned to pursue a target access-events-per-snapshot ratio.
Thanks. I've added that (in very brief form) to the table in my slides.
> > SJ has proposed perhaps extending DAMON as a possible interface layer. I am
> > yet to understand how that works in cases where regions do not provide
> > a compact representation due to lack of contiguity in the hotness.
> > An example use case is a hypervisor wanting to migrate data under unaware,
> > cheap VMs. After a system has been running for a while (particularly with hot
> > pages being migrated, swap etc) the hotness map looks much like noise.
>
> Similar concerns for DAMON's region abstraction were raised for physical
> address space monitoring, because there is no deliberate effort to keep hot
> pages gathered together (that is, locality).
>
> I'd argue there is no deliberate effort to spread temperature either, though.
> As a result, we can expect a level of unintentional bias, and that matches my
> experience from DAMON use cases in product environments so far.
Whilst I'm not in a position to share the data, as it's not mine :( I've
seen graphs that show that for at least some use cases, even if we have some
contiguity of hotness in the VA space, it looks like noise in PA. So
I think this is a case of 'mileage may vary'. DAMON works great sometimes, but
sometimes the spread of access statistics happens to be wrong.
>
> Also, in practice, DAMON regions are used in combination with other
> information. For example, DAMON-based reclaim checks the PTE A bit of each page
> in a DAMON-suggested cold memory region to make the final decision about whether
> or not to reclaim it, like MADV_PAGEOUT does.
Makes sense. The MADV_PAGEOUT case was one of the motivators for the suggestion
of mixing methods. Here it's kind of DAMON + dense A bit checking (on
candidate pages).
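
To make that combination concrete, here is a minimal Python sketch, not
DAMON's actual code. Region, reclaim_candidates() and the page_accessed
callback are hypothetical names used only for illustration; the callback
stands in for whatever per-page A bit check is available. A region monitor
proposes cold regions cheaply, and only the pages inside those regions get
the dense check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

PAGE_SIZE = 4096  # assumed base page size


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a monitored address range."""
    start: int        # inclusive byte address, page aligned
    end: int          # exclusive byte address, page aligned
    nr_accesses: int  # coarse access count reported for the whole region


def reclaim_candidates(regions: Iterable[Region],
                       page_accessed: Callable[[int], bool],
                       cold_threshold: int = 0) -> List[int]:
    """Return page frame numbers that pass both stages."""
    candidates = []
    for region in regions:
        # Stage 1: cheap, region granularity. Skip anything not reported cold.
        if region.nr_accesses > cold_threshold:
            continue
        # Stage 2: dense per-page check, only inside candidate regions.
        for pfn in range(region.start // PAGE_SIZE, region.end // PAGE_SIZE):
            if not page_accessed(pfn):
                candidates.append(pfn)
    return candidates
```

For example, given one cold four-page region in which only page 1 was
touched, reclaim_candidates([Region(0, 4 * PAGE_SIZE, 0)], lambda pfn: pfn == 1)
selects pages 0, 2 and 3. The cost of the dense stage scales with the size
of the candidate regions rather than with total memory.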
>
> That is, yes, I agree DAMON's region abstraction is maybe not a good way to
> find a perfect answer to some questions, such as finding the N-th hottest single
> page. And it has much room to improve. Nevertheless, even the DAMON of today can
> give good enough best-effort answers to questions that are practical for some
> cases, such as finding regions that may contain the N hottest/coldest pages,
> while keeping the monitoring overhead fixed as users ask.
>
> Also, please note that there is no reason to restrict DAMON to always use
> the regions abstraction. For different use cases and situations, DAMON will be
> open to being extended to use new abstractions. DAMON aims to be a subsystem not
> for the DAMON regions concept but for data access monitoring with practical
> efficiency, and to continue random evolution for given environments.
Absolutely understood. In my current thinking DAMON sits at a particular layer
in the stack and there may be one more abstraction on top of it (e.g. a list
of hot/cold pages). It is equally possible that the layers may fuse and it
becomes an aspect of DAMON.
>
> >
> > Now for the "there be monsters bit"...
> > ---------------------------------------
> >
> > - Stability of hotness matters and is hard to establish.
> > Predict a page will remain hot - various heuristics.
> > a) It is hot, probably stays so? (super hot!)
> > Sometimes enough to be detected as hot once,
> > often not.
> > b) It has been hot a while, probably stays so.
> > Check this hot list against previous hot list,
> > entries in both needed to promote.
> > This has a problem if hotlist is small compared to
> > total count of hot pages. Say list is 1%, 20% actually
> > hot, low chance of repeats even in hot pages.
> > c) It is hot, let's monitor a while before doing anything.
> > Measurement technique may change. Maybe cheaper
> > to monitor 'candidate' pages than all pages
> > e.g. CXL HMU gives 1000 pages, then we use access bit
> > sampling to check they are at least accessed N times
> > in next second.
> > d) It was hot, we moved it. Did it stay hot?
> > More useful to identify when we are thrashing and should
> > just stop doing anything. Too late to fix this one!
>
> DAMON is providing a sort of b) approach, aka DAMON regions' age, for finding
> both hot and cold regions.
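
As a rough back-of-envelope for the list size problem in b) of the quoted
list: assuming each hot list is an independent, uniform random sample of the
truly hot pages (a simplifying assumption, real trackers are not uniform),
the expected repeat rate is just the ratio of list size to hot set size. The
function names below are made up for illustration.

```python
def repeat_fraction(list_fraction: float, hot_fraction: float) -> float:
    """Expected fraction of one hot list that reappears in the next list.

    Both arguments are fractions of total pages. Each list is assumed to be
    an independent uniform sample drawn from the hot pages only.
    """
    if not 0.0 < list_fraction <= hot_fraction <= 1.0:
        raise ValueError("need 0 < list_fraction <= hot_fraction <= 1")
    return list_fraction / hot_fraction


def both_lists_probability(list_fraction: float, hot_fraction: float) -> float:
    """Chance that a given hot page lands in two consecutive lists."""
    return repeat_fraction(list_fraction, hot_fraction) ** 2
```

With the numbers from the list (1% list, 20% actually hot) about 5% of the
entries repeat, and any given hot page shows up in both lists with
probability of about 0.25%. Under this model the 'seen twice' filter
rejects most genuinely hot pages unless the list is a large share of the
hot set.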
>
> > - Some data should be considered hot even when not in use (e.g. stack)
>
> DAMOS filters are for this kind of exception, and the DAMON kernel API is
> flexible enough to let callers directly manipulate the regions information
> based on their special knowledge. We can further optimize the interface for
> easier use, of course.
Nice.
>
> > - Usecases interfere. So it can't just be a broadcast mode
> > where hotness information is sent to all users.
> > - When to stop, start migration / tracking?
> > a) Detecting bad decisions. Enough bad decisions, better to
> > do nothing?
> > b) Metadata beyond the counts is useful
> > https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> > Promotion algorithms can need aggregate statistics for a memory
> > device to decide how much to move.
>
> The DAMOS quotas goal feature is a sort of feature for this question. It allows
> users to set a target metric and value, and tune the aggressiveness. For
> promotions and demotions, I suggested using upper tier utilization and free
> ratio as possible goal metrics, and gonna post an implementation for that
> soon.
Those are certainly good metrics to consider, but I think we definitely also
need a metric around how beneficial the moves being made are.
That matters more on the promotion path, because that interrupts access to
hot data and so will cause a temporary drop in performance / latency spike.
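
One very simple form such a metric could take, as a Python sketch: the
fraction of promoted pages still seen as hot some interval after the move.
This is an illustration only, not an existing DAMOS goal metric. The
function names and the 0.5 threshold are assumptions, and a real metric
would also need to weigh the cost of the move itself.

```python
from typing import Iterable, Optional


def promotion_benefit(promoted: Iterable[int],
                      hot_after_move: Iterable[int]) -> Optional[float]:
    """Fraction of promoted pages that were observed hot again afterwards.

    Returns None when nothing was promoted, so that callers can tell
    'no data' apart from 'every move was wasted'.
    """
    moved = set(promoted)
    if not moved:
        return None
    return len(moved & set(hot_after_move)) / len(moved)


def should_back_off(benefit: Optional[float], min_benefit: float = 0.5) -> bool:
    """True when enough moves were wasted that doing nothing looks better."""
    return benefit is not None and benefit < min_benefit
```

If pages 1 to 4 were promoted and only 2 and 4 are hot afterwards, the
benefit is 0.5, which sits right at the assumed threshold and does not yet
trigger a back off.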
>
> >
> > As noted above, this may well overlap with other sessions.
> > One outcome of the discussion so far is to highlight what I think many
> > already knew. This is hard!
>
> Indeed. Keeping more people on the same page is important and difficult.
> Thank you for your effort again, and looking forward to discussing in more depth!
>
I'm not sure we'll succeed. This may well be a wild west situation for a while
yet, but hopefully we can slowly converge or at least build some common
parts.
Jonathan
p.s. Heathrow disruption means I'm crossing my fingers on actually getting to
Montreal.
>
> Thanks,
> SJ
>
> >
> > Jonathan