* [Linux Memory Hotness and Promotion] Notes from November 20, 2025
@ 2025-11-24 3:04 David Rientjes
2025-11-24 4:05 ` Bharata B Rao
0 siblings, 1 reply; 2+ messages in thread
From: David Rientjes @ 2025-11-24 3:04 UTC (permalink / raw)
To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
Zi Yan
Cc: linux-mm
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, November 20. Thanks to everybody who was
involved!
These notes are intended to bring people up to speed who could not attend
the call as well as keep the conversation going in between meetings.
----->o-----
Bharata reported that he has a set of results for the upstream promotion
scenario; he posted these as a reply to his RFC v3. Any feedback on that
series, or proposed benchmarks to run, would be very useful. He was also
thinking about consolidating all the tunables in sysfs into a
subdirectory rather than having them in the parent MM directory. I
suggested these may also start out in debugfs until the APIs become more
stable.
Bharata was also planning on redoing the NUMAB2 support so that it's
cleaner and the page movement rate limiting and associated logic are
separated, which enables using faults as a source. He's also planning on
using folio_mark_accessed() as a source of hotness to cover promotion of
unmapped file folios. He'll be writing a dedicated microbenchmark for
testing this. He'll also be investigating additional benchmarks for the
overall series as a whole.
----->o-----
Jonathan Cameron asked what the general feel was about the memory
overhead: currently this tracking requires ~2GB per 1TB. Wei Xu noted
that Google is taking a similar approach but with one byte per page in
page flags. If just for promotion purposes, we likely don't need eight
bytes per page. Even NUMA Balancing does not use eight bytes per page.
Jonathan said it currently uses 33 bits per page so some shrinkage might
be possible. Wei said promotion still requires the per-pfn scan which can
be expensive.
Wei said there would be one data structure with the information so we can
do atomic updates on the hot metadata and then there is a much smaller
data structure that tracks which pages to promote.
Raghu noted that, in discussion with Bharata, it was pointed out that
tracking memory hotness here is only necessary for the low tier, since
that memory is the only viable set of pages to promote. Jonathan noted
that may be the majority of system memory. The metadata itself is only
stored in top tier memory, which is expensive.
----->o-----
We discussed the benchmarks that we should use for evaluation of all of
these approaches. SeongJae noted that he had no specific benchmark in
mind but we should discuss the access pattern the benchmark should have.
This should have some temporal access patterns but also have different
hotness in different locations of memory; secondly, the pattern should
change during runtime.
Jonathan said there's been a heavy reliance on memcached but that's not
ideal because it's too predictable; we actually need the opposite of this.
I noted that I've had some success running specjbb and redis workloads.
Redis is interesting because it does not always observe spatial locality.
Yiannis noted one of the challenges with specint is that the duration of
the benchmark itself is not long enough to assess optimal placement logic.
Wei agreed with this, the benchmark would need to run for a long time.
Yiannis further mentioned that these can be used to oversubscribe cores;
however, they induce pressure (and consume bandwidth).
----->o-----
Raghu updated on his patch series to use the LRU gen scan API, which
iterates through a single mm; this provides more control over the memory
that is being iterated. He was working through some issues in the patch
series and may need to reach out to Kinsey for discussion on klruscand.
Jonathan also provided some feedback on the mailing list.
Raghu asked Kinsey if it would be possible to have an API that scanned a
single mm; Kinsey said yes, this was similar to what was being thought
about internally. Raghu said this would be useful for integration.
Wei asked Raghu if his series will integrate the scanning and promotion
together so that when a page is identified we can promote right away.
Raghu said this was implemented like NUMAB but does not happen after a
single access. There is a separate migration thread.
Jonathan asked if we necessarily care if we lose some information; if
there is a ton of memory to promote, we can't migrate everything, so do we
care if some hotness information is actually lost? Wei suggested that we
must have a mechanism for promoting the hottest pages, not just hot pages,
so some amount of history is required. Jonathan said that if everything
was insanely hot and we lose some information it would readily reappear
again. Raghu's patch series only uses a single bit from page flags; Wei
suggested extending this.
----->o-----
Next meeting will be on Thursday, December 4 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on Bharata's RFC v3 with new benchmarks and consolidation of
tunables
- continued discussion on memory overheads used to save the memory
hotness state and the list of promotion targets
- benchmarks to use as the industry standard beyond just memcached, such
as redis
- discuss a generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl;
otherwise, utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
per tier (slow and fast)
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
Please let me know if you'd like to propose additional topics for
discussion, thank you!
* Re: [Linux Memory Hotness and Promotion] Notes from November 20, 2025
2025-11-24 3:04 [Linux Memory Hotness and Promotion] Notes from November 20, 2025 David Rientjes
@ 2025-11-24 4:05 ` Bharata B Rao
0 siblings, 0 replies; 2+ messages in thread
From: Bharata B Rao @ 2025-11-24 4:05 UTC (permalink / raw)
To: David Rientjes, Davidlohr Bueso, Fan Ni, Gregory Price,
Jonathan Cameron, Joshua Hahn, Raghavendra K T, SeongJae Park,
Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan
Cc: linux-mm
On 24-Nov-25 8:34 AM, David Rientjes wrote:
> Hi everybody,
>
> Here are the notes from the last Linux Memory Hotness and Promotion call
> that happened on Thursday, November 20. Thanks to everybody who was
> involved!
>
> These notes are intended to bring people up to speed who could not attend
> the call as well as keep the conversation going in between meetings.
>
> ----->o-----
> Bharata reported that he has a set of results for the upstream promotion
> scenario; he posted these as a reply to his RFC v3. Any feedback on that
> series, or proposed benchmarks to run, would be very useful. He was also
> thinking about consolidating all the tunables in sysfs into a
> subdirectory rather than having them in the parent MM directory. I
> suggested these may also start out in debugfs until the APIs become more
> stable.
Sure.
>
> Bharata was also planning on redoing the NUMAB2 support so that it's
> cleaner and the page movement rate limiting and associated logic are
> separated, which enables using faults as a source. He's also planning on
> using folio_mark_accessed() as a source of hotness to cover promotion of
> unmapped file folios. He'll be writing a dedicated microbenchmark for
> testing this. He'll also be investigating additional benchmarks for the
> overall series as a whole.
>
> ----->o-----
> Jonathan Cameron asked what the general feel was about the memory
> overhead: currently this tracking requires ~2GB per 1TB. Wei Xu noted
> that Google is taking a similar approach but with one byte per page in
> page flags. If just for promotion purposes, we likely don't need eight
> bytes per page. Even NUMA Balancing does not use eight bytes per page.
> Jonathan said it currently uses 33 bits per page so some shrinkage might
> be possible. Wei said promotion still requires the per-pfn scan which can
> be expensive.
NUMAB does the promotion at detection time and hence has no need to
store information like the target NID. When you want to do batched
migration, separated from the detection mechanism (preferably using a
dedicated migration thread), additional space is needed to store all the
data required for hot page promotion.
Here is the layout I am currently using:
NID - 10 bits, the maximum that CONFIG_NODES_SHIFT can have.
frequency - 3 bits, to capture 8 different access counts.
time - 19 bits, to accommodate an 8.73-minute time window at 1000 Hz.
ready - 1 bit, to mark the page as ready for migration.
I can probably fit everything within 32 bits, as NID can be reduced by
at least 2 bits. However, I used "unsigned long" so that atomic bit
operations to update the hotness parameters work seamlessly with the
machine word size.
Regarding per-PFN scanning, I am currently scanning PFNs
mem_section-wise, and hence it should be possible to completely skip
those sections that haven't been marked as containing hot pages. I will
try to add this optimization in my next iteration. Despite this, whether
the scanning thread (kmigrated) will get to all the hot pages in time is
still an open question. At least in my pathological testcases, which
generate a lot of hot pages, it has not turned out to be a problem.
>
> Wei said there would be one data structure with the information so we can
> do atomic updates on the hot metadata and then there is a much smaller
> data structure that tracks which pages to promote.
Yes, that's what I had in my previous version of the patchset, with a
hash and a heap. However, these were the issues:
- Multiple data structures mean a larger space requirement.
- There is a cost to keeping both data structures in sync, since
frequent updates are needed for the hotness parameters.
- Dynamic allocation of hot page records is needed, which is troublesome
for large numbers of hot page records.
Wei - I know you have often given input on efficient data organization,
but it would help if you could be more verbose about what kind of data
organization would be optimal given the issues I have highlighted.
>
> Raghu noted that, in discussion with Bharata, it was pointed out that
> tracking memory hotness here is only necessary for the low tier, since
> that memory is the only viable set of pages to promote. Jonathan noted
> that may be the majority of system memory. The metadata itself is only
> stored in top tier memory, which is expensive.
>
> ----->o-----
> We discussed the benchmarks that we should use for evaluation of all of
> these approaches. SeongJae noted that he had no specific benchmark in
> mind but we should discuss the access pattern the benchmark should have.
> This should have some temporal access patterns but also have different
> hotness in different locations of memory; secondly, the pattern should
> change during runtime.
>
> Jonathan said there's been a heavy reliance on memcached but that's not
> ideal because it's too predictable; we actually need the opposite of this.
> I noted that I've had some success running specjbb and redis workloads.
> Redis is interesting because it does not always observe spatial locality.
>
> Yiannis noted one of the challenges with specint is that the duration of
> the benchmark itself is not long enough to assess optimal placement logic.
> Wei agreed with this, the benchmark would need to run for a long time.
> Yiannis further mentioned that these can be used to oversubscribe cores;
> however, they induce pressure (and consume bandwidth).
>
> ----->o-----
> Raghu updated on his patch series to use the LRU gen scan API, which
> iterates through a single mm; this provides more control over the memory
> that is being iterated. He was working through some issues in the patch
> series and may need to reach out to Kinsey for discussion on klruscand.
> Jonathan also provided some feedback on the mailing list.
>
> Raghu asked Kinsey if it would be possible to have an API that scanned a
> single mm; Kinsey said yes, this was similar to what was being thought
> about internally. Raghu said this would be useful for integration.
>
> Wei asked Raghu if his series will integrate the scanning and promotion
> together so that when a page is identified we can promote right away.
> Raghu said this was implemented like NUMAB but does not happen after a
> single access. There is a separate migration thread.
>
> Jonathan asked if we necessarily care if we lose some information; if
> there is a ton of memory to promote, we can't migrate everything, so do we
> care if some hotness information is actually lost? Wei suggested that we
> must have a mechanism for promoting the hottest pages, not just hot pages,
> so some amount of history is required. Jonathan said that if everything
> was insanely hot and we lose some information it would readily reappear
> again. Raghu's patch series only uses a single bit from page flags; Wei
> suggested extending this.
In my previous approach (which had a list of the hottest pages in the
form of a heap) and also in the current approach (kmigrated scanning for
hot PFNs), the migrator thread can use dynamic frequency and time
threshold values to filter the hottest pages out of the hot pages.
Regards,
Bharata.