[LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
@ 2025-01-23 10:57 Raghavendra K T
  2025-01-23 18:20 ` SeongJae Park
                   ` (4 more replies)
  0 siblings, 5 replies; 33+ messages in thread
From: Raghavendra K T @ 2025-01-23 10:57 UTC (permalink / raw)
  To: linux-mm, akpm, lsf-pc, bharata
  Cc: gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes,
	feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov,
	mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett,
	peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen,
	raghavendra.kt

Bharata and I would like to propose the following topic for LSFMM.

Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.

In the Linux kernel, hot page information can potentially be obtained from
multiple sources:

a. PROT_NONE faults (NUMA balancing)
b. PTE Access bit (LRU scanning)
c. Hardware provided page hotness info (like AMD IBS)

This information is further used to migrate (or promote) pages from slow memory
tier to top tier to increase performance.

In the current hot page promotion mechanism, all the activities including the
process address space scanning, NUMA hint fault handling and page migration are
performed in the process context. i.e., scanning overhead is borne by the
applications.

I had recently posted a patch [1] to improve this in the context of slow-tier
page promotion. Here, Scanning is done by a global kernel thread which routinely
scans all the processes' address spaces and checks for accesses by reading the
PTE A bit. The hot pages thus identified are maintained in list and subsequently
are promoted to a default top-tier node. Thus, the approach pushes overhead of
scanning, NUMA hint faults and migrations off from process context.

The topic was presented in the MM alignment session hosted by David Rientjes [2].
The topic also finds a mention in S J Park's LSFMM proposal [3].

Here is the list of potential discussion points:
1. Other improvements and enhancements to PTE A bit scanning approach. Use of
multiple kernel threads, throttling improvements, promotion policies, per-process
opt-in via prctl, virtual vs physical address based scanning, tuning hot page
detection algorithm etc.

2. Possibility of maintaining single source of truth for page hotness that would
maintain hot page information from multiple sources and let other sub-systems
use that info.

3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
promotion. Bharata had posted an RFC [4] on this a while back.

4. Overlap with DAMON and potential reuse.

Links:

[1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
[2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
[3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
[4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
@ 2025-01-23 18:20 ` SeongJae Park
  2025-01-24  8:54   ` Raghavendra K T
  2025-01-24  5:53 ` Hyeonggon Yoo
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: SeongJae Park @ 2025-01-23 18:20 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: SeongJae Park, linux-mm, akpm, lsf-pc, bharata, gourry,
	nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang,
	kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

Hi Raghavendra,

On Thu, 23 Jan 2025 10:57:21 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote:

> Bharata and I would like to propose the following topic for LSFMM.
> 
> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.

Thank you for proposing this.  I'm interested in this!

>  
> In the Linux kernel, hot page information can potentially be obtained from
> multiple sources:
>  
> a. PROT_NONE faults (NUMA balancing)
> b. PTE Access bit (LRU scanning)
> c. Hardware provided page hotness info (like AMD IBS)
>  
> This information is further used to migrate (or promote) pages from slow memory
> tier to top tier to increase performance.
> 
> In the current hot page promotion mechanism, all the activities including the
> process address space scanning, NUMA hint fault handling and page migration are
> performed in the process context. i.e., scanning overhead is borne by the
> applications.

I understand that you're mentioning about only fully in-kernel solutions.  Just
for readers' context, SK hynix' HMSDK cpacity expansion[1] does the works in
two asynchronous threads (one for promotion and the other for demotion), using
DAMON in kernel as the core worker, and controlling DAMON from the user-space.

>  
> I had recently posted a patch [1] to improve this in the context of slow-tier
> page promotion. Here, Scanning is done by a global kernel thread which routinely
> scans all the processes' address spaces and checks for accesses by reading the
> PTE A bit. The hot pages thus identified are maintained in list and subsequently
> are promoted to a default top-tier node. Thus, the approach pushes overhead of
> scanning, NUMA hint faults and migrations off from process context.
> 
> The topic was presented in the MM alignment session hosted by David Rientjes [2].
> The topic also finds a mention in S J Park's LSFMM proposal [3].
>  
> Here is the list of potential discussion points:

Great discussion points, thank you.  I'm adding how DAMON tries to deal with
some of the points below.

> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
> multiple kernel threads,

DAMON allows use of multiple kernel threads for different monitoring scopes.
There were also ideas for splitting the monitoring part and migration-like
system operation part to different threads.

> throttling improvements,

DAMON provides features called "adaptive regions adjustment" and "DAMOS quotas"
for throttling overheads from access monitoring and migration-like system
operation actions.

> promotion policies,

DAMON's access-aware system operation feature (DAMOS) allows setting this kind
of system operation policy based on access pattern and additional information
including page level information such as anonymousness, belonging cgroup, page
granular A bit recheck.

> per-process opt-in via prctl,

DAMON allows making the system operation action to pages belonging to specific
cgroups using a feature called DAMOS filters.  It is not integrated with prctl,
and would work in cgroups scope, but may be able to be used.  Extending DAMOS
filters for belonging processes may also be doable.

> virtual vs physical address based scanning,

DAMON supports both virtual and physical address spaces monitoring.  DAMON's
pages migration is currently not supported for virtual address spaces, though I
believe adding the support is not difficult.

I'm bit in favor or physical address space, probably because I'm biased to what
DAMON currently supports, but also due to unmapped pages promotion like edge
cases.

> tuning hot page detection algorithm etc.

DAMON requires users manually tuning some important paramters for hot pages
detection.  We recently provided a tuning guide[2], and working on making it
automated.  I believe the essential problem is similar to many use cases
regardless of the type of low level access check primitives, so want to learn
if the tuning automation idea can be generally used.

> 
> 2. Possibility of maintaining single source of truth for page hotness that would
> maintain hot page information from multiple sources and let other sub-systems
> use that info.

DAMON is currently using the PTE A bit as the essential access check primitive.
We designed DAMON to be able to be extended for other access check primitives
such as page faults and AMD IBS like h/w features.  We are now planning to do
such extension, though still in the very early low-priority planning stage.
DAMON also provides the kernel API.

> 
> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
> promotion. Bharata had posted an RFC [4] on this a while back.

Maybe CXL Hotness Monitoring Unit could also be an interesting thing to discuss
together.

> 
> 4. Overlap with DAMON and potential reuse.

I confess that it seems some of the works might overlap with DAMON to my biased
eyes.  I'm looking forward to attend this session, to make it less biased and
more aligned with people :)

>  
> Links:
> 
> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/

Again, thank you for proposing this topic, and I wish to see you at Montreal!

[1] https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion
[2] https://lkml.kernel.org/r/20250110185232.54907-1-sj@kernel.org

Thanks,
SJ

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-23 18:20 ` SeongJae Park
@ 2025-01-24  8:54   ` Raghavendra K T
  2025-01-24 18:05     ` Jonathan Cameron
  0 siblings, 1 reply; 33+ messages in thread
From: Raghavendra K T @ 2025-01-24  8:54 UTC (permalink / raw)
  To: SeongJae Park
  Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd,
	ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On 1/23/2025 11:50 PM, SeongJae Park wrote:
> Hi Raghavendra,
> 
> On Thu, 23 Jan 2025 10:57:21 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote:
> 
>> Bharata and I would like to propose the following topic for LSFMM.
>>
>> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
> 
> Thank you for proposing this.  I'm interested in this!
> 

Thank you.

[...]

>> virtual vs physical address based scanning,
> 
> DAMON supports both virtual and physical address spaces monitoring.  DAMON's
> pages migration is currently not supported for virtual address spaces, though I
> believe adding the support is not difficult.
> 

Will check this.

[...]

>>
>> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
>> promotion. Bharata had posted an RFC [4] on this a while back.
> 
> Maybe CXL Hotness Monitoring Unit could also be an interesting thing to discuss
> together.
> 

Definitely.

>>
>> 4. Overlap with DAMON and potential reuse.
> 
> I confess that it seems some of the works might overlap with DAMON to my biased
> eyes.  I'm looking forward to attend this session, to make it less biased and
> more aligned with people :)
> 

Yes. Agree.

>>   
>> Links:
>>
>> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
>> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
>> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
>> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/
> 
> Again, thank you for proposing this topic, and I wish to see you at Montreal!
> 

Same here .. Thank you :)

- Raghu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-24  8:54   ` Raghavendra K T
@ 2025-01-24 18:05     ` Jonathan Cameron
  0 siblings, 0 replies; 33+ messages in thread
From: Jonathan Cameron @ 2025-01-24 18:05 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: SeongJae Park, linux-mm, akpm, lsf-pc, bharata, gourry,
	nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang,
	kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On Fri, 24 Jan 2025 14:24:40 +0530
Raghavendra K T <raghavendra.kt@amd.com> wrote:

> On 1/23/2025 11:50 PM, SeongJae Park wrote:
> > Hi Raghavendra,
> > 
> > On Thu, 23 Jan 2025 10:57:21 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote:
> >   
> >> Bharata and I would like to propose the following topic for LSFMM.
> >>
> >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.  
> > 
> > Thank you for proposing this.  I'm interested in this!
> >   
> 
> Thank you.
> 
> [...]
> 
> >> virtual vs physical address based scanning,  
> > 
> > DAMON supports both virtual and physical address spaces monitoring.  DAMON's
> > pages migration is currently not supported for virtual address spaces, though I
> > believe adding the support is not difficult.
> >   
> 
> Will check this.
> 
> [...]
> 
> >>
> >> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
> >> promotion. Bharata had posted an RFC [4] on this a while back.  
> > 
> > Maybe CXL Hotness Monitoring Unit could also be an interesting thing to discuss
> > together.
> >   
> 
> Definitely.

Thanks for the shout out SJ.  I'm definitely interested in this topic from the
angle of the hardware hotness monitoring units (roughly speaking ones that give
you a list of hot pages - typically by PA).  Making sure any solution works
for those is perhaps key for the longer term.  Not entirely clear to me that
we are ready yet for data aggregation solution, or mixed techniques
but definitely interesting to brainstorm.

Until now my main focus has been on getting infrastructure in place to work out the
lower levels of using a hardware hotness monitoring unit (using QEMU for now
with TCG plugins to get the access data).  In general, not stuff I suspect
anyone will want to discuss at LSF/MM, but perhaps providing insights into how
good data we might get could be.

Unless the hardware units people build are very capable (and expensive)
the chances are we will have to deal with accuracy limitations that I
suspect the users of this data for migration etc do not want to explicitly
deal with.  If our tracking is coming from multiple sources we need to
deal with differences in, and potentially estimation of accuracy.
Anything efficient is going to have some accuracy issues (regions for Damon,
access bit scanning frequency for your technique, sampling for page fault
techniques, data in wrong place - access bit's will tell you to promote
stuff that is always in cache - arguably a waste of time etc)
I've no idea yet how painful this is going to be.  Using the different
sources to overcome limitations on each one is interesting but likely
to be complex and tricky to generalize.  Maybe access bit scanning
to detect hotish large scale regions, then a hardware tracker to separate
out 'hot' from 'warm'.  Sounds fun, but far form general!

Lots of problems to solve in this space. And when we have done that, there
is paravirtualizing hardware trackers / other methods, application
specific usage of the data (some apps will know better than the kernel and
will want this data, security / side channels etc).

For stretch goals there is even the fun question of hotness monitoring down
stream of interleave, particularly when it's scrambled and not a power of
2 ways.  Again, maybe not a general problem but will affect data biases.
How much of that we want to hide down in implementations below some general
'give me hot stuff' is an open question (I'm guessing hide almost everything
beyond controls on data bandwidth).

Jonathan

> 
> >>
> >> 4. Overlap with DAMON and potential reuse.  
> > 
> > I confess that it seems some of the works might overlap with DAMON to my biased
> > eyes.  I'm looking forward to attend this session, to make it less biased and
> > more aligned with people :)
> >   
> 
> Yes. Agree.
> 
> >>   
> >> Links:
> >>
> >> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> >> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
> >> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
> >> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/  
> > 
> > Again, thank you for proposing this topic, and I wish to see you at Montreal!
> >   
> 
> Same here .. Thank you :)
> 
> - Raghu
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
  2025-01-23 18:20 ` SeongJae Park
@ 2025-01-24  5:53 ` Hyeonggon Yoo
  2025-01-24  9:02   ` Raghavendra K T
  2025-02-06  3:14   ` Yuanchu Xie
  2025-01-26  2:27 ` Huang, Ying
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 33+ messages in thread
From: Hyeonggon Yoo @ 2025-01-24  5:53 UTC (permalink / raw)
  To: Raghavendra K T, linux-mm, akpm, lsf-pc, bharata
  Cc: kernel_team, 42.hyeyoo, gourry, nehagholkar, abhishekd,
	ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen, yuanchu



On 1/23/2025 7:57 PM, Raghavendra K T wrote:
> Bharata and I would like to propose the following topic for LSFMM.
> 
> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
>   
> In the Linux kernel, hot page information can potentially be obtained from
> multiple sources:
>   
> a. PROT_NONE faults (NUMA balancing)
> b. PTE Access bit (LRU scanning)
> c. Hardware provided page hotness info (like AMD IBS)
>   
> This information is further used to migrate (or promote) pages from slow memory
> tier to top tier to increase performance.
> 
> In the current hot page promotion mechanism, all the activities including the
> process address space scanning, NUMA hint fault handling and page migration are
> performed in the process context. i.e., scanning overhead is borne by the
> applications.
>  
> I had recently posted a patch [1] to improve this in the context of slow-tier
> page promotion. Here, Scanning is done by a global kernel thread which routinely
> scans all the processes' address spaces and checks for accesses by reading the
> PTE A bit. The hot pages thus identified are maintained in list and subsequently> are promoted to a default top-tier node. Thus, the approach pushes overhead of
> scanning, NUMA hint faults and migrations off from process context.
> 
> The topic was presented in the MM alignment session hosted by David Rientjes [2].
> The topic also finds a mention in S J Park's LSFMM proposal [3].
>   
> Here is the list of potential discussion points:
> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
> multiple kernel threads, throttling improvements, promotion policies, per-process
> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
> detection algorithm etc.

Yuanchu's MGLRU periodic aging series [1] seems quite relevant here,
you might want to look at it. adding Yuanchu to Cc.

By the way, do you have any reason why you'd prefer opt-in prctl
over per-memcg control?

[1] https://lore.kernel.org/all/20221214225123.2770216-1-yuanchu@google.com/
  
> 2. Possibility of maintaining single source of truth for page hotness that would
> maintain hot page information from multiple sources and let other sub-systems
> use that info.
> 
> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
> promotion. Bharata had posted an RFC [4] on this a while back.
> 
> 4. Overlap with DAMON and potential reuse.
>   
> Links:
> 
> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/
>   
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-24  5:53 ` Hyeonggon Yoo
@ 2025-01-24  9:02   ` Raghavendra K T
  2025-01-27  7:01     ` David Rientjes
  2025-02-06  3:14   ` Yuanchu Xie
  1 sibling, 1 reply; 33+ messages in thread
From: Raghavendra K T @ 2025-01-24  9:02 UTC (permalink / raw)
  To: Hyeonggon Yoo, linux-mm, akpm, lsf-pc, bharata
  Cc: kernel_team, 42.hyeyoo, gourry, nehagholkar, abhishekd,
	ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen, yuanchu

On 1/24/2025 11:23 AM, Hyeonggon Yoo wrote:
> 
> 
> On 1/23/2025 7:57 PM, Raghavendra K T wrote:
>> Bharata and I would like to propose the following topic for LSFMM.
>>
>> Topic: Overhauling hot page detection and promotion based on PTE A bit 
>> scanning.
[...]
>> Here is the list of potential discussion points:
>> 1. Other improvements and enhancements to PTE A bit scanning approach. 
>> Use of
>> multiple kernel threads, throttling improvements, promotion policies, 
>> per-process
>> opt-in via prctl, virtual vs physical address based scanning, tuning 
>> hot page
>> detection algorithm etc.
> 
> Yuanchu's MGLRU periodic aging series [1] seems quite relevant here,
> you might want to look at it. adding Yuanchu to Cc.

Thank you for pointing that.

> 
> By the way, do you have any reason why you'd prefer opt-in prctl
> over per-memcg control?
>

opt-in prctl came in the MM alignment discussion, and have added that.
per-memcg also definitely makes sense. I am not aware which is the most
used usecase. But adding provision for both with one having more
priority over other may be the way to go.

Overall point here is to save time in unnecessary scanning.
will be adding prctl in the upcoming version to start with.

> [1] https://lore.kernel.org/all/20221214225123.2770216-1- 
> yuanchu@google.com/
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-24  9:02   ` Raghavendra K T
@ 2025-01-27  7:01     ` David Rientjes
  2025-01-27  7:11       ` Raghavendra K T
  0 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2025-01-27  7:01 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Hyeonggon Yoo, linux-mm, akpm, lsf-pc, bharata, kernel_team,
	42.hyeyoo, gourry, nehagholkar, abhishekd, ying.huang, nphamcs,
	hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy,
	k.shutemov, mgorman, vbabka, hughd, shy828301, liam.howlett,
	peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, yuanchu

On Fri, 24 Jan 2025, Raghavendra K T wrote:

> On 1/24/2025 11:23 AM, Hyeonggon Yoo wrote:
> > 
> > 
> > On 1/23/2025 7:57 PM, Raghavendra K T wrote:
> > > Bharata and I would like to propose the following topic for LSFMM.
> > > 
> > > Topic: Overhauling hot page detection and promotion based on PTE A bit
> > > scanning.
> [...]
> > > Here is the list of potential discussion points:
> > > 1. Other improvements and enhancements to PTE A bit scanning approach. Use
> > > of
> > > multiple kernel threads, throttling improvements, promotion policies,
> > > per-process
> > > opt-in via prctl, virtual vs physical address based scanning, tuning hot
> > > page
> > > detection algorithm etc.
> > 
> > Yuanchu's MGLRU periodic aging series [1] seems quite relevant here,
> > you might want to look at it. adding Yuanchu to Cc.
> 
> Thank you for pointing that.
> 

+1.  Yuanchu, do you have ideas for how MGLRU periodic aging and working 
set can play a role in this?

> > By the way, do you have any reason why you'd prefer opt-in prctl
> > over per-memcg control?
> > 
> 
> opt-in prctl came in the MM alignment discussion, and have added that.

Are you planning on sending a refresh of that patch series? :)

> per-memcg also definitely makes sense. I am not aware which is the most
> used usecase. But adding provision for both with one having more
> priority over other may be the way to go.
> 

I would suggest leveraging prctl() for this as opposed to memcg.  I think 
making this part of memcg is beyond the scope for what memcg is intended 
to do, limitation of memory resources, similar to the recent discussions 
on per-cgroup control for THP.

Additionally, the current memcg configuration of the system may also not 
be convenient for using for this purpose, especially if one process should 
be opted out in the memcg hierarchy.  Requiring users to change how their 
memcg is configured just to opt out would be rather unfortunate.

> Overall point here is to save time in unnecessary scanning.
> will be adding prctl in the upcoming version to start with.
> 

Fully agreed.

Thanks very much for proposing this topic, Raghu, I think it will be very 
useful to discuss!  Looking forward to it!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-27  7:01     ` David Rientjes
@ 2025-01-27  7:11       ` Raghavendra K T
  0 siblings, 0 replies; 33+ messages in thread
From: Raghavendra K T @ 2025-01-27  7:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hyeonggon Yoo, linux-mm, akpm, lsf-pc, bharata, kernel_team,
	42.hyeyoo, gourry, nehagholkar, abhishekd, ying.huang, nphamcs,
	hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy,
	k.shutemov, mgorman, vbabka, hughd, shy828301, liam.howlett,
	peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, yuanchu



On 1/27/2025 12:31 PM, David Rientjes wrote:
> On Fri, 24 Jan 2025, Raghavendra K T wrote:
[...]
> 
>>> By the way, do you have any reason why you'd prefer opt-in prctl
>>> over per-memcg control?
>>>
>>
>> opt-in prctl came in the MM alignment discussion, and have added that.
> 
> Are you planning on sending a refresh of that patch series? :)

Hello David,
Current plan is to send by next week.
(Because after measuring the per mm latency and overall latency to do
full scan, I was thinking to add parallel scanning in next version itself).

> 
>> per-memcg also definitely makes sense. I am not aware which is the most
>> used usecase. But adding provision for both with one having more
>> priority over other may be the way to go.
>>
> 
> I would suggest leveraging prctl() for this as opposed to memcg.  I think
> making this part of memcg is beyond the scope for what memcg is intended
> to do, limitation of memory resources, similar to the recent discussions
> on per-cgroup control for THP.
> 
> Additionally, the current memcg configuration of the system may also not
> be convenient for using for this purpose, especially if one process should
> be opted out in the memcg hierarchy.  Requiring users to change how their
> memcg is configured just to opt out would be rather unfortunate.
> 
>> Overall point here is to save time in unnecessary scanning.
>> will be adding prctl in the upcoming version to start with.
>>
> 
> Fully agreed.
> 
> Thanks very much for proposing this topic, Raghu, I think it will be very
> useful to discuss!  Looking forward to it!

Thank you.

- Raghu


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-24  5:53 ` Hyeonggon Yoo
  2025-01-24  9:02   ` Raghavendra K T
@ 2025-02-06  3:14   ` Yuanchu Xie
  1 sibling, 0 replies; 33+ messages in thread
From: Yuanchu Xie @ 2025-02-06  3:14 UTC (permalink / raw)
  To: Hyeonggon Yoo, Raghavendra K T, bharata
  Cc: linux-mm, akpm, lsf-pc, kernel_team, 42.hyeyoo, gourry,
	nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang,
	kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman,
	vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen, Kinsey Ho

On Thu, Jan 23, 2025 at 9:53 PM Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote:
> On 1/23/2025 7:57 PM, Raghavendra K T wrote:
> > Bharata and I would like to propose the following topic for LSFMM.
> >
> > Here is the list of potential discussion points:
> > 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
> > multiple kernel threads, throttling improvements, promotion policies, per-process
> > opt-in via prctl, virtual vs physical address based scanning, tuning hot page
> > detection algorithm etc.
>
> Yuanchu's MGLRU periodic aging series [1] seems quite relevant here,
> you might want to look at it. adding Yuanchu to Cc.

Thanks for the mention, Hyeonggon Yoo.

Working set reporting doesn't aim to promote/demote/reclaim pages, but
to show aggregate stats of the memory in access recency. The periodic
aging part is optional since client devices wouldn't want a background
daemon wasting battery aging lruvecs when nothing is happening.
For the server use case, the aging kthread periodically invoke MGLRU
aging, which performs the PTE A bit scanning. MGLRU handles unmapped
page cache as well for reclaim purposes.

Reading through the kmmscand patch series.
Kmmscand also keeps a list of mm_struct and performs scanning on them,
so given there're many use cases for PTE A bit scanning, this seems
like an opportunity to abstract some of the mm_struct scanning.
Code-wise the A bit scanners do very similar things, and the MGLRU
version has optional optimizations that reduce the scanning overhead.
I wonder if you have considered migrating pages from the MGLRU young
generation of a remote node, or pages that have remained in the young
generation. Some changes to MGLRU would be necessary in that case.

Also adding Kinsey Ho since he's been looking at page promotion as well.

Thanks,
Yuanchu

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
  2025-01-23 18:20 ` SeongJae Park
  2025-01-24  5:53 ` Hyeonggon Yoo
@ 2025-01-26  2:27 ` Huang, Ying
  2025-01-27  5:11   ` Bharata B Rao
  2025-02-07 19:06   ` Davidlohr Bueso
  2025-01-31 12:28 ` Jonathan Cameron
  2025-04-07  3:13 ` Bharata B Rao
  4 siblings, 2 replies; 33+ messages in thread
From: Huang, Ying @ 2025-01-26  2:27 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd,
	nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david,
	willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301,
	liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen

Hi, Raghavendra,

Raghavendra K T <raghavendra.kt@amd.com> writes:

> Bharata and I would like to propose the following topic for LSFMM.
>
> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
>  
> In the Linux kernel, hot page information can potentially be obtained from
> multiple sources:
>  
> a. PROT_NONE faults (NUMA balancing)
> b. PTE Access bit (LRU scanning)
> c. Hardware provided page hotness info (like AMD IBS)
>  
> This information is further used to migrate (or promote) pages from slow memory
> tier to top tier to increase performance.
>
> In the current hot page promotion mechanism, all the activities including the
> process address space scanning, NUMA hint fault handling and page migration are
> performed in the process context. i.e., scanning overhead is borne by the
> applications.
>  
> I had recently posted a patch [1] to improve this in the context of slow-tier
> page promotion. Here, Scanning is done by a global kernel thread which routinely
> scans all the processes' address spaces and checks for accesses by reading the
> PTE A bit. The hot pages thus identified are maintained in list and subsequently
> are promoted to a default top-tier node. Thus, the approach pushes overhead of
> scanning, NUMA hint faults and migrations off from process context.

This has been discussed before too.  For example, in the following thread

https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/

The drawbacks of asynchronous scanning including

- The CPU cycles used are not charged properly

- There may be no idle CPU cycles to use

- The scanning CPU may be not near the workload CPUs enough

It's better to involve Mel and Peter in the discussion for this.

> The topic was presented in the MM alignment session hosted by David Rientjes [2].
> The topic also finds a mention in S J Park's LSFMM proposal [3].
>  
> Here is the list of potential discussion points:
> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
> multiple kernel threads, throttling improvements, promotion policies, per-process
> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
> detection algorithm etc.

One drawback of physical address based scanning is that it's hard to
apply some workload specific policy.  For example, if a low priority
workload has many relatively hot pages, while a high priority workload
has many relative warm (not so hot) pages.  We need to promote the warm
pages in the high priority workload, while physcial address based
scanning may report the hot pages in the low priority workload.  Right?

> 2. Possibility of maintaining single source of truth for page hotness that would
> maintain hot page information from multiple sources and let other sub-systems
> use that info.
>
> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
> promotion. Bharata had posted an RFC [4] on this a while back.
>
> 4. Overlap with DAMON and potential reuse.
>  
> Links:
>
> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/
>  

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-26  2:27 ` Huang, Ying
@ 2025-01-27  5:11   ` Bharata B Rao
  2025-01-27 18:34     ` SeongJae Park
  2025-02-07 19:06   ` Davidlohr Bueso
  1 sibling, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2025-01-27  5:11 UTC (permalink / raw)
  To: Huang, Ying, Raghavendra K T
  Cc: linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, nphamcs,
	hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy,
	k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301,
	liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen

On 26-Jan-25 7:57 AM, Huang, Ying wrote:
> Hi, Raghavendra,
> 
> Raghavendra K T <raghavendra.kt@amd.com> writes:
> 
>> Bharata and I would like to propose the following topic for LSFMM.
>>
>> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
>>   
>> In the Linux kernel, hot page information can potentially be obtained from
>> multiple sources:
>>   
>> a. PROT_NONE faults (NUMA balancing)
>> b. PTE Access bit (LRU scanning)
>> c. Hardware provided page hotness info (like AMD IBS)
>>   
>> This information is further used to migrate (or promote) pages from slow memory
>> tier to top tier to increase performance.
>>
>> In the current hot page promotion mechanism, all the activities including the
>> process address space scanning, NUMA hint fault handling and page migration are
>> performed in the process context. i.e., scanning overhead is borne by the
>> applications.
>>   
>> I had recently posted a patch [1] to improve this in the context of slow-tier
>> page promotion. Here, Scanning is done by a global kernel thread which routinely
>> scans all the processes' address spaces and checks for accesses by reading the
>> PTE A bit. The hot pages thus identified are maintained in list and subsequently
>> are promoted to a default top-tier node. Thus, the approach pushes overhead of
>> scanning, NUMA hint faults and migrations off from process context.
> 
> This has been discussed before too.  For example, in the following thread
> 
> https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/

Thanks for pointing to this discussion.

> 
> The drawbacks of asynchronous scanning including
> 
> - The CPU cycles used are not charged properly
> 
> - There may be no idle CPU cycles to use
> 
> - The scanning CPU may be not near the workload CPUs enough
> 
> It's better to involve Mel and Peter in the discussion for this.

They are CC'ed in this thread and hopefully have insights to share.

Charging CPU cycles to the right process has been brought up in other 
similar contexts. Recent one is from page migration batching and using 
multiple threads for migration - 
https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/

Does it make sense to treat hot page promotion from slow tiers 
differently compared to locality based balancing? I mean couldn't the 
charging of this async thread be similar to the cycles spent by other 
system threads like kcompactd and khugepaged?

> 
>> The topic was presented in the MM alignment session hosted by David Rientjes [2].
>> The topic also finds a mention in S J Park's LSFMM proposal [3].
>>   
>> Here is the list of potential discussion points:
>> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
>> multiple kernel threads, throttling improvements, promotion policies, per-process
>> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
>> detection algorithm etc.
> 
> One drawback of physical address based scanning is that it's hard to
> apply some workload specific policy.  For example, if a low priority
> workload has many relatively hot pages, while a high priority workload
> has many relative warm (not so hot) pages.  We need to promote the warm
> pages in the high priority workload, while physcial address based
> scanning may report the hot pages in the low priority workload.  Right?

Correct. I wonder if DAMON has already devised a scheme to address this. SJ?

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-27  5:11   ` Bharata B Rao
@ 2025-01-27 18:34     ` SeongJae Park
  2025-02-07  8:10       ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: SeongJae Park @ 2025-01-27 18:34 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: SeongJae Park, Huang, Ying, Raghavendra K T, linux-mm, akpm,
	lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes,
	feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov,
	mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett,
	peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen

On Mon, 27 Jan 2025 10:41:07 +0530 Bharata B Rao <bharata@amd.com> wrote:

> On 26-Jan-25 7:57 AM, Huang, Ying wrote:
> > Hi, Raghavendra,
> > 
> > Raghavendra K T <raghavendra.kt@amd.com> writes:
> > 
> >> Bharata and I would like to propose the following topic for LSFMM.
> >>
> >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
> >>   
> >> In the Linux kernel, hot page information can potentially be obtained from
> >> multiple sources:
> >>   
> >> a. PROT_NONE faults (NUMA balancing)
> >> b. PTE Access bit (LRU scanning)
> >> c. Hardware provided page hotness info (like AMD IBS)
> >>   
> >> This information is further used to migrate (or promote) pages from slow memory
> >> tier to top tier to increase performance.
> >>
> >> In the current hot page promotion mechanism, all the activities including the
> >> process address space scanning, NUMA hint fault handling and page migration are
> >> performed in the process context. i.e., scanning overhead is borne by the
> >> applications.
> >>   
> >> I had recently posted a patch [1] to improve this in the context of slow-tier
> >> page promotion. Here, Scanning is done by a global kernel thread which routinely
> >> scans all the processes' address spaces and checks for accesses by reading the
> >> PTE A bit. The hot pages thus identified are maintained in list and subsequently
> >> are promoted to a default top-tier node. Thus, the approach pushes overhead of
> >> scanning, NUMA hint faults and migrations off from process context.
> > 
> > This has been discussed before too.  For example, in the following thread
> > 
> > https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/
> 
> Thanks for pointing to this discussion.
> 
> > 
> > The drawbacks of asynchronous scanning including
> > 
> > - The CPU cycles used are not charged properly
> > 
> > - There may be no idle CPU cycles to use
> > 
> > - The scanning CPU may be not near the workload CPUs enough
> > 
> > It's better to involve Mel and Peter in the discussion for this.
> 
> They are CC'ed in this thread and hopefully have insights to share.
> 
> Charging CPU cycles to the right process has been brought up in other 
> similar contexts. Recent one is from page migration batching and using 
> multiple threads for migration - 
> https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/
> 
> Does it make sense to treat hot page promotion from slow tiers 
> differently compared to locality based balancing? I mean couldn't the 
> charging of this async thread be similar to the cycles spent by other 
> system threads like kcompactd and khugepaged?

I'm up to this idea.

I agree the fairness is a thing that we need to aware of.  But IMHO, it is
something that the async approach can further be advanced for, not a strict
blocker for now.

> 
> > 
> >> The topic was presented in the MM alignment session hosted by David Rientjes [2].
> >> The topic also finds a mention in S J Park's LSFMM proposal [3].
> >>   
> >> Here is the list of potential discussion points:
> >> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
> >> multiple kernel threads, throttling improvements, promotion policies, per-process
> >> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
> >> detection algorithm etc.
> > 
> > One drawback of physical address based scanning is that it's hard to
> > apply some workload specific policy.  For example, if a low priority
> > workload has many relatively hot pages, while a high priority workload
> > has many relative warm (not so hot) pages.  We need to promote the warm
> > pages in the high priority workload, while physcial address based
> > scanning may report the hot pages in the low priority workload.  Right?
> 
> Correct. I wonder if DAMON has already devised a scheme to address this. SJ?

Yes, I think DAMOS quotas and DAMOS filters can be used to address this issue.

For this case, assuming each workload has its own cgroup, users can add a DAMOS
scheme for promotion per workload.  The schemes will have different DAMOS
quotas based on the workloads' priority.  The schemes will also be controlled
to do the promotion for pages of the specific workloads using DAMOS filters.

For example, below kdamond configuration can be used.

# damo args damon \
	--damos_action migrate_hot 0 --damos_quotas 100ms 1G 1s 0% 100% 100% \
	--damos_filter reject none memcg /workloads/high-priority \
	\
	--damos_action migrate_hot 0 --damos_quotas 10ms 100M 1s 0% 100% 100% \
	--damos_filter reject none memcg /workloads/low-priority \
	--damos_nr_filters 1 1 --out kdamond.json
# damo report damon --input_file ./kdamond.json --damon_params_omit_defaults
kdamond 0
    context 0
        ops: paddr
        target 0
            region [4,294,967,296, 68,577,918,975) (59.868 GiB)
        intervals: sample 5 ms, aggr 100 ms, update 1 s
        nr_regions: [10, 1,000]
        scheme 0
            action: migrate_hot to node 0 per aggr interval
            target access pattern
                sz: [0 B, max]
                nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
                age: [0 ns, max]
            quotas
                100 ms / 1024.000 MiB / 0 B per 1 s
                priority: sz 0 %, nr_accesses 100 %, age 100 %
            filter 0
                reject none memcg /workloads/high-priority
        scheme 1
            action: migrate_hot to node 0 per aggr interval
            target access pattern
                sz: [0 B, max]
                nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
                age: [0 ns, max]
            quotas
                10 ms / 100.000 MiB / 0 B per 1 s
                priority: sz 0 %, nr_accesses 100 %, age 100 %
            filter 0
                reject none memcg /workloads/low-priority

Please note that this is just one example based on existing DAMOS features.
This may have drawbacks and future optimizations would be possible.


Thanks,
SJ


> 
> Regards,
> Bharata.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-27 18:34     ` SeongJae Park
@ 2025-02-07  8:10       ` Huang, Ying
  2025-02-07  9:06         ` Gregory Price
  2025-02-07 19:52         ` SeongJae Park
  0 siblings, 2 replies; 33+ messages in thread
From: Huang, Ying @ 2025-02-07  8:10 UTC (permalink / raw)
  To: SeongJae Park, Bharata B Rao
  Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar,
	abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

SeongJae Park <sj@kernel.org> writes:

> On Mon, 27 Jan 2025 10:41:07 +0530 Bharata B Rao <bharata@amd.com> wrote:
>
>> On 26-Jan-25 7:57 AM, Huang, Ying wrote:
>> > Hi, Raghavendra,
>> > 
>> > Raghavendra K T <raghavendra.kt@amd.com> writes:
>> > 
>> >> Bharata and I would like to propose the following topic for LSFMM.
>> >>
>> >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
>> >>   
>> >> In the Linux kernel, hot page information can potentially be obtained from
>> >> multiple sources:
>> >>   
>> >> a. PROT_NONE faults (NUMA balancing)
>> >> b. PTE Access bit (LRU scanning)
>> >> c. Hardware provided page hotness info (like AMD IBS)
>> >>   
>> >> This information is further used to migrate (or promote) pages from slow memory
>> >> tier to top tier to increase performance.
>> >>
>> >> In the current hot page promotion mechanism, all the activities including the
>> >> process address space scanning, NUMA hint fault handling and page migration are
>> >> performed in the process context. i.e., scanning overhead is borne by the
>> >> applications.
>> >>   
>> >> I had recently posted a patch [1] to improve this in the context of slow-tier
>> >> page promotion. Here, Scanning is done by a global kernel thread which routinely
>> >> scans all the processes' address spaces and checks for accesses by reading the
>> >> PTE A bit. The hot pages thus identified are maintained in list and subsequently
>> >> are promoted to a default top-tier node. Thus, the approach pushes overhead of
>> >> scanning, NUMA hint faults and migrations off from process context.
>> > 
>> > This has been discussed before too.  For example, in the following thread
>> > 
>> > https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/
>> 
>> Thanks for pointing to this discussion.
>> 
>> > 
>> > The drawbacks of asynchronous scanning including
>> > 
>> > - The CPU cycles used are not charged properly
>> > 
>> > - There may be no idle CPU cycles to use
>> > 
>> > - The scanning CPU may be not near the workload CPUs enough
>> > 
>> > It's better to involve Mel and Peter in the discussion for this.
>> 
>> They are CC'ed in this thread and hopefully have insights to share.
>> 
>> Charging CPU cycles to the right process has been brought up in other 
>> similar contexts. Recent one is from page migration batching and using 
>> multiple threads for migration - 
>> https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/
>> 
>> Does it make sense to treat hot page promotion from slow tiers 
>> differently compared to locality based balancing? I mean couldn't the 
>> charging of this async thread be similar to the cycles spent by other 
>> system threads like kcompactd and khugepaged?
>
> I'm up to this idea.
>
> I agree the fairness is a thing that we need to aware of.  But IMHO, it is
> something that the async approach can further be advanced for, not a strict
> blocker for now.

Personally, I have no objection to async operations in general.
However, we may need to find some way to control these async operations
instead of adding more and more background kthreads blindly.  How to
charge and constrain the resources used by these async operations is
important too.  For example, some users may want to bind some async
operations on some CPUs.

IMHO, we should think about the requirements and possible solutions
instead of ignoring the issues.

>> 
>> > 
>> >> The topic was presented in the MM alignment session hosted by David Rientjes [2].
>> >> The topic also finds a mention in S J Park's LSFMM proposal [3].
>> >>   
>> >> Here is the list of potential discussion points:
>> >> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
>> >> multiple kernel threads, throttling improvements, promotion policies, per-process
>> >> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
>> >> detection algorithm etc.
>> > 
>> > One drawback of physical address based scanning is that it's hard to
>> > apply some workload specific policy.  For example, if a low priority
>> > workload has many relatively hot pages, while a high priority workload
>> > has many relative warm (not so hot) pages.  We need to promote the warm
>> > pages in the high priority workload, while physcial address based
>> > scanning may report the hot pages in the low priority workload.  Right?
>> 
>> Correct. I wonder if DAMON has already devised a scheme to address this. SJ?
>
> Yes, I think DAMOS quotas and DAMOS filters can be used to address this issue.
>
> For this case, assuming each workload has its own cgroup, users can add a DAMOS
> scheme for promotion per workload.  The schemes will have different DAMOS
> quotas based on the workloads' priority.  The schemes will also be controlled
> to do the promotion for pages of the specific workloads using DAMOS filters.
>
> For example, below kdamond configuration can be used.
>
> # damo args damon \
> 	--damos_action migrate_hot 0 --damos_quotas 100ms 1G 1s 0% 100% 100% \
> 	--damos_filter reject none memcg /workloads/high-priority \
> 	\
> 	--damos_action migrate_hot 0 --damos_quotas 10ms 100M 1s 0% 100% 100% \
> 	--damos_filter reject none memcg /workloads/low-priority \
> 	--damos_nr_filters 1 1 --out kdamond.json
> # damo report damon --input_file ./kdamond.json --damon_params_omit_defaults
> kdamond 0
>     context 0
>         ops: paddr
>         target 0
>             region [4,294,967,296, 68,577,918,975) (59.868 GiB)
>         intervals: sample 5 ms, aggr 100 ms, update 1 s
>         nr_regions: [10, 1,000]
>         scheme 0
>             action: migrate_hot to node 0 per aggr interval
>             target access pattern
>                 sz: [0 B, max]
>                 nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
>                 age: [0 ns, max]
>             quotas
>                 100 ms / 1024.000 MiB / 0 B per 1 s
>                 priority: sz 0 %, nr_accesses 100 %, age 100 %
>             filter 0
>                 reject none memcg /workloads/high-priority
>         scheme 1
>             action: migrate_hot to node 0 per aggr interval
>             target access pattern
>                 sz: [0 B, max]
>                 nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
>                 age: [0 ns, max]
>             quotas
>                 10 ms / 100.000 MiB / 0 B per 1 s
>                 priority: sz 0 %, nr_accesses 100 %, age 100 %
>             filter 0
>                 reject none memcg /workloads/low-priority
>
> Please note that this is just one example based on existing DAMOS features.
> This may have drawbacks and future optimizations would be possible.

IIUC, this is something like,

physical address -> struct page -> cgroup -> per-cgroup hot threshold

this sounds good to me.  Thanks!

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-02-07  8:10       ` Huang, Ying
@ 2025-02-07  9:06         ` Gregory Price
  2025-02-07 19:52         ` SeongJae Park
  1 sibling, 0 replies; 33+ messages in thread
From: Gregory Price @ 2025-02-07  9:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: SeongJae Park, Bharata B Rao, Raghavendra K T, linux-mm, akpm,
	lsf-pc, nehagholkar, abhishekd, nphamcs, hannes, feng.tang,
	kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On Fri, Feb 07, 2025 at 04:10:47PM +0800, Huang, Ying wrote:
> SeongJae Park <sj@kernel.org> writes:
> >
> > I agree the fairness is a thing that we need to aware of.  But IMHO, it is
> > something that the async approach can further be advanced for, not a strict
> > blocker for now.
> 
> Personally, I have no objection to async operations in general.
> However, we may need to find some way to control these async operations
> instead of adding more and more background kthreads blindly.  How to
> charge and constrain the resources used by these async operations is
> important too.  For example, some users may want to bind some async
> operations on some CPUs.
> 
> IMHO, we should think about the requirements and possible solutions
> instead of ignoring the issues.
>

It also concerns me that most every proposal on async promotion ignores
the promotion-node selection problem as if it's a secondary issue.

Async systems fundamentally lack accessor-locality information unless it
is recorded - and recording this information is expensive and/or
heuristically imprecise for memory shared across tasks (two threads in
the same process schedule across sockets).

If we can't agree on a solution to this problem, it undercuts many of
these RFCs which often simply hard-code the target node to "0" because
it's too hard or too expensive to consider the multi-socket scenario.

~Gregory


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-02-07  8:10       ` Huang, Ying
  2025-02-07  9:06         ` Gregory Price
@ 2025-02-07 19:52         ` SeongJae Park
  1 sibling, 0 replies; 33+ messages in thread
From: SeongJae Park @ 2025-02-07 19:52 UTC (permalink / raw)
  To: Huang, Ying
  Cc: SeongJae Park, Bharata B Rao, Raghavendra K T, linux-mm, akpm,
	lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes,
	feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov,
	mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett,
	peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen

On Fri, 07 Feb 2025 16:10:47 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> SeongJae Park <sj@kernel.org> writes:
> 
> > On Mon, 27 Jan 2025 10:41:07 +0530 Bharata B Rao <bharata@amd.com> wrote:
> >
> >> On 26-Jan-25 7:57 AM, Huang, Ying wrote:
> >> > Hi, Raghavendra,
> >> > 
> >> > Raghavendra K T <raghavendra.kt@amd.com> writes:
[...]
> >> > The drawbacks of asynchronous scanning including
> >> > 
> >> > - The CPU cycles used are not charged properly
> >> > 
> >> > - There may be no idle CPU cycles to use
> >> > 
> >> > - The scanning CPU may be not near the workload CPUs enough
> >> > 
> >> > It's better to involve Mel and Peter in the discussion for this.
> >> 
> >> They are CC'ed in this thread and hopefully have insights to share.
> >> 
> >> Charging CPU cycles to the right process has been brought up in other 
> >> similar contexts. Recent one is from page migration batching and using 
> >> multiple threads for migration - 
> >> https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/
> >> 
> >> Does it make sense to treat hot page promotion from slow tiers 
> >> differently compared to locality based balancing? I mean couldn't the 
> >> charging of this async thread be similar to the cycles spent by other 
> >> system threads like kcompactd and khugepaged?
> >
> > I'm up to this idea.
> >
> > I agree the fairness is a thing that we need to aware of.  But IMHO, it is
> > something that the async approach can further be advanced for, not a strict
> > blocker for now.
> 
> Personally, I have no objection to async operations in general.
> However, we may need to find some way to control these async operations
> instead of adding more and more background kthreads blindly.  How to
> charge and constrain the resources used by these async operations is
> important too.  For example, some users may want to bind some async
> operations on some CPUs.
> 
> IMHO, we should think about the requirements and possible solutions
> instead of ignoring the issues.

I agree.  For DAMON, we implemented DAMOS quotas feature for such resource
control.  We also had a (non-public) discussion about splitting DAMON thread
for monitoring part and operation schemes execution parts for finer control.
I'm also thinking about making the quotas for monitoring part resource
consumption.  We didn't implement the ideas yet since the requirements on
real-world is unclear as of now, though.  We will keep collecting the
requirements and prioritize those or make another solution as the requirements
becomes clearer.

[...]
> >> > One drawback of physical address based scanning is that it's hard to
> >> > apply some workload specific policy.  For example, if a low priority
> >> > workload has many relatively hot pages, while a high priority workload
> >> > has many relative warm (not so hot) pages.  We need to promote the warm
> >> > pages in the high priority workload, while physcial address based
> >> > scanning may report the hot pages in the low priority workload.  Right?
> >> 
> >> Correct. I wonder if DAMON has already devised a scheme to address this. SJ?
> >
> > Yes, I think DAMOS quotas and DAMOS filters can be used to address this issue.
> >
> > For this case, assuming each workload has its own cgroup, users can add a DAMOS
> > scheme for promotion per workload.  The schemes will have different DAMOS
> > quotas based on the workloads' priority.  The schemes will also be controlled
> > to do the promotion for pages of the specific workloads using DAMOS filters.
> >
> > For example, below kdamond configuration can be used.
> >
> > # damo args damon \
> > 	--damos_action migrate_hot 0 --damos_quotas 100ms 1G 1s 0% 100% 100% \
> > 	--damos_filter reject none memcg /workloads/high-priority \
> > 	\
> > 	--damos_action migrate_hot 0 --damos_quotas 10ms 100M 1s 0% 100% 100% \
> > 	--damos_filter reject none memcg /workloads/low-priority \
> > 	--damos_nr_filters 1 1 --out kdamond.json
> > # damo report damon --input_file ./kdamond.json --damon_params_omit_defaults
> > kdamond 0
> >     context 0
> >         ops: paddr
> >         target 0
> >             region [4,294,967,296, 68,577,918,975) (59.868 GiB)
> >         intervals: sample 5 ms, aggr 100 ms, update 1 s
> >         nr_regions: [10, 1,000]
> >         scheme 0
> >             action: migrate_hot to node 0 per aggr interval
> >             target access pattern
> >                 sz: [0 B, max]
> >                 nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
> >                 age: [0 ns, max]
> >             quotas
> >                 100 ms / 1024.000 MiB / 0 B per 1 s
> >                 priority: sz 0 %, nr_accesses 100 %, age 100 %
> >             filter 0
> >                 reject none memcg /workloads/high-priority
> >         scheme 1
> >             action: migrate_hot to node 0 per aggr interval
> >             target access pattern
> >                 sz: [0 B, max]
> >                 nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
> >                 age: [0 ns, max]
> >             quotas
> >                 10 ms / 100.000 MiB / 0 B per 1 s
> >                 priority: sz 0 %, nr_accesses 100 %, age 100 %
> >             filter 0
> >                 reject none memcg /workloads/low-priority
> >
> > Please note that this is just one example based on existing DAMOS features.
> > This may have drawbacks and future optimizations would be possible.
> 
> IIUC, this is something like,
> 
> physical address -> struct page -> cgroup -> per-cgroup hot threshold

You're right.

> 
> this sounds good to me.  Thanks!

Happy to hear that, and looking forward to contiue improving it further with
you! :)


Thanks,
SJ

> 
> ---
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-26  2:27 ` Huang, Ying
  2025-01-27  5:11   ` Bharata B Rao
@ 2025-02-07 19:06   ` Davidlohr Bueso
  2025-03-14  1:56     ` Raghavendra K T
  1 sibling, 1 reply; 33+ messages in thread
From: Davidlohr Bueso @ 2025-02-07 19:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, bharata, gourry,
	nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch,
	Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen, dongjoo.linux.dev

On Sun, 26 Jan 2025, Huang, Ying wrote:

>Hi, Raghavendra,
>
>Raghavendra K T <raghavendra.kt@amd.com> writes:
>
>> Bharata and I would like to propose the following topic for LSFMM.
>>
>> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
>>
>> In the Linux kernel, hot page information can potentially be obtained from
>> multiple sources:
>>
>> a. PROT_NONE faults (NUMA balancing)
>> b. PTE Access bit (LRU scanning)
>> c. Hardware provided page hotness info (like AMD IBS)
>>
>> This information is further used to migrate (or promote) pages from slow memory
>> tier to top tier to increase performance.
>>
>> In the current hot page promotion mechanism, all the activities including the
>> process address space scanning, NUMA hint fault handling and page migration are
>> performed in the process context. i.e., scanning overhead is borne by the
>> applications.
>>
>> I had recently posted a patch [1] to improve this in the context of slow-tier
>> page promotion. Here, Scanning is done by a global kernel thread which routinely
>> scans all the processes' address spaces and checks for accesses by reading the
>> PTE A bit. The hot pages thus identified are maintained in list and subsequently
>> are promoted to a default top-tier node. Thus, the approach pushes overhead of
>> scanning, NUMA hint faults and migrations off from process context.

It seems that overall having a global view of hot memory is where folks are leaning
towards. In the past we have discussed an external thread to harvest information
from different sources and do the corresponding migration. I think your work is a
step in this direction (and shows promising numbers), but I'm not sure if it should
be doing the scanning part, as opposed to just receive the information and migrate
(according to some policy based on a wider system view of what is hot; ie: what CHMU
says is hot might not be so hot to the rest of the system, or as is pointed out
below, workload based, as priorities).

>
>This has been discussed before too.  For example, in the following thread
>
>https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/
>
>The drawbacks of asynchronous scanning including
>
>- The CPU cycles used are not charged properly
>
>- There may be no idle CPU cycles to use
>
>- The scanning CPU may be not near the workload CPUs enough

One approach we experimented with was doing only the page migration asynchronously,
leaving the scanning to the task context, which also knows the dest numa node.
Results showed that page fault latencies were reduced without affecting benchmark
performance. Of course busy systems are an issue, as the window between servicing
the fault and actually making it available to the user in fast memory is enlarged.

>It's better to involve Mel and Peter in the discussion for this.
>
>> The topic was presented in the MM alignment session hosted by David Rientjes [2].
>> The topic also finds a mention in S J Park's LSFMM proposal [3].
>>
>> Here is the list of potential discussion points:
>> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
>> multiple kernel threads, throttling improvements, promotion policies, per-process
>> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
>> detection algorithm etc.
>
>One drawback of physical address based scanning is that it's hard to
>apply some workload specific policy.  For example, if a low priority
>workload has many relatively hot pages, while a high priority workload
>has many relative warm (not so hot) pages.  We need to promote the warm
>pages in the high priority workload, while physcial address based
>scanning may report the hot pages in the low priority workload.  Right?
>
>> 2. Possibility of maintaining single source of truth for page hotness that would
>> maintain hot page information from multiple sources and let other sub-systems
>> use that info.
>>
>> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
>> promotion. Bharata had posted an RFC [4] on this a while back.
>>
>> 4. Overlap with DAMON and potential reuse.
>>
>> Links:
>>
>> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
>> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
>> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
>> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/
>>
>
>---
>Best Regards,
>Huang, Ying
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-02-07 19:06   ` Davidlohr Bueso
@ 2025-03-14  1:56     ` Raghavendra K T
  2025-03-14  2:12       ` Raghavendra K T
  0 siblings, 1 reply; 33+ messages in thread
From: Raghavendra K T @ 2025-03-14  1:56 UTC (permalink / raw)
  To: Huang, Ying, linux-mm, akpm, lsf-pc, bharata, gourry,
	nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch,
	Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen, dongjoo.linux.dev



On 2/8/2025 12:36 AM, Davidlohr Bueso wrote:
> On Sun, 26 Jan 2025, Huang, Ying wrote:
> 
>> Hi, Raghavendra,
>>
>> Raghavendra K T <raghavendra.kt@amd.com> writes:
>>
>>> Bharata and I would like to propose the following topic for LSFMM.
>>>
>>> Topic: Overhauling hot page detection and promotion based on PTE A 
>>> bit scanning.
>>>
>>> In the Linux kernel, hot page information can potentially be obtained 
>>> from
>>> multiple sources:
>>>
>>> a. PROT_NONE faults (NUMA balancing)
>>> b. PTE Access bit (LRU scanning)
>>> c. Hardware provided page hotness info (like AMD IBS)
>>>
>>> This information is further used to migrate (or promote) pages from 
>>> slow memory
>>> tier to top tier to increase performance.
>>>
>>> In the current hot page promotion mechanism, all the activities 
>>> including the
>>> process address space scanning, NUMA hint fault handling and page 
>>> migration are
>>> performed in the process context. i.e., scanning overhead is borne by 
>>> the
>>> applications.
>>>
>>> I had recently posted a patch [1] to improve this in the context of 
>>> slow-tier
>>> page promotion. Here, Scanning is done by a global kernel thread 
>>> which routinely
>>> scans all the processes' address spaces and checks for accesses by 
>>> reading the
>>> PTE A bit. The hot pages thus identified are maintained in list and 
>>> subsequently
>>> are promoted to a default top-tier node. Thus, the approach pushes 
>>> overhead of
>>> scanning, NUMA hint faults and migrations off from process context.
> 
> It seems that overall having a global view of hot memory is where folks 
> are leaning
> towards. In the past we have discussed an external thread to harvest 
> information
> from different sources and do the corresponding migration. I think your 
> work is a
> step in this direction (and shows promising numbers), but I'm not sure 
> if it should
> be doing the scanning part, as opposed to just receive the information 
> and migrate
> (according to some policy based on a wider system view of what is hot; 
> ie: what CHMU
> says is hot might not be so hot to the rest of the system, or as is 
> pointed out
> below, workload based, as priorities).
> 
>>
>> This has been discussed before too.  For example, in the following thread
>>
>> https://lore.kernel.org/ 
>> all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/
>>
>> The drawbacks of asynchronous scanning including
>>
>> - The CPU cycles used are not charged properly
>>
>> - There may be no idle CPU cycles to use
>>
>> - The scanning CPU may be not near the workload CPUs enough
> 
> One approach we experimented with was doing only the page migration 
> asynchronously,
> leaving the scanning to the task context, which also knows the dest numa 
> node.
> Results showed that page fault latencies were reduced without affecting 
> benchmark
> performance. Of course busy systems are an issue, as the window between 
> servicing
> the fault and actually making it available to the user in fast memory is 
> enlarged.
> 
>> It's better to involve Mel and Peter in the discussion for this.
>>
>>> The topic was presented in the MM alignment session hosted by David 
>>> Rientjes [2].
>>> The topic also finds a mention in S J Park's LSFMM proposal [3].
>>>
>>> Here is the list of potential discussion points:
>>> 1. Other improvements and enhancements to PTE A bit scanning 
>>> approach. Use of
>>> multiple kernel threads, throttling improvements, promotion policies, 
>>> per-process
>>> opt-in via prctl, virtual vs physical address based scanning, tuning 
>>> hot page
>>> detection algorithm etc.
>>
>> One drawback of physical address based scanning is that it's hard to
>> apply some workload specific policy.  For example, if a low priority
>> workload has many relatively hot pages, while a high priority workload
>> has many relative warm (not so hot) pages.  We need to promote the warm
>> pages in the high priority workload, while physcial address based
>> scanning may report the hot pages in the low priority workload.  Right?
>>
>>> 2. Possibility of maintaining single source of truth for page hotness 
>>> that would
>>> maintain hot page information from multiple sources and let other 
>>> sub-systems
>>> use that info.
>>>
>>> 3. Discuss how hardware provided hotness info (like AMD IBS) can 
>>> further aid
>>> promotion. Bharata had posted an RFC [4] on this a while back.
>>>
>>> 4. Overlap with DAMON and potential reuse.
>>>
>>> Links:
>>>
>>> [1] https://lore.kernel.org/all/20241201153818.2633616-1- 
>>> raghavendra.kt@amd.com/
>>> [2] https://lore.kernel.org/linux- 
>>> mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
>>> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora- 
>>> PF4VCD3F/T/
>>> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/


Hello All,
Sorry to comeback late on this. But after "Unifying source of page
temperature discussion",
I was trying to get one step closer towards that. (along with Bharata).
(also sometime spent on failed muti-threaded scanning that perhaps needs
more time if it is needed).

I am posting a single patch which is still in "raw" state (as reply to
this email). I will cleanup, split the patch and post early next week.

Sending this so to have a gist of what is coming atleast before LSFMM.

So here are the list of implemented feedback that we can build further
(depending on the consensus).

1. Scanning and migration is separated. A separate migration thread is
created.

Potential improvements that can be done here:
  - Have one instance of migration thread per node.
  - API to accept hot pages for promotion from different sources
	(for e.g., IBS / LRU as Bharata already mentioned)
  - Controlling throttling similar to what Huang has done in NUMAB=2 case
  - Take both PFN and folio as argument for migration
  - Make use of batch migration enhancements
  - usage of per mm migration list to have a easy lookup and control
(using  mmslot, This also helps build upon identifying actual hot pages 
(2 subsequent access) than single access.)

2. Implemented David's (Rientjes) suggestion of having a prctl approach.
Currently prctl values can range from 0..10.
	0 is for disabling
	>1 for enabling. But in the future idea is to use this as controlling 
scan rate further.

3. Steves' comment on tracing incorporated

4. Davidlohr's reported issue on the path series is fixed

5. Very importantly,
I do have a basic algorithm that detects "target node for migration"
which was the main pain point for PTE A bit scanning.

Algorithm:
As part of our scanning we are doing, scan of top tier pages also.
During the scan, How many pages
    - scanned/accessed that belongs to particular toptier/slowtier node
is also recorded.
Currently my algorithm chooses the toptier node that had the maximum
pages scanned.

But we can really build complex algorithm using scanned/accessed recently.
(for e.g. decay last scanned/accessed info, if current topteir node
becomes nearly becomes full find next preferred node, thus using
nodemask/or preferred list instead of single node etc).

Potential improvements on scanning part can be use of complex data 
structures to maintain area of hotpages similar to what DAMON is doing
or reuse some infrastructure from DAMON.

Thanks and Regards
- Raghu




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-03-14  1:56     ` Raghavendra K T
@ 2025-03-14  2:12       ` Raghavendra K T
  0 siblings, 0 replies; 33+ messages in thread
From: Raghavendra K T @ 2025-03-14  2:12 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, abhishekd,
	akpm, bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang,
	gourry, hannes, honggyu.kim, hughd, jhubbard, jon.grimm,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, linux-kernel, linux-mm, lsf-pc, mgorman, mingo,
	nadav.amit, nehagholkar, nphamcs, peterz, riel, rientjes, rppt,
	santosh.shukla, shivankg, shy828301, sj, vbabka, weixugc, willy,
	ying.huang, ziy

PTE A bit scanning single patch RFC v1

---8x---
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 09f0aed5a08b..78633cab3f1a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -195,6 +195,7 @@ read the file /proc/PID/status::
   VmLib:      1412 kB
   VmPTE:        20 kb
   VmSwap:        0 kB
+  PTEAScanScale: 0
   HugetlbPages:          0 kB
   CoreDumping:    0
   THP_enabled:	  1
@@ -278,6 +279,7 @@ It's slow but very precise.
  VmPTE                       size of page table entries
  VmSwap                      amount of swap used by anonymous private data
                              (shmem swap usage is not included)
+ PTEAScanScale               Integer representing async PTE A bit scan agrression
  HugetlbPages                size of hugetlb memory portions
  CoreDumping                 process's memory is currently being dumped
                              (killing the process may lead to a corrupted core)
diff --git a/fs/exec.c b/fs/exec.c
index 506cd411f4ac..e76285e4bc73 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -68,6 +68,7 @@
 #include <linux/user_events.h>
 #include <linux/rseq.h>
 #include <linux/ksm.h>
+#include <linux/kmmscand.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -266,6 +267,8 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	if (err)
 		goto err_ksm;
 
+	kmmscand_execve(mm);
+
 	/*
 	 * Place the stack at the largest stack address the architecture
 	 * supports. Later, we'll move this to an appropriate place. We don't
@@ -288,6 +291,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	return 0;
 err:
 	ksm_exit(mm);
+	kmmscand_exit(mm);
 err_ksm:
 	mmap_write_unlock(mm);
 err_free:
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f02cd362309a..55620a5178fb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -79,6 +79,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
 	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
 	seq_puts(m, " kB\n");
+#ifdef CONFIG_KMMSCAND
+	seq_put_decimal_ull_width(m, "PTEAScanScale:\t", mm->pte_scan_scale, 8);
+	seq_puts(m, "\n");
+#endif
 	hugetlb_report_usage(m, mm);
 }
 #undef SEQ_PUT_DEC
diff --git a/include/linux/kmmscand.h b/include/linux/kmmscand.h
new file mode 100644
index 000000000000..7021f7d979a6
--- /dev/null
+++ b/include/linux/kmmscand.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KMMSCAND_H_
+#define _LINUX_KMMSCAND_H_
+
+#ifdef CONFIG_KMMSCAND
+extern void __kmmscand_enter(struct mm_struct *mm);
+extern void __kmmscand_exit(struct mm_struct *mm);
+
+static inline void kmmscand_execve(struct mm_struct *mm)
+{
+	__kmmscand_enter(mm);
+}
+
+static inline void kmmscand_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	mm->pte_scan_scale = oldmm->pte_scan_scale;
+	__kmmscand_enter(mm);
+}
+
+static inline void kmmscand_exit(struct mm_struct *mm)
+{
+	__kmmscand_exit(mm);
+}
+#else /* !CONFIG_KMMSCAND */
+static inline void __kmmscand_enter(struct mm_struct *mm) {}
+static inline void __kmmscand_exit(struct mm_struct *mm) {}
+static inline void kmmscand_execve(struct mm_struct *mm) {}
+static inline void kmmscand_fork(struct mm_struct *mm, struct mm_struct *oldmm) {}
+static inline void kmmscand_exit(struct mm_struct *mm) {}
+#endif
+#endif /* _LINUX_KMMSCAND_H_ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b1068ddcbb7..fbd9273f6c65 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -682,6 +682,18 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_KMMSCAND
+void count_kmmscand_mm_scans(void);
+void count_kmmscand_vma_scans(void);
+void count_kmmscand_migadded(void);
+void count_kmmscand_migrated(void);
+void count_kmmscand_migrate_failed(void);
+void count_kmmscand_kzalloc_fail(void);
+void count_kmmscand_slowtier(void);
+void count_kmmscand_toptier(void);
+void count_kmmscand_idlepage(void);
+#endif
+
 #ifdef CONFIG_NUMA_BALANCING
 static inline void vma_numab_state_init(struct vm_area_struct *vma)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..7950554f7447 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1014,6 +1014,14 @@ struct mm_struct {
 
 		/* numa_scan_seq prevents two threads remapping PTEs. */
 		int numa_scan_seq;
+#endif
+#ifdef CONFIG_KMMSCAND
+		/* Tracks promotion node. XXX: use nodemask */
+		int target_node;
+
+		/* Integer representing PTE A bit scan aggression (0-10) */
+		unsigned int pte_scan_scale;
+
 #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..620c1b1c157a 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -65,6 +65,17 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
 #endif
+#ifdef CONFIG_KMMSCAND
+		KMMSCAND_MM_SCANS,
+		KMMSCAND_VMA_SCANS,
+		KMMSCAND_MIGADDED,
+		KMMSCAND_MIGRATED,
+		KMMSCAND_MIGRATE_FAILED,
+		KMMSCAND_KZALLOC_FAIL,
+		KMMSCAND_SLOWTIER,
+		KMMSCAND_TOPTIER,
+		KMMSCAND_IDLEPAGE,
+#endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 		THP_MIGRATION_SUCCESS,
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index b37eb0a7060f..be1a7188a192 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -9,6 +9,96 @@
 #include <linux/tracepoint.h>
 #include <trace/events/mmflags.h>
 
+DECLARE_EVENT_CLASS(kmem_mm_class,
+
+	TP_PROTO(struct mm_struct *mm),
+
+	TP_ARGS(mm),
+
+	TP_STRUCT__entry(
+		__field(	struct mm_struct *, mm		)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+	),
+
+	TP_printk("mm = %p", __entry->mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_mm_enter,
+	TP_PROTO(struct mm_struct *mm),
+	TP_ARGS(mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_mm_exit,
+	TP_PROTO(struct mm_struct *mm),
+	TP_ARGS(mm)
+);
+
+DEFINE_EVENT(kmem_mm_class, kmem_scan_mm_start,
+	TP_PROTO(struct mm_struct *mm),
+	TP_ARGS(mm)
+);
+
+TRACE_EVENT(kmem_scan_mm_end,
+
+	TP_PROTO( struct mm_struct *mm,
+		 unsigned long start,
+		 unsigned long total,
+		 unsigned long scan_period,
+		 unsigned long scan_size,
+		 int target_node),
+
+	TP_ARGS(mm, start, total, scan_period, scan_size, target_node),
+
+	TP_STRUCT__entry(
+		__field(	struct mm_struct *, mm		)
+		__field(	unsigned long,   start		)
+		__field(	unsigned long,   total		)
+		__field(	unsigned long,   scan_period	)
+		__field(	unsigned long,   scan_size	)
+		__field(	int,   		 target_node	)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->start = start;
+		__entry->total = total;
+		__entry->scan_period  = scan_period;
+		__entry->scan_size    = scan_size;
+		__entry->target_node  = target_node;
+	),
+
+	TP_printk("mm=%p, start = %ld, total = %ld, scan_period = %ld, scan_size = %ld node = %d",
+		__entry->mm, __entry->start, __entry->total, __entry->scan_period,
+		__entry->scan_size, __entry->target_node)
+);
+
+TRACE_EVENT(kmem_scan_mm_migrate,
+
+	TP_PROTO(struct mm_struct *mm,
+		 int rc,
+		 int target_node),
+
+	TP_ARGS(mm, rc, target_node),
+
+	TP_STRUCT__entry(
+		__field(	struct mm_struct *, mm		)
+		__field(	int,   rc			)
+		__field(	int,   target_node		)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->rc = rc;
+		__entry->target_node = target_node;
+	),
+
+	TP_printk("mm = %p rc = %d node = %d",
+		__entry->mm, __entry->rc, __entry->target_node)
+);
+
 TRACE_EVENT(kmem_cache_alloc,
 
 	TP_PROTO(unsigned long call_site,
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 5c6080680cb2..18face11440a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -353,4 +353,11 @@ struct prctl_mm_map {
  */
 #define PR_LOCK_SHADOW_STACK_STATUS      76
 
+/* Set/get PTE A bit scan scale */
+#define PR_SET_PTE_A_SCAN_SCALE		77
+#define PR_GET_PTE_A_SCAN_SCALE		78
+# define PR_PTE_A_SCAN_SCALE_MIN	0
+# define PR_PTE_A_SCAN_SCALE_MAX	10
+# define PR_PTE_A_SCAN_SCALE_DEFAULT	1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..bfbbacb8ec36 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -85,6 +85,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
+#include <linux/kmmscand.h>
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
 #include <linux/aio.h>
@@ -105,6 +106,7 @@
 #include <uapi/linux/pidfd.h>
 #include <linux/pidfs.h>
 #include <linux/tick.h>
+#include <linux/prctl.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -656,6 +658,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	mm->exec_vm = oldmm->exec_vm;
 	mm->stack_vm = oldmm->stack_vm;
 
+	kmmscand_fork(mm, oldmm);
+
 	/* Use __mt_dup() to efficiently build an identical maple tree. */
 	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
 	if (unlikely(retval))
@@ -1289,6 +1293,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
 	mm->pmd_huge_pte = NULL;
+#endif
+#ifdef CONFIG_KMMSCAND
+	mm->pte_scan_scale = PR_PTE_A_SCAN_SCALE_DEFAULT;
 #endif
 	mm_init_uprobes_state(mm);
 	hugetlb_count_init(mm);
@@ -1353,6 +1360,7 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
+	kmmscand_exit(mm);
 	exit_mmap(mm);
 	mm_put_huge_zero_folio(mm);
 	set_mm_exe_file(mm, NULL);
diff --git a/kernel/sys.c b/kernel/sys.c
index cb366ff8703a..0518480d8f78 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2142,6 +2142,19 @@ static int prctl_set_auxv(struct mm_struct *mm, unsigned long addr,
 
 	return 0;
 }
+#ifdef CONFIG_KMMSCAND
+static int prctl_pte_scan_scale_write(unsigned int scale)
+{
+	scale = clamp(scale, PR_PTE_A_SCAN_SCALE_MIN, PR_PTE_A_SCAN_SCALE_MAX);
+	current->mm->pte_scan_scale = scale;
+	return 0;
+}
+
+static unsigned int prctl_pte_scan_scale_read(void)
+{
+	return current->mm->pte_scan_scale;
+}
+#endif
 
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
@@ -2811,6 +2824,18 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = arch_lock_shadow_stack_status(me, arg2);
 		break;
+#ifdef CONFIG_KMMSCAND
+	case PR_SET_PTE_A_SCAN_SCALE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = prctl_pte_scan_scale_write((unsigned int) arg2);
+		break;
+	case PR_GET_PTE_A_SCAN_SCALE:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = prctl_pte_scan_scale_read();
+		break;
+#endif
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..529bf140e1f7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -783,6 +783,13 @@ config KSM
 	  until a program has madvised that an area is MADV_MERGEABLE, and
 	  root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
 
+config KMMSCAND
+	bool "Enable PTE A bit scanning and Migration"
+	depends on NUMA_BALANCING
+	help
+	  Enable PTE A bit scanning of page. CXL pages accessed are migrated to
+	  regular NUMA node (node 0 - default).
+
 config DEFAULT_MMAP_MIN_ADDR
 	int "Low address space to protect from user allocation"
 	depends on MMU
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..45e2f8cc8fd6 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -94,6 +94,7 @@ obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
+obj-$(CONFIG_KMMSCAND) += kmmscand.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/kmmscand.c b/mm/kmmscand.c
new file mode 100644
index 000000000000..2fc1b46cf512
--- /dev/null
+++ b/mm/kmmscand.c
@@ -0,0 +1,1505 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/migrate.h>
+#include <linux/rmap.h>
+#include <linux/pagewalk.h>
+#include <linux/page_ext.h>
+#include <linux/page_idle.h>
+#include <linux/page_table_check.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/kmmscand.h>
+#include <linux/memory-tiers.h>
+#include <linux/mempolicy.h>
+#include <linux/string.h>
+#include <linux/cleanup.h>
+#include <linux/minmax.h>
+#include <linux/delay.h>
+#include <trace/events/kmem.h>
+
+#include <asm/pgalloc.h>
+#include "internal.h"
+#include "mm_slot.h"
+
+static struct task_struct *kmmscand_thread __read_mostly;
+static DEFINE_MUTEX(kmmscand_mutex);
+extern unsigned int sysctl_numa_balancing_scan_delay;
+
+/*
+ * Total VMA size to cover during scan.
+ * Min: 256MB default: 1GB max: 4GB
+ */
+#define KMMSCAND_SCAN_SIZE_MIN	(256 * 1024 * 1024UL)
+#define KMMSCAND_SCAN_SIZE_MAX	(8 * 1024 * 1024 * 1024UL)
+#define KMMSCAND_SCAN_SIZE	(2 * 1024 * 1024 * 1024UL)
+
+static unsigned long kmmscand_scan_size __read_mostly = KMMSCAND_SCAN_SIZE;
+
+/*
+ * Scan period for each mm.
+ * Min: 400ms default: 2sec Max: 5sec
+ */
+#define KMMSCAND_SCAN_PERIOD_MAX	5000U
+#define KMMSCAND_SCAN_PERIOD_MIN	400U
+#define KMMSCAND_SCAN_PERIOD		2000U
+
+static unsigned int kmmscand_mm_scan_period_ms __read_mostly = KMMSCAND_SCAN_PERIOD;
+
+/* How long to pause between two scan and migration cycle */
+static unsigned int kmmscand_scan_sleep_ms __read_mostly = 16;
+
+/* Max number of mms to scan in one scan and migration cycle */
+#define KMMSCAND_MMS_TO_SCAN	(4 * 1024UL)
+static unsigned long kmmscand_mms_to_scan __read_mostly = KMMSCAND_MMS_TO_SCAN;
+
+volatile bool kmmscand_scan_enabled = true;
+static bool need_wakeup;
+static bool migrated_need_wakeup;
+
+/* How long to pause between two migration cycles */
+static unsigned int kmmmigrate_sleep_ms __read_mostly = 20;
+
+static struct task_struct *kmmmigrated_thread __read_mostly;
+static DEFINE_MUTEX(kmmmigrated_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(kmmmigrated_wait);
+static unsigned long kmmmigrated_sleep_expire;
+
+/* mm of the migrating folio entry */
+static struct mm_struct *kmmscand_cur_migrate_mm;
+
+/* Migration list is manipulated underneath because of mm_exit */
+static bool  kmmscand_migration_list_dirty;
+
+static unsigned long kmmscand_sleep_expire;
+#define KMMSCAND_DEFAULT_TARGET_NODE	(0)
+static int kmmscand_target_node = KMMSCAND_DEFAULT_TARGET_NODE;
+
+static DEFINE_SPINLOCK(kmmscand_mm_lock);
+static DEFINE_SPINLOCK(kmmscand_migrate_lock);
+static DECLARE_WAIT_QUEUE_HEAD(kmmscand_wait);
+
+#define KMMSCAND_SLOT_HASH_BITS 10
+static DEFINE_READ_MOSTLY_HASHTABLE(kmmscand_slots_hash, KMMSCAND_SLOT_HASH_BITS);
+
+static struct kmem_cache *kmmscand_slot_cache __read_mostly;
+
+struct kmmscand_nodeinfo {
+	unsigned long nr_scanned;
+	unsigned long nr_accessed;
+	int node;
+	bool is_toptier;
+};
+
+struct kmmscand_mm_slot {
+	struct mm_slot slot;
+	/* Unit: ms. Determines how aften mm scan should happen. */
+	unsigned int scan_period;
+	unsigned long next_scan;
+	/* Tracks how many useful pages obtained for migration in the last scan */
+	unsigned long scan_delta;
+	/* Determines how much VMA address space to be covered in the scanning */
+	unsigned long scan_size;
+	long address;
+	volatile bool is_scanned;
+	int target_node;
+};
+
+struct kmmscand_scan {
+	struct list_head mm_head;
+	struct kmmscand_mm_slot *mm_slot;
+};
+
+struct kmmscand_scan kmmscand_scan = {
+	.mm_head = LIST_HEAD_INIT(kmmscand_scan.mm_head),
+};
+
+struct kmmscand_scanctrl {
+	struct list_head scan_list;
+	struct kmmscand_nodeinfo *nodeinfo[MAX_NUMNODES];
+};
+
+struct kmmscand_scanctrl kmmscand_scanctrl;
+
+struct kmmscand_migrate_list {
+	struct list_head migrate_head;
+};
+
+struct kmmscand_migrate_list kmmscand_migrate_list = {
+	.migrate_head = LIST_HEAD_INIT(kmmscand_migrate_list.migrate_head),
+};
+
+struct kmmscand_migrate_info {
+	struct list_head migrate_node;
+	struct mm_struct *mm;
+	struct folio *folio;
+	unsigned long address;
+};
+
+#ifdef CONFIG_SYSFS
+static ssize_t scan_sleep_ms_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kmmscand_scan_sleep_ms);
+}
+
+static ssize_t scan_sleep_ms_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int msecs;
+	int err;
+
+	err = kstrtouint(buf, 10, &msecs);
+	if (err)
+		return -EINVAL;
+
+	kmmscand_scan_sleep_ms = msecs;
+	kmmscand_sleep_expire = 0;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+static struct kobj_attribute scan_sleep_ms_attr =
+	__ATTR_RW(scan_sleep_ms);
+
+static ssize_t mm_scan_period_ms_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kmmscand_mm_scan_period_ms);
+}
+
+/* If a value less than MIN or greater than MAX asked for store value is clamped */
+static ssize_t mm_scan_period_ms_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int msecs, stored_msecs;
+	int err;
+
+	err = kstrtouint(buf, 10, &msecs);
+	if (err)
+		return -EINVAL;
+
+	stored_msecs = clamp(msecs, KMMSCAND_SCAN_PERIOD_MIN, KMMSCAND_SCAN_PERIOD_MAX);
+
+	kmmscand_mm_scan_period_ms = stored_msecs;
+	kmmscand_sleep_expire = 0;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute mm_scan_period_ms_attr =
+	__ATTR_RW(mm_scan_period_ms);
+
+static ssize_t mms_to_scan_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%lu\n", kmmscand_mms_to_scan);
+}
+
+static ssize_t mms_to_scan_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	err = kstrtoul(buf, 10, &val);
+	if (err)
+		return -EINVAL;
+
+	kmmscand_mms_to_scan = val;
+	kmmscand_sleep_expire = 0;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute mms_to_scan_attr =
+	__ATTR_RW(mms_to_scan);
+
+static ssize_t scan_enabled_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kmmscand_scan_enabled ? 1 : 0);
+}
+
+static ssize_t scan_enabled_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int val;
+	int err;
+
+	err = kstrtouint(buf, 10, &val);
+	if (err || val > 1)
+		return -EINVAL;
+
+	if (val) {
+		kmmscand_scan_enabled = true;
+		need_wakeup = true;
+	} else
+		kmmscand_scan_enabled = false;
+
+	kmmscand_sleep_expire = 0;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute scan_enabled_attr =
+	__ATTR_RW(scan_enabled);
+
+static ssize_t target_node_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kmmscand_target_node);
+}
+
+static ssize_t target_node_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int err, node;
+
+	err = kstrtoint(buf, 10, &node);
+	if (err)
+		return -EINVAL;
+
+	kmmscand_sleep_expire = 0;
+	if (!node_is_toptier(node))
+		return -EINVAL;
+
+	kmmscand_target_node = node;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+static struct kobj_attribute target_node_attr =
+	__ATTR_RW(target_node);
+
+static struct attribute *kmmscand_attr[] = {
+	&scan_sleep_ms_attr.attr,
+	&mm_scan_period_ms_attr.attr,
+	&mms_to_scan_attr.attr,
+	&scan_enabled_attr.attr,
+	&target_node_attr.attr,
+	NULL,
+};
+
+struct attribute_group kmmscand_attr_group = {
+	.attrs = kmmscand_attr,
+	.name = "kmmscand",
+};
+#endif
+
+void count_kmmscand_mm_scans(void)
+{
+	count_vm_numa_event(KMMSCAND_MM_SCANS);
+}
+void count_kmmscand_vma_scans(void)
+{
+	count_vm_numa_event(KMMSCAND_VMA_SCANS);
+}
+void count_kmmscand_migadded(void)
+{
+	count_vm_numa_event(KMMSCAND_MIGADDED);
+}
+void count_kmmscand_migrated(void)
+{
+	count_vm_numa_event(KMMSCAND_MIGRATED);
+}
+void count_kmmscand_migrate_failed(void)
+{
+	count_vm_numa_event(KMMSCAND_MIGRATE_FAILED);
+}
+void count_kmmscand_kzalloc_fail(void)
+{
+	count_vm_numa_event(KMMSCAND_KZALLOC_FAIL);
+}
+void count_kmmscand_slowtier(void)
+{
+	count_vm_numa_event(KMMSCAND_SLOWTIER);
+}
+void count_kmmscand_toptier(void)
+{
+	count_vm_numa_event(KMMSCAND_TOPTIER);
+}
+void count_kmmscand_idlepage(void)
+{
+	count_vm_numa_event(KMMSCAND_IDLEPAGE);
+}
+
+static int kmmscand_has_work(void)
+{
+	return !list_empty(&kmmscand_scan.mm_head);
+}
+
+static int kmmmigrated_has_work(void)
+{
+	if (!list_empty(&kmmscand_migrate_list.migrate_head))
+		return true;
+	return false;
+}
+
+static bool kmmscand_should_wakeup(void)
+{
+	bool wakeup =  kthread_should_stop() || need_wakeup ||
+	       time_after_eq(jiffies, kmmscand_sleep_expire);
+	if (need_wakeup)
+		need_wakeup = false;
+
+	return wakeup;
+}
+
+static bool kmmmigrated_should_wakeup(void)
+{
+	bool wakeup =  kthread_should_stop() || migrated_need_wakeup ||
+	       time_after_eq(jiffies, kmmmigrated_sleep_expire);
+	if (migrated_need_wakeup)
+		migrated_need_wakeup = false;
+
+	return wakeup;
+}
+
+static void kmmscand_wait_work(void)
+{
+	const unsigned long scan_sleep_jiffies =
+		msecs_to_jiffies(kmmscand_scan_sleep_ms);
+
+	if (!scan_sleep_jiffies)
+		return;
+
+	kmmscand_sleep_expire = jiffies + scan_sleep_jiffies;
+	wait_event_timeout(kmmscand_wait,
+			kmmscand_should_wakeup(),
+			scan_sleep_jiffies);
+	return;
+}
+
+static void kmmmigrated_wait_work(void)
+{
+	const unsigned long migrate_sleep_jiffies =
+		msecs_to_jiffies(kmmmigrate_sleep_ms);
+
+       if (!migrate_sleep_jiffies)
+		return;
+
+	kmmmigrated_sleep_expire = jiffies + migrate_sleep_jiffies;
+	wait_event_timeout(kmmmigrated_wait,
+			kmmmigrated_should_wakeup(),
+			migrate_sleep_jiffies);
+	return;
+}
+
+static unsigned long get_slowtier_accesed(struct kmmscand_scanctrl *scanctrl)
+{
+	int node;
+	unsigned long accessed = 0;
+
+	for_each_node_state(node, N_MEMORY) {
+		if (!node_is_toptier(node))
+			accessed += (scanctrl->nodeinfo[node])->nr_accessed;
+	}
+	return accessed;
+}
+
+static inline unsigned long get_nodeinfo_nr_accessed(struct kmmscand_nodeinfo *ni)
+{
+	return ni->nr_accessed;
+}
+
+static inline void set_nodeinfo_nr_accessed(struct kmmscand_nodeinfo *ni, unsigned long val)
+{
+	ni->nr_accessed = val;
+}
+
+static inline void reset_nodeinfo_nr_accessed(struct kmmscand_nodeinfo *ni)
+{
+	set_nodeinfo_nr_accessed(ni, 0);
+}
+
+static inline void nodeinfo_nr_accessed_inc(struct kmmscand_nodeinfo *ni)
+{
+	ni->nr_accessed++;
+}
+
+static inline unsigned long get_nodeinfo_nr_scanned(struct kmmscand_nodeinfo *ni)
+{
+	return ni->nr_scanned;
+}
+
+static inline void set_nodeinfo_nr_scanned(struct kmmscand_nodeinfo *ni, unsigned long val)
+{
+	ni->nr_scanned = val;
+}
+
+static inline void reset_nodeinfo_nr_scanned(struct kmmscand_nodeinfo *ni)
+{
+	set_nodeinfo_nr_scanned(ni, 0);
+}
+
+static inline void nodeinfo_nr_scanned_inc(struct kmmscand_nodeinfo *ni)
+{
+	ni->nr_scanned++;
+}
+
+
+static inline void reset_nodeinfo(struct kmmscand_nodeinfo *ni)
+{
+	set_nodeinfo_nr_scanned(ni, 0);
+	set_nodeinfo_nr_accessed(ni, 0);
+}
+
+static void init_one_nodeinfo(struct kmmscand_nodeinfo *ni, int node)
+{
+	ni->nr_scanned = 0;
+	ni->nr_accessed = 0;
+	ni->node = node;
+	ni->is_toptier = node_is_toptier(node) ? true : false;
+}
+
+static struct kmmscand_nodeinfo *alloc_one_nodeinfo(int node)
+{
+	struct kmmscand_nodeinfo *ni;
+
+	ni = kzalloc(sizeof(*ni), GFP_KERNEL);
+
+	if (!ni)
+		return NULL;
+
+	init_one_nodeinfo(ni, node);
+
+	return ni;
+}
+
+/* TBD: Handle errors */
+static void init_scanctrl(struct kmmscand_scanctrl *scanctrl)
+{
+	struct kmmscand_nodeinfo *ni;
+	int node;
+	for_each_node_state(node, N_MEMORY) {
+		ni = alloc_one_nodeinfo(node);
+		if (!ni)
+			WARN_ON_ONCE(ni);
+		scanctrl->nodeinfo[node] = ni;
+	}
+	pr_warn("scan ctrl init %d", node);
+}
+
+static void reset_scanctrl(struct kmmscand_scanctrl *scanctrl)
+{
+	int node;
+	for_each_node_state(node, N_MEMORY)
+		reset_nodeinfo(scanctrl->nodeinfo[node]);
+}
+
+static bool kmmscand_eligible_srcnid(int nid)
+{
+	if (!node_is_toptier(nid))
+		return true;
+
+	return false;
+}
+
+/*
+ * Do not know what info to pass in the future to make
+ * decision on taget node. Keep it void * now.
+ */
+static int kmmscand_get_target_node(void *data)
+{
+	return kmmscand_target_node;
+}
+
+static int get_target_node(struct kmmscand_scanctrl *scanctrl)
+{
+	int node, target_node = -9999;
+	unsigned long old_accessed = 0;
+
+	for_each_node(node) {
+		if (get_nodeinfo_nr_scanned(scanctrl->nodeinfo[node]) > old_accessed
+			&& node_is_toptier(node)) {
+			old_accessed = get_nodeinfo_nr_scanned(scanctrl->nodeinfo[node]);
+			target_node = node;
+		}
+	}
+	if (target_node == -9999)
+		target_node = kmmscand_get_target_node(NULL);
+
+	return target_node;
+}
+
+extern bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+					unsigned long nr_migrate_pages);
+
+/*XXX: Taken from migrate.c to avoid NUMAB mode=2 and NULL vma checks*/
+static int kmmscand_migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node)
+{
+	int nr_pages = folio_nr_pages(folio);
+	pg_data_t *pgdat = NODE_DATA(node);
+
+	if (folio_is_file_lru(folio)) {
+		/*
+		 * Do not migrate file folios that are mapped in multiple
+		 * processes with execute permissions as they are probably
+		 * shared libraries.
+		 *
+		 * See folio_likely_mapped_shared() on possible imprecision
+		 * when we cannot easily detect if a folio is shared.
+		 */
+		if (vma && (vma->vm_flags & VM_EXEC) &&
+		    folio_likely_mapped_shared(folio))
+			return -EACCES;
+		/*
+		 * Do not migrate dirty folios as not all filesystems can move
+		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
+		 * cycles.
+		 */
+		if (folio_test_dirty(folio))
+			return -EAGAIN;
+	}
+
+	/* Avoid migrating to a node that is nearly full */
+	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
+		int z;
+
+		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+			if (managed_zone(pgdat->node_zones + z))
+				break;
+		}
+
+		/*
+		 * If there are no managed zones, it should not proceed
+		 * further.
+		 */
+		if (z < 0)
+			return -EAGAIN;
+
+		wakeup_kswapd(pgdat->node_zones + z, 0,
+			      folio_order(folio), ZONE_MOVABLE);
+		return -EAGAIN;
+	}
+
+	if (!folio_isolate_lru(folio))
+		return -EAGAIN;
+
+	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
+			    nr_pages);
+
+	return 0;
+}
+
+enum kmmscand_migration_err {
+	KMMSCAND_NULL_MM = 1,
+	KMMSCAND_EXITING_MM,
+	KMMSCAND_INVALID_FOLIO,
+	KMMSCAND_INVALID_VMA,
+	KMMSCAND_INELIGIBLE_SRC_NODE,
+	KMMSCAND_SAME_SRC_DEST_NODE,
+	KMMSCAND_PTE_NOT_PRESENT,
+	KMMSCAND_PMD_NOT_PRESENT,
+	KMMSCAND_NO_PTE_OFFSET_MAP_LOCK,
+	KMMSCAND_LRU_ISOLATION_ERR,
+};
+
+static int kmmscand_promote_folio(struct kmmscand_migrate_info *info, int destnid)
+{
+	unsigned long pfn;
+	unsigned long address;
+	struct page *page;
+	struct folio *folio;
+	int ret;
+	struct mm_struct *mm;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+	pmd_t pmde;
+	int srcnid;
+
+	if (info->mm == NULL)
+		return KMMSCAND_NULL_MM;
+
+	if (info->mm == READ_ONCE(kmmscand_cur_migrate_mm) &&
+		READ_ONCE(kmmscand_migration_list_dirty)) {
+		WARN_ON_ONCE(mm);
+		return KMMSCAND_EXITING_MM;
+	}
+
+	mm = info->mm;
+
+	folio = info->folio;
+
+	/* Check again if the folio is really valid now */
+	if (folio) {
+		pfn = folio_pfn(folio);
+		page = pfn_to_online_page(pfn);
+	}
+
+	if (!page || PageTail(page) || !folio || !folio_test_lru(folio) || folio_test_unevictable(folio) ||
+		folio_is_zone_device(folio) || !folio_mapped(folio) || folio_likely_mapped_shared(folio))
+		return KMMSCAND_INVALID_FOLIO;
+
+	folio_get(folio);
+
+	srcnid = folio_nid(folio);
+
+	/* Do not try to promote pages from regular nodes */
+	if (!kmmscand_eligible_srcnid(srcnid)) {
+		folio_put(folio);
+		return KMMSCAND_INELIGIBLE_SRC_NODE;
+	}
+
+
+	if (srcnid == destnid) {
+		folio_put(folio);
+		return KMMSCAND_SAME_SRC_DEST_NODE;
+	}
+	address = info->address;
+	pmd = pmd_off(mm, address);
+	pmde = pmdp_get(pmd);
+
+	if (!pmd_present(pmde)) {
+		folio_put(folio);
+		return KMMSCAND_PMD_NOT_PRESENT;
+	}
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	if (!pte) {
+		folio_put(folio);
+		WARN_ON_ONCE(!pte);
+		return KMMSCAND_NO_PTE_OFFSET_MAP_LOCK;
+	}
+
+
+	ret = kmmscand_migrate_misplaced_folio_prepare(folio, NULL, destnid);
+	if (ret) {
+		folio_put(folio);
+		pte_unmap_unlock(pte, ptl);
+		return KMMSCAND_LRU_ISOLATION_ERR;
+	}
+
+	folio_put(folio);
+	pte_unmap_unlock(pte, ptl);
+
+	return  migrate_misplaced_folio(folio, destnid);
+}
+
+static bool folio_idle_clear_pte_refs_one(struct folio *folio,
+					 struct vm_area_struct *vma,
+					 unsigned long addr,
+					 pte_t *ptep)
+{
+	bool referenced = false;
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd = pmd_off(mm, addr);
+
+	if (ptep) {
+		if (ptep_clear_young_notify(vma, addr, ptep))
+			referenced = true;
+	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		if (!pmd_present(*pmd))
+			WARN_ON_ONCE(1);
+		if (pmdp_clear_young_notify(vma, addr, pmd))
+			referenced = true;
+	} else {
+		WARN_ON_ONCE(1);
+	}
+
+	if (referenced) {
+		folio_clear_idle(folio);
+		folio_set_young(folio);
+	}
+
+	return true;
+}
+
+static void page_idle_clear_pte_refs(struct page *page, pte_t *pte, struct mm_walk *walk)
+{
+	bool need_lock;
+	struct folio *folio =  page_folio(page);
+	unsigned long address;
+
+	if (!folio_mapped(folio) || !folio_raw_mapping(folio))
+		return;
+
+	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
+	if (need_lock && !folio_trylock(folio))
+		return;
+	address = vma_address(walk->vma, page_pgoff(folio, page), compound_nr(page));
+	VM_BUG_ON_VMA(address == -EFAULT, walk->vma);
+	folio_idle_clear_pte_refs_one(folio, walk->vma, address, pte);
+
+	if (need_lock)
+		folio_unlock(folio);
+}
+
+static int hot_vma_idle_pte_entry(pte_t *pte,
+				 unsigned long addr,
+				 unsigned long next,
+				 struct mm_walk *walk)
+{
+	struct page *page;
+	struct folio *folio;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct kmmscand_migrate_info *info;
+	struct kmmscand_scanctrl *scanctrl = walk->private;
+
+	int srcnid;
+
+	pte_t pteval = ptep_get(pte);
+
+	if (!pte_present(pteval))
+		return 1;
+
+	if (pte_none(pteval))
+		return 1;
+
+	vma = walk->vma;
+	mm = vma->vm_mm;
+
+	page = pte_page(*pte);
+
+	page_idle_clear_pte_refs(page, pte, walk);
+
+	folio = page_folio(page);
+	folio_get(folio);
+
+	if (!folio || folio_is_zone_device(folio) || folio_test_unevictable(folio)
+		|| !folio_mapped(folio) || folio_likely_mapped_shared(folio)) {
+		folio_put(folio);
+		return 1;
+	}
+
+	srcnid = folio_nid(folio);
+
+	if (node_is_toptier(srcnid)) {
+		scanctrl->nodeinfo[srcnid]->nr_scanned++;
+		count_kmmscand_toptier();
+	}
+
+	if (!folio_test_idle(folio) || folio_test_young(folio) ||
+			mmu_notifier_test_young(mm, addr) ||
+			folio_test_referenced(folio) || pte_young(pteval)) {
+
+		/* TBD: Use helpers */
+		scanctrl->nodeinfo[srcnid]->nr_accessed++;
+
+		/* Do not try to promote pages from regular nodes */
+		if (!kmmscand_eligible_srcnid(srcnid))
+			goto end;
+
+		info = kzalloc(sizeof(struct kmmscand_migrate_info), GFP_NOWAIT);
+		if (info && scanctrl) {
+
+			count_kmmscand_slowtier();
+			info->mm = mm;
+			info->address = addr;
+			info->folio = folio;
+
+			/* No need of lock now */
+			list_add_tail(&info->migrate_node, &scanctrl->scan_list);
+
+			count_kmmscand_migadded();
+		}
+	} else
+		count_kmmscand_idlepage();
+end:
+	folio_set_idle(folio);
+	folio_put(folio);
+	return 0;
+}
+
+static const struct mm_walk_ops hot_vma_set_idle_ops = {
+	.pte_entry = hot_vma_idle_pte_entry,
+	.walk_lock = PGWALK_RDLOCK,
+};
+
+static void kmmscand_walk_page_vma(struct vm_area_struct *vma, struct kmmscand_scanctrl *scanctrl)
+{
+	if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
+	    is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+		return;
+	}
+	if (!vma->vm_mm ||
+	    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+		return;
+
+	if (!vma_is_accessible(vma))
+		return;
+
+	walk_page_vma(vma, &hot_vma_set_idle_ops, scanctrl);
+}
+
+static inline int kmmscand_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static void kmmscand_cleanup_migration_list(struct mm_struct *mm)
+{
+	struct kmmscand_migrate_info *info, *tmp;
+
+	spin_lock(&kmmscand_migrate_lock);
+	if (!list_empty(&kmmscand_migrate_list.migrate_head)) {
+		if (mm == READ_ONCE(kmmscand_cur_migrate_mm)) {
+			/* A folio in this mm is being migrated. wait */
+			WRITE_ONCE(kmmscand_migration_list_dirty, true);
+		}
+
+		list_for_each_entry_safe(info, tmp, &kmmscand_migrate_list.migrate_head,
+			migrate_node) {
+			if (info && (info->mm == mm)) {
+				info->mm = NULL;
+				WRITE_ONCE(kmmscand_migration_list_dirty, true);
+			}
+		}
+	}
+	spin_unlock(&kmmscand_migrate_lock);
+}
+
+static void kmmscand_collect_mm_slot(struct kmmscand_mm_slot *mm_slot)
+{
+	struct mm_slot *slot = &mm_slot->slot;
+	struct mm_struct *mm = slot->mm;
+
+	lockdep_assert_held(&kmmscand_mm_lock);
+
+	if (kmmscand_test_exit(mm)) {
+		/* free mm_slot */
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+
+		kmmscand_cleanup_migration_list(mm);
+
+		mm_slot_free(kmmscand_slot_cache, mm_slot);
+		mmdrop(mm);
+	} else {
+		WARN_ON_ONCE(mm_slot);
+		mm_slot->is_scanned = false;
+	}
+}
+
+static void kmmscand_migrate_folio(void)
+{
+	int ret = 0, dest = -1;
+	struct mm_struct *oldmm = NULL;
+	struct kmmscand_migrate_info *info, *tmp;
+
+	spin_lock(&kmmscand_migrate_lock);
+
+	if (!list_empty(&kmmscand_migrate_list.migrate_head)) {
+		list_for_each_entry_safe(info, tmp, &kmmscand_migrate_list.migrate_head,
+			migrate_node) {
+			if (READ_ONCE(kmmscand_migration_list_dirty)) {
+				kmmscand_migration_list_dirty = false;
+				list_del(&info->migrate_node);
+				/*
+				 * Do not try to migrate this entry because mm might have
+				 * vanished underneath.
+				 */
+				kfree(info);
+				spin_unlock(&kmmscand_migrate_lock);
+				goto dirty_list_handled;
+			}
+
+			list_del(&info->migrate_node);
+			/* Note down the mm of folio entry we are migrating */
+			WRITE_ONCE(kmmscand_cur_migrate_mm, info->mm);
+			spin_unlock(&kmmscand_migrate_lock);
+
+			if (info->mm) {
+				if (oldmm != info->mm) {
+					if(!mmap_read_trylock(info->mm)) {
+						dest = kmmscand_get_target_node(NULL);
+					} else {
+						dest = READ_ONCE(info->mm->target_node);
+						mmap_read_unlock(info->mm);
+					}
+					oldmm = info->mm;
+				}
+
+				ret = kmmscand_promote_folio(info, dest);
+				trace_kmem_scan_mm_migrate(info->mm, ret, dest);
+			}
+
+			/* TBD: encode migrated count here, currently assume folio_nr_pages */
+			if (!ret)
+				count_kmmscand_migrated();
+			else
+				count_kmmscand_migrate_failed();
+
+			kfree(info);
+
+			spin_lock(&kmmscand_migrate_lock);
+			/* Reset  mm  of folio entry we are migrating */
+			WRITE_ONCE(kmmscand_cur_migrate_mm, NULL);
+			spin_unlock(&kmmscand_migrate_lock);
+dirty_list_handled:
+			cond_resched();
+			spin_lock(&kmmscand_migrate_lock);
+		}
+	}
+	spin_unlock(&kmmscand_migrate_lock);
+}
+
+/*
+ * This is the normal change percentage when old and new delta remain same.
+ * i.e., either both positive or both zero.
+ */
+#define SCAN_PERIOD_TUNE_PERCENT	15
+
+/* This is to change the scan_period aggressively when deltas are different */
+#define SCAN_PERIOD_CHANGE_SCALE	3
+/*
+ * XXX: Hack to prevent unmigrated pages coming again and again while scanning.
+ * Actual fix needs to identify the type of unmigrated pages OR consider migration
+ * failures in next scan.
+ */
+#define KMMSCAND_IGNORE_SCAN_THR	100
+
+#define SCAN_SIZE_CHANGE_SCALE	1
+/*
+ * X : Number of useful pages in the last scan.
+ * Y : Number of useful pages found in current scan.
+ * Tuning scan_period:
+ *	Initial scan_period is 2s.
+ *	case 1: (X = 0, Y = 0)
+ *		Increase scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ *	case 2: (X = 0, Y > 0)
+ *		Decrease scan_period by (2 << SCAN_PERIOD_CHANGE_SCALE).
+ *	case 3: (X > 0, Y = 0 )
+ *		Increase scan_period by (2 << SCAN_PERIOD_CHANGE_SCALE).
+ *	case 4: (X > 0, Y > 0)
+ *		Decrease scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ * Tuning scan_size:
+ * Initial scan_size is 4GB
+ *	case 1: (X = 0, Y = 0)
+ *		Decrease scan_size by (1 << SCAN_SIZE_CHANGE_SCALE).
+ *	case 2: (X = 0, Y > 0)
+ *		scan_size = KMMSCAND_SCAN_SIZE_MAX
+ *  case 3: (X > 0, Y = 0 )
+ *		No change
+ *  case 4: (X > 0, Y > 0)
+ *		Increase scan_size by (1 << SCAN_SIZE_CHANGE_SCALE).
+ */
+static inline void kmmscand_update_mmslot_info(struct kmmscand_mm_slot *mm_slot,
+				unsigned long total, int target_node)
+{
+	unsigned int scan_period;
+	unsigned long now;
+	unsigned long scan_size;
+	unsigned long old_scan_delta;
+
+	/* XXX: Hack to get rid of continuously failing/unmigrateable pages */
+	if (total < KMMSCAND_IGNORE_SCAN_THR)
+		total = 0;
+
+	scan_period = mm_slot->scan_period;
+	scan_size = mm_slot->scan_size;
+
+	old_scan_delta = mm_slot->scan_delta;
+
+	/*
+	 * case 1: old_scan_delta and new delta are similar, (slow) TUNE_PERCENT used.
+	 * case 2: old_scan_delta and new delta are different. (fast) CHANGE_SCALE used.
+	 * TBD:
+	 * 1. Further tune scan_period based on delta between last and current scan delta.
+	 * 2. Optimize calculation
+	 */
+	if (!old_scan_delta && !total) {
+		scan_period = (100 + SCAN_PERIOD_TUNE_PERCENT) * scan_period;
+		scan_period /= 100;
+		scan_size = scan_size >> SCAN_SIZE_CHANGE_SCALE;
+	} else if (old_scan_delta && total) {
+		scan_period = (100 - SCAN_PERIOD_TUNE_PERCENT) * scan_period;
+		scan_period /= 100;
+		scan_size = scan_size << SCAN_SIZE_CHANGE_SCALE;
+	} else if (old_scan_delta && !total) {
+		scan_period = scan_period << SCAN_PERIOD_CHANGE_SCALE;
+	} else {
+		scan_period = scan_period >> SCAN_PERIOD_CHANGE_SCALE;
+		scan_size = KMMSCAND_SCAN_SIZE_MAX;
+	}
+
+	scan_period = clamp(scan_period, KMMSCAND_SCAN_PERIOD_MIN, KMMSCAND_SCAN_PERIOD_MAX);
+	scan_size = clamp(scan_size, KMMSCAND_SCAN_SIZE_MIN, KMMSCAND_SCAN_SIZE_MAX);
+
+	now = jiffies;
+	mm_slot->next_scan = now + msecs_to_jiffies(scan_period);
+	mm_slot->scan_period = scan_period;
+	mm_slot->scan_size = scan_size;
+	mm_slot->scan_delta = total;
+	mm_slot->target_node = target_node;
+}
+
+static unsigned long kmmscand_scan_mm_slot(void)
+{
+	bool next_mm = false;
+	bool update_mmslot_info = false;
+
+	unsigned int mm_slot_scan_period;
+	int target_node, mm_slot_target_node, mm_target_node;
+	unsigned long now;
+	unsigned long mm_slot_next_scan;
+	unsigned long mm_slot_scan_size;
+	unsigned long scanned_size = 0;
+	unsigned long address;
+	unsigned long total = 0;
+
+	struct mm_slot *slot;
+	struct mm_struct *mm;
+	struct vma_iterator vmi;
+	struct vm_area_struct *vma = NULL;
+	struct kmmscand_mm_slot *mm_slot;
+
+	/* Retrieve mm */
+	spin_lock(&kmmscand_mm_lock);
+
+	if (kmmscand_scan.mm_slot) {
+		mm_slot = kmmscand_scan.mm_slot;
+		slot = &mm_slot->slot;
+		address = mm_slot->address;
+	} else {
+		slot = list_entry(kmmscand_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		mm_slot = mm_slot_entry(slot, struct kmmscand_mm_slot, slot);
+		address = mm_slot->address;
+		kmmscand_scan.mm_slot = mm_slot;
+	}
+
+	mm = slot->mm;
+	mm_slot->is_scanned = true;
+	mm_slot_next_scan = mm_slot->next_scan;
+	mm_slot_scan_period = mm_slot->scan_period;
+	mm_slot_scan_size = mm_slot->scan_size;
+	mm_slot_target_node = mm_slot->target_node;
+	spin_unlock(&kmmscand_mm_lock);
+
+	if (unlikely(!mmap_read_trylock(mm)))
+		goto outerloop_mmap_lock;
+
+	if (unlikely(kmmscand_test_exit(mm))) {
+		next_mm = true;
+		goto outerloop;
+	}
+
+	if (!mm->pte_scan_scale) {
+		next_mm = true;
+		goto outerloop;
+	}
+
+	mm_target_node = READ_ONCE(mm->target_node);
+	/* XXX: Do we need write lock? */
+	if (mm_target_node != mm_slot_target_node)
+		WRITE_ONCE(mm->target_node, mm_slot_target_node);
+
+	trace_kmem_scan_mm_start(mm);
+
+	now = jiffies;
+
+	if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
+		goto outerloop;
+
+	vma_iter_init(&vmi, mm, address);
+	vma = vma_next(&vmi);
+	if (!vma) {
+		address = 0;
+		vma_iter_set(&vmi, address);
+		vma = vma_next(&vmi);
+	}
+
+	for_each_vma(vmi, vma) {
+		/* Count the scanned pages here to decide exit */
+		kmmscand_walk_page_vma(vma, &kmmscand_scanctrl);
+		count_kmmscand_vma_scans();
+		scanned_size += vma->vm_end - vma->vm_start;
+		address = vma->vm_end;
+
+		if (scanned_size >= mm_slot_scan_size) {
+			total = get_slowtier_accesed(&kmmscand_scanctrl);
+
+			/* If we had got accessed pages, ignore the current scan_size threshold */
+			if (total > KMMSCAND_IGNORE_SCAN_THR) {
+				mm_slot_scan_size = KMMSCAND_SCAN_SIZE_MAX;
+				continue;
+			}
+			next_mm = true;
+			break;
+		}
+
+		/* Add scanned folios to migration list */
+		spin_lock(&kmmscand_migrate_lock);
+		list_splice_tail_init(&kmmscand_scanctrl.scan_list, &kmmscand_migrate_list.migrate_head);
+		spin_unlock(&kmmscand_migrate_lock);
+	}
+
+	if (!vma)
+		address = 0;
+
+	update_mmslot_info = true;
+
+	count_kmmscand_mm_scans();
+
+	total = get_slowtier_accesed(&kmmscand_scanctrl);
+	target_node = get_target_node(&kmmscand_scanctrl);
+
+	mm_target_node = READ_ONCE(mm->target_node);
+
+	/* XXX: Do we need write lock? */
+	if (mm_target_node != target_node)
+		WRITE_ONCE(mm->target_node, target_node);
+
+	reset_scanctrl(&kmmscand_scanctrl);
+
+	if (update_mmslot_info) {
+		mm_slot->address = address;
+		kmmscand_update_mmslot_info(mm_slot, total, target_node);
+	}
+
+	trace_kmem_scan_mm_end(mm, address, total, mm_slot_scan_period,
+			mm_slot_scan_size, target_node);
+
+outerloop:
+	/* exit_mmap will destroy ptes after this */
+	mmap_read_unlock(mm);
+
+outerloop_mmap_lock:
+	spin_lock(&kmmscand_mm_lock);
+	VM_BUG_ON(kmmscand_scan.mm_slot != mm_slot);
+
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (unlikely(kmmscand_test_exit(mm)) || !vma || next_mm) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * kmmscand runs here, kmmscand_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		WARN_ON_ONCE(current->rcu_read_lock_nesting < 0);
+		if (slot->mm_node.next != &kmmscand_scan.mm_head) {
+			slot = list_entry(slot->mm_node.next,
+					struct mm_slot, mm_node);
+			kmmscand_scan.mm_slot =
+				mm_slot_entry(slot, struct kmmscand_mm_slot, slot);
+
+		} else
+			kmmscand_scan.mm_slot = NULL;
+
+		WARN_ON_ONCE(current->rcu_read_lock_nesting < 0);
+		if (kmmscand_test_exit(mm)) {
+			kmmscand_collect_mm_slot(mm_slot);
+			WARN_ON_ONCE(current->rcu_read_lock_nesting < 0);
+			goto end;
+		}
+	}
+	mm_slot->is_scanned = false;
+end:
+	spin_unlock(&kmmscand_mm_lock);
+	return total;
+}
+
+static void kmmscand_do_scan(void)
+{
+	unsigned long iter = 0, mms_to_scan;
+
+	mms_to_scan = READ_ONCE(kmmscand_mms_to_scan);
+
+	while (true) {
+		cond_resched();
+
+		if (unlikely(kthread_should_stop()) || !READ_ONCE(kmmscand_scan_enabled))
+			break;
+
+		if (kmmscand_has_work())
+			kmmscand_scan_mm_slot();
+
+		iter++;
+		if (iter >= mms_to_scan)
+			break;
+	}
+}
+
+static int kmmscand(void *none)
+{
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+
+		kmmscand_do_scan();
+
+		while (!READ_ONCE(kmmscand_scan_enabled)) {
+			cpu_relax();
+			kmmscand_wait_work();
+		}
+
+		kmmscand_wait_work();
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SYSFS
+extern struct kobject *mm_kobj;
+static int __init kmmscand_init_sysfs(struct kobject **kobj)
+{
+	int err;
+
+	err = sysfs_create_group(*kobj, &kmmscand_attr_group);
+	if (err) {
+		pr_err("failed to register kmmscand group\n");
+		goto err_kmmscand_attr;
+	}
+
+	return 0;
+
+err_kmmscand_attr:
+	sysfs_remove_group(*kobj, &kmmscand_attr_group);
+	return err;
+}
+
+static void __init kmmscand_exit_sysfs(struct kobject *kobj)
+{
+		sysfs_remove_group(kobj, &kmmscand_attr_group);
+}
+#else
+static inline int __init kmmscand_init_sysfs(struct kobject **kobj)
+{
+	return 0;
+}
+static inline void __init kmmscand_exit_sysfs(struct kobject *kobj)
+{
+}
+#endif
+
+static inline void kmmscand_destroy(void)
+{
+	kmem_cache_destroy(kmmscand_slot_cache);
+	kmmscand_exit_sysfs(mm_kobj);
+}
+
+void __kmmscand_enter(struct mm_struct *mm)
+{
+	struct kmmscand_mm_slot *kmmscand_slot;
+	struct mm_slot *slot;
+	unsigned long now;
+	int wakeup;
+
+	/* __kmmscand_exit() must not run from under us */
+	VM_BUG_ON_MM(kmmscand_test_exit(mm), mm);
+
+	kmmscand_slot = mm_slot_alloc(kmmscand_slot_cache);
+
+	if (!kmmscand_slot)
+		return;
+
+	now = jiffies;
+	kmmscand_slot->address = 0;
+	kmmscand_slot->scan_period = kmmscand_mm_scan_period_ms;
+	kmmscand_slot->scan_size = kmmscand_scan_size;
+	kmmscand_slot->next_scan = now +
+			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+	kmmscand_slot->scan_delta = 0;
+
+	slot = &kmmscand_slot->slot;
+
+	spin_lock(&kmmscand_mm_lock);
+	mm_slot_insert(kmmscand_slots_hash, mm, slot);
+
+	wakeup = list_empty(&kmmscand_scan.mm_head);
+	list_add_tail(&slot->mm_node, &kmmscand_scan.mm_head);
+	spin_unlock(&kmmscand_mm_lock);
+
+	mmgrab(mm);
+	trace_kmem_mm_enter(mm);
+	if (wakeup)
+		wake_up_interruptible(&kmmscand_wait);
+}
+
+void __kmmscand_exit(struct mm_struct *mm)
+{
+	struct kmmscand_mm_slot *mm_slot;
+	struct mm_slot *slot;
+	int free = 0, serialize = 1;
+
+	trace_kmem_mm_exit(mm);
+	spin_lock(&kmmscand_mm_lock);
+	slot = mm_slot_lookup(kmmscand_slots_hash, mm);
+	mm_slot = mm_slot_entry(slot, struct kmmscand_mm_slot, slot);
+	if (mm_slot && kmmscand_scan.mm_slot != mm_slot) {
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+		free = 1;
+	} else if (mm_slot && kmmscand_scan.mm_slot == mm_slot && !mm_slot->is_scanned) {
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+		free = 1;
+		/* TBD: Set the actual next slot */
+		kmmscand_scan.mm_slot = NULL;
+	} else if (mm_slot && kmmscand_scan.mm_slot == mm_slot && mm_slot->is_scanned) {
+		serialize = 0;
+	}
+
+	spin_unlock(&kmmscand_mm_lock);
+
+	if (serialize)
+		kmmscand_cleanup_migration_list(mm);
+
+	if (free) {
+		mm_slot_free(kmmscand_slot_cache, mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		mmap_write_lock(mm);
+		mmap_write_unlock(mm);
+	}
+}
+
+static int start_kmmscand(void)
+{
+	int err = 0;
+
+	guard(mutex)(&kmmscand_mutex);
+
+	/* Some one already succeeded in starting daemon */
+	if (kmmscand_thread)
+		goto end;
+
+	kmmscand_thread = kthread_run(kmmscand, NULL, "kmmscand");
+	if (IS_ERR(kmmscand_thread)) {
+		pr_err("kmmscand: kthread_run(kmmscand) failed\n");
+		err = PTR_ERR(kmmscand_thread);
+		kmmscand_thread = NULL;
+		goto end;
+	} else {
+		pr_info("kmmscand: Successfully started kmmscand");
+	}
+
+	if (!list_empty(&kmmscand_scan.mm_head))
+		wake_up_interruptible(&kmmscand_wait);
+
+end:
+	return err;
+}
+
+static int stop_kmmscand(void)
+{
+	int err = 0;
+
+	guard(mutex)(&kmmscand_mutex);
+
+	if (kmmscand_thread) {
+		kthread_stop(kmmscand_thread);
+		kmmscand_thread = NULL;
+	}
+
+	return err;
+}
+static int kmmmigrated(void *arg)
+{
+	for (;;) {
+		WRITE_ONCE(migrated_need_wakeup, false);
+		if (unlikely(kthread_should_stop()))
+			break;
+		if (kmmmigrated_has_work())
+			kmmscand_migrate_folio();
+		msleep(20);
+		kmmmigrated_wait_work();
+	}
+	return 0;
+}
+
+static int start_kmmmigrated(void)
+{
+	int err = 0;
+
+	guard(mutex)(&kmmmigrated_mutex);
+
+	/* Someone already succeeded in starting daemon */
+	if (kmmmigrated_thread)
+		goto end;
+
+	kmmmigrated_thread = kthread_run(kmmmigrated, NULL, "kmmmigrated");
+	if (IS_ERR(kmmmigrated_thread)) {
+		pr_err("kmmmigrated: kthread_run(kmmmigrated)  failed\n");
+		err = PTR_ERR(kmmmigrated_thread);
+		kmmmigrated_thread = NULL;
+		goto end;
+	} else {
+		pr_info("kmmmigrated: Successfully started kmmmigrated");
+	}
+
+	wake_up_interruptible(&kmmmigrated_wait);
+end:
+	return err;
+}
+
+static int stop_kmmmigrated(void)
+{
+	guard(mutex)(&kmmmigrated_mutex);
+	kthread_stop(kmmmigrated_thread);
+	return 0;
+}
+
+static void init_migration_list(void)
+{
+	INIT_LIST_HEAD(&kmmscand_migrate_list.migrate_head);
+	INIT_LIST_HEAD(&kmmscand_scanctrl.scan_list);
+	spin_lock_init(&kmmscand_migrate_lock);
+	init_waitqueue_head(&kmmscand_wait);
+	init_waitqueue_head(&kmmmigrated_wait);
+	init_scanctrl(&kmmscand_scanctrl);
+}
+
+static int __init kmmscand_init(void)
+{
+	int err;
+
+	kmmscand_slot_cache = KMEM_CACHE(kmmscand_mm_slot, 0);
+
+	if (!kmmscand_slot_cache) {
+		pr_err("kmmscand: kmem_cache error");
+		return -ENOMEM;
+	}
+
+	err = kmmscand_init_sysfs(&mm_kobj);
+
+	if (err)
+		goto err_init_sysfs;
+
+	init_migration_list();
+
+	err = start_kmmscand();
+	if (err)
+		goto err_kmmscand;
+
+	err = start_kmmmigrated();
+	if (err)
+		goto err_kmmmigrated;
+
+	return 0;
+
+err_kmmmigrated:
+	stop_kmmmigrated();
+
+err_kmmscand:
+	stop_kmmscand();
+err_init_sysfs:
+	kmmscand_destroy();
+
+	return err;
+}
+subsys_initcall(kmmscand_init);
diff --git a/mm/migrate.c b/mm/migrate.c
index fb19a18892c8..9d39abc7662a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2598,7 +2598,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which is crude.
  */
-static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 				   unsigned long nr_migrate_pages)
 {
 	int z;
@@ -2656,7 +2656,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		 * See folio_likely_mapped_shared() on possible imprecision
 		 * when we cannot easily detect if a folio is shared.
 		 */
-		if ((vma->vm_flags & VM_EXEC) &&
+		if ((vma && vma->vm_flags & VM_EXEC) &&
 		    folio_likely_mapped_shared(folio))
 			return -EACCES;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..cb21441969c5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1340,6 +1340,17 @@ const char * const vmstat_text[] = {
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
 #endif
+#ifdef CONFIG_KMMSCAND
+	"nr_kmmscand_mm_scans",
+	"nr_kmmscand_vma_scans",
+	"nr_kmmscand_migadded",
+	"nr_kmmscand_migrated",
+	"nr_kmmscand_migrate_failed",
+	"nr_kmmscand_kzalloc_fail",
+	"nr_kmmscand_slowtier",
+	"nr_kmmscand_toptier",
+	"nr_kmmscand_idlepage",
+#endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
 	"pgmigrate_fail",



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
                   ` (2 preceding siblings ...)
  2025-01-26  2:27 ` Huang, Ying
@ 2025-01-31 12:28 ` Jonathan Cameron
  2025-01-31 13:09   ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron
  2025-02-03  2:23   ` [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
  2025-04-07  3:13 ` Bharata B Rao
  4 siblings, 2 replies; 33+ messages in thread
From: Jonathan Cameron @ 2025-01-31 12:28 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd,
	ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen


> Here is the list of potential discussion points:
...

> 2. Possibility of maintaining single source of truth for page hotness that would
> maintain hot page information from multiple sources and let other sub-systems
> use that info.
Hi,

I was thinking of proposing a separate topic on a single source of hotness,
but this question covers it so I'll add some thoughts here instead.
I think we are very early, but sharing some experience and thoughts in a
session may be useful.

What do the other subsystems that want to use a single source of page hotness
want to be able to find out? (subject to filters like memory range, process etc)

A) How hot is page X?  
- Is this useful, or too much data? What would use it?
  * Application optimization maybe. Very handy for developing algorithms
    to do the rest of the options here as an Oracle!
- Provides both the cold and hot end of the scale, but maybe measurement
  techniques vary and can not be easily combined. Hard in general to combine
  multiple sources of truth if aiming for an absolute number.

B) Which pages are super hot?
- Probably these that make the most difference if they are in a slower memory tier.

C) Some pages are hot enough to consider moving?
- This may be good enough to get the key data into the fast memory over time.
- Can combine sources of info as being able to compare precise numbers doesn't matter.

D) Which pages are fairly cold?
- Likewise maybe good enough over time.

E) Which pages are very cold?
- Ideal case for tiering. Swap these with the super hot ones.
- Maybe extra signal for swap / zswap etc

F) Did these hot pages remain hot (and same for cold)
- This is needed to know when to back off doing things as we have unstable
  hotness (two phase applications are a pain for this), sampling a few
  pages may be fine.

Messy corners:

Temporal aspects.
- If only providing lists of hottest / coldest in last second, very hard
  to find those that are of a stable temperature. We end up moving
  very hot data (which is disruptive) and it doesn't stay hot.
- Can reduce that affect by long sampling windows on some measurement approaches
  (on hardware trackers that can trash accuracy due to resource exhaustion
   and other subtle effects).
- bistable / phase based applications are a pain but perhaps up to higher
  levels to back off.

My main interest is migrating in tiered systems but good to look at what
else would use a common layer.

Mostly I want to know something that is useful to move, and assume convergence
over the long term with the best things to move so to me the ideal layer has
following interface (strawman so shoot holes in it!):

1) Give me up to X hotish pages from a slow tier (greater than a specific measure
of temperature)
2) Give me X coldish pages a faster tier.
3) I expect to ask again in X seconds so please have some info ready for me!
4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this
    is bleeding the tiering application into a shared interface though).

If we have multiple subsystems using the data we will need to resolve their
conflicting demands to generate good enough data with appropriate overhead.

I'd also like a virtualized solution for case of hardware PA trackers (what
I have with CXL Hotness Monitoring Units) and classic memory pool / stranding
avoidance case where the VM is the right entity to make migration decisions.
Making that interface convey what the kernel is going to use would be an
efficient option. I'd like to hide how the sausage was made from the VM.

Jonathan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-01-31 12:28 ` Jonathan Cameron
@ 2025-01-31 13:09   ` Jonathan Cameron
  2025-02-05  6:24     ` Bharata B Rao
  2025-02-16  6:49     ` Huang, Ying
  2025-02-03  2:23   ` [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
  1 sibling, 2 replies; 33+ messages in thread
From: Jonathan Cameron @ 2025-01-31 13:09 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd,
	ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On Fri, 31 Jan 2025 12:28:03 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> > Here is the list of potential discussion points:  
> ...
> 
> > 2. Possibility of maintaining single source of truth for page hotness that would
> > maintain hot page information from multiple sources and let other sub-systems
> > use that info.  
> Hi,
> 
> I was thinking of proposing a separate topic on a single source of hotness,
> but this question covers it so I'll add some thoughts here instead.
> I think we are very early, but sharing some experience and thoughts in a
> session may be useful.

Thinking more on this over lunch, I think it is worth calling this out as a
potential session topic in it's own right rather than trying to find
time within other sessions.  Hence the title change.

I think a session would start with a brief listing of the temperature sources
we have and those on the horizon to motivate what we are unifying, then
discussion to focus on need for such a unification + requirements 
(maybe with a straw man).

> 
> What do the other subsystems that want to use a single source of page hotness
> want to be able to find out? (subject to filters like memory range, process etc)
> 
> A) How hot is page X?  
> - Is this useful, or too much data? What would use it?
>   * Application optimization maybe. Very handy for developing algorithms
>     to do the rest of the options here as an Oracle!
> - Provides both the cold and hot end of the scale, but maybe measurement
>   techniques vary and can not be easily combined. Hard in general to combine
>   multiple sources of truth if aiming for an absolute number.
> 
> B) Which pages are super hot?
> - Probably these that make the most difference if they are in a slower memory tier.
> 
> C) Some pages are hot enough to consider moving?
> - This may be good enough to get the key data into the fast memory over time.
> - Can combine sources of info as being able to compare precise numbers doesn't matter.
> 
> D) Which pages are fairly cold?
> - Likewise maybe good enough over time.
> 
> E) Which pages are very cold?
> - Ideal case for tiering. Swap these with the super hot ones.
> - Maybe extra signal for swap / zswap etc
> 
> F) Did these hot pages remain hot (and same for cold)
> - This is needed to know when to back off doing things as we have unstable
>   hotness (two phase applications are a pain for this), sampling a few
>   pages may be fine.
> 
> Messy corners:
> 
> Temporal aspects.
> - If only providing lists of hottest / coldest in last second, very hard
>   to find those that are of a stable temperature. We end up moving
>   very hot data (which is disruptive) and it doesn't stay hot.
> - Can reduce that affect by long sampling windows on some measurement approaches
>   (on hardware trackers that can trash accuracy due to resource exhaustion
>    and other subtle effects).
> - bistable / phase based applications are a pain but perhaps up to higher
>   levels to back off.
> 
> My main interest is migrating in tiered systems but good to look at what
> else would use a common layer.
> 
> Mostly I want to know something that is useful to move, and assume convergence
> over the long term with the best things to move so to me the ideal layer has
> following interface (strawman so shoot holes in it!):
> 
> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure
> of temperature)
> 2) Give me X coldish pages a faster tier.
> 3) I expect to ask again in X seconds so please have some info ready for me!
> 4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this
>     is bleeding the tiering application into a shared interface though).
> 
> If we have multiple subsystems using the data we will need to resolve their
> conflicting demands to generate good enough data with appropriate overhead.
> 
> I'd also like a virtualized solution for case of hardware PA trackers (what
> I have with CXL Hotness Monitoring Units) and classic memory pool / stranding
> avoidance case where the VM is the right entity to make migration decisions.
> Making that interface convey what the kernel is going to use would be an
> efficient option. I'd like to hide how the sausage was made from the VM.
> 
> Jonathan
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-01-31 13:09   ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron
@ 2025-02-05  6:24     ` Bharata B Rao
  2025-02-05 16:05       ` Johannes Weiner
                         ` (2 more replies)
  2025-02-16  6:49     ` Huang, Ying
  1 sibling, 3 replies; 33+ messages in thread
From: Bharata B Rao @ 2025-02-05  6:24 UTC (permalink / raw)
  To: Jonathan Cameron, Raghavendra K T
  Cc: linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd,
	ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On 31-Jan-25 6:39 PM, Jonathan Cameron wrote:
> On Fri, 31 Jan 2025 12:28:03 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
>>> Here is the list of potential discussion points:
>> ...
>>
>>> 2. Possibility of maintaining single source of truth for page hotness that would
>>> maintain hot page information from multiple sources and let other sub-systems
>>> use that info.
>> Hi,
>>
>> I was thinking of proposing a separate topic on a single source of hotness,
>> but this question covers it so I'll add some thoughts here instead.
>> I think we are very early, but sharing some experience and thoughts in a
>> session may be useful.
> 
> Thinking more on this over lunch, I think it is worth calling this out as a
> potential session topic in it's own right rather than trying to find
> time within other sessions.  Hence the title change.
> 
> I think a session would start with a brief listing of the temperature sources
> we have and those on the horizon to motivate what we are unifying, then
> discussion to focus on need for such a unification + requirements
> (maybe with a straw man).

Here is a compilation of available temperature sources and how the 
hot/access data is consumed by different subsystems:

PA-Physical address available
VA-Virtual address available
AA-Access time available
NA-accessing Node info available

I have left the slot blank for those which I am not sure about.
==================================================
Temperature		PA	VA	AA	NA
source
==================================================
PROT_NONE faults	Y	Y	Y	Y
--------------------------------------------------
folio_mark_accessed()	Y		Y	Y
--------------------------------------------------
PTE A bit		Y	Y	N	N
--------------------------------------------------
Platform hints		Y	Y	Y	Y
(AMD IBS)
--------------------------------------------------
Device hints		Y
(CXL HMU)
==================================================

And here is an attempt to compile how different subsystems
use the above data:
==============================================================
Source			Subsystem		Consumption
==============================================================
PROT_NONE faults	NUMAB		NUMAB=1 locality based
via process pgtable			balancing
walk					NUMAB=2 hot page
					promotion
==============================================================
folio_mark_accessed()	FS/filemap/GUP	LRU list activation
==============================================================
PTE A bit via		Reclaim:LRU	LRU list activation,	
rmap walk				deactivation/demotion
==============================================================
PTE A bit via		Reclaim:MGLRU	LRU list activation,	
rmap walk and process			deactivation/demotion
pgtable walk
==============================================================
PTE A bit via		DAMON		LRU activation,
rmap walk				hot page promotion,
					demotion etc
==============================================================
Platform hints		NUMAB		NUMAB=1 Locality based
(AMD IBS)				balancing and
					NUMAB=2 hot page
					promotion
==============================================================
Device hints		NUMAB		NUMAB=2 hot page
					promotion
==============================================================
The last two are listed as possibilities.

Feel free to correct/clarify and add more.

Regards,
Bharata.







^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-02-05  6:24     ` Bharata B Rao
@ 2025-02-05 16:05       ` Johannes Weiner
  2025-02-06  6:46         ` SeongJae Park
  2025-02-06 15:30         ` Jonathan Cameron
  2025-02-07  9:50       ` Matthew Wilcox
  2025-02-16  7:04       ` Huang, Ying
  2 siblings, 2 replies; 33+ messages in thread
From: Johannes Weiner @ 2025-02-05 16:05 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc,
	gourry, nehagholkar, abhishekd, ying.huang, nphamcs, feng.tang,
	kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman,
	vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote:
> On 31-Jan-25 6:39 PM, Jonathan Cameron wrote:
> > On Fri, 31 Jan 2025 12:28:03 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > 
> >>> Here is the list of potential discussion points:
> >> ...
> >>
> >>> 2. Possibility of maintaining single source of truth for page hotness that would
> >>> maintain hot page information from multiple sources and let other sub-systems
> >>> use that info.
> >> Hi,
> >>
> >> I was thinking of proposing a separate topic on a single source of hotness,
> >> but this question covers it so I'll add some thoughts here instead.
> >> I think we are very early, but sharing some experience and thoughts in a
> >> session may be useful.
> > 
> > Thinking more on this over lunch, I think it is worth calling this out as a
> > potential session topic in it's own right rather than trying to find
> > time within other sessions.  Hence the title change.
> > 
> > I think a session would start with a brief listing of the temperature sources
> > we have and those on the horizon to motivate what we are unifying, then
> > discussion to focus on need for such a unification + requirements
> > (maybe with a straw man).
> 
> Here is a compilation of available temperature sources and how the 
> hot/access data is consumed by different subsystems:

This is super useful, thanks for collecting this.

> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
> 
> I have left the slot blank for those which I am not sure about.
> ==================================================
> Temperature		PA	VA	AA	NA
> source
> ==================================================
> PROT_NONE faults	Y	Y	Y	Y
> --------------------------------------------------
> folio_mark_accessed()	Y		Y	Y
> --------------------------------------------------

For fma(), the VA info is available in unmap, but usually it isn't -
or doesn't meaningfully exist, as in the case of unmapped buffered IO.

I'd say it's an N.

> PTE A bit		Y	Y	N	N
> --------------------------------------------------
> Platform hints		Y	Y	Y	Y
> (AMD IBS)
> --------------------------------------------------
> Device hints		Y
> (CXL HMU)
> ==================================================

For the following table, it might be useful to add *when* the source
produces this information. Sampling frequency is a likely challenge:
consumers have different requirements, and overhead should be limited
to the minimum required to serve enabled consumers.

Here is an (incomplete) attempt - sorry about the long lines:

> And here is an attempt to compile how different subsystems
> use the above data:
> ==============================================================
> Source			Subsystem		Consumption         Activation/Frequency
> ==============================================================
> PROT_NONE faults	NUMAB		NUMAB=1 locality based              While task is running,
> via process pgtable			balancing                           rate varies on observed
> walk					NUMAB=2 hot page                    locality and sysctl knobs.
> 					promotion
> ==============================================================
> folio_mark_accessed()	FS/filemap/GUP	LRU list activation                 On cache access and unmap
> ==============================================================
> PTE A bit via		Reclaim:LRU	LRU list activation,	            During memory pressure
> rmap walk				deactivation/demotion
> ==============================================================
> PTE A bit via		Reclaim:MGLRU	LRU list activation,	            - During memory pressure
> rmap walk and process			deactivation/demotion               - Continuous sampling (configurable)
> pgtable walk                                                                for workingset reporting
> ==============================================================
> PTE A bit via		DAMON		LRU activation,                     Continuous sampling (configurable)?
> rmap walk				hot page promotion,                 (I believe SJ is looking into
> 					demotion etc                         auto-tuning this).
> ==============================================================
> Platform hints		NUMAB		NUMAB=1 Locality based
> (AMD IBS)				balancing and
> 					NUMAB=2 hot page
> 					promotion
> ==============================================================
> Device hints		NUMAB		NUMAB=2 hot page
> 					promotion
> ==============================================================
> The last two are listed as possibilities.
> 
> Feel free to correct/clarify and add more.
> 
> Regards,
> Bharata.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-02-05 16:05       ` Johannes Weiner
@ 2025-02-06  6:46         ` SeongJae Park
  2025-02-06 15:30         ` Jonathan Cameron
  1 sibling, 0 replies; 33+ messages in thread
From: SeongJae Park @ 2025-02-06  6:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: SeongJae Park, Bharata B Rao, Jonathan Cameron, Raghavendra K T,
	linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd,
	ying.huang, nphamcs, feng.tang, kbusch, Hasan.Maruf, david,
	willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301,
	liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen

On Wed, 5 Feb 2025 11:05:29 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote:
> > On 31-Jan-25 6:39 PM, Jonathan Cameron wrote:
> > > On Fri, 31 Jan 2025 12:28:03 +0000
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
[...]
> > Here is a compilation of available temperature sources and how the 
> > hot/access data is consumed by different subsystems:
> 
> This is super useful, thanks for collecting this.

Indeed.  Thank you Bharata!

[...]
> For the following table, it might be useful to add *when* the source
> produces this information.  Sampling frequency is a likely challenge:
> consumers have different requirements, and overhead should be limited
> to the minimum required to serve enabled consumers.

+1

> 
> Here is an (incomplete) attempt - sorry about the long lines:
> 
> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==============================================================
> > Source			Subsystem		Consumption         Activation/Frequency
[...]
> > ==============================================================
> > PTE A bit via		DAMON		LRU activation,                     Continuous sampling (configurable)?
> > rmap walk				hot page promotion,                 (I believe SJ is looking into
> > 					demotion etc                         auto-tuning this).

You're right.  I'm working on auto-tuning of the sampling/aggregation intervals
of DAMON based on its tuning guide theory[1].  Hopefully I will be able to post
an RFC patch series within a couple of weeks.

> > ==============================================================
> > Platform hints		NUMAB		NUMAB=1 Locality based
> > (AMD IBS)				balancing and
> > 					NUMAB=2 hot page
> > 					promotion
> > ==============================================================
> > Device hints		NUMAB		NUMAB=2 hot page
> > 					promotion
> > ==============================================================
> > The last two are listed as possibilities.

I'm also trying to extend DAMON to use PROT_NONE faults and AMD IBS like access
check sources.  Hopefully I will share more details of the plan and experiment
results for the PROT_NONE faults extension by LSFMMBPF.

[1] https://lore.kernel.org/20241202175459.2005526-1-sj@kernel.org


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-02-05 16:05       ` Johannes Weiner
  2025-02-06  6:46         ` SeongJae Park
@ 2025-02-06 15:30         ` Jonathan Cameron
  1 sibling, 0 replies; 33+ messages in thread
From: Jonathan Cameron @ 2025-02-06 15:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Bharata B Rao, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry,
	nehagholkar, abhishekd, ying.huang, nphamcs, feng.tang, kbusch,
	Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On Wed, 5 Feb 2025 11:05:29 -0500
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote:
> > On 31-Jan-25 6:39 PM, Jonathan Cameron wrote:  
> > > On Fri, 31 Jan 2025 12:28:03 +0000
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >   
> > >>> Here is the list of potential discussion points:  
> > >> ...
> > >>  
> > >>> 2. Possibility of maintaining single source of truth for page hotness that would
> > >>> maintain hot page information from multiple sources and let other sub-systems
> > >>> use that info.  
> > >> Hi,
> > >>
> > >> I was thinking of proposing a separate topic on a single source of hotness,
> > >> but this question covers it so I'll add some thoughts here instead.
> > >> I think we are very early, but sharing some experience and thoughts in a
> > >> session may be useful.  
> > > 
> > > Thinking more on this over lunch, I think it is worth calling this out as a
> > > potential session topic in it's own right rather than trying to find
> > > time within other sessions.  Hence the title change.
> > > 
> > > I think a session would start with a brief listing of the temperature sources
> > > we have and those on the horizon to motivate what we are unifying, then
> > > discussion to focus on need for such a unification + requirements
> > > (maybe with a straw man).  
> > 
> > Here is a compilation of available temperature sources and how the 
> > hot/access data is consumed by different subsystems:  
> 
> This is super useful, thanks for collecting this.

Absolutely agree!

> 
> > PA-Physical address available
> > VA-Virtual address available
> > AA-Access time available
> > NA-accessing Node info available
> > 
> > I have left the slot blank for those which I am not sure about.
> > ==================================================
> > Temperature		PA	VA	AA	NA
> > source
> > ==================================================
> > PROT_NONE faults	Y	Y	Y	Y
> > --------------------------------------------------
> > folio_mark_accessed()	Y		Y	Y
> > --------------------------------------------------  
> 
> For fma(), the VA info is available in unmap, but usually it isn't -
> or doesn't meaningfully exist, as in the case of unmapped buffered IO.
> 
> I'd say it's an N.
> 
> > PTE A bit		Y	Y	N	N
> > --------------------------------------------------
> > Platform hints		Y	Y	Y	Y
> > (AMD IBS)
> > --------------------------------------------------
> > Device hints		Y
> > (CXL HMU)
> > ==================================================  

For the use cases where we have relatively few 'pages' the cost of a reverse
map look up doesn't look to be a problem.  Trick is to do it
only after we've done what we can in PA space to cut down on the
pages of interest. So maybe (Y) to reflect that it is indirect.
Whether it makes sense to do that before or after some common
layer is an interesting question.  That PA/VA mapping might be
out of date anyway by the time we see the data.

> 
> For the following table, it might be useful to add *when* the source
> produces this information. Sampling frequency is a likely challenge:
> consumers have different requirements, and overhead should be limited
> to the minimum required to serve enabled consumers.
> 
> Here is an (incomplete) attempt - sorry about the long lines:
> 
> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==============================================================
> > Source			Subsystem		Consumption         Activation/Frequency
> > ==============================================================
> > PROT_NONE faults	NUMAB		NUMAB=1 locality based              While task is running,
> > via process pgtable			balancing                           rate varies on observed
> > walk					NUMAB=2 hot page                    locality and sysctl knobs.
> > 					promotion
> > ==============================================================
> > folio_mark_accessed()	FS/filemap/GUP	LRU list activation                 On cache access and unmap
> > ==============================================================
> > PTE A bit via		Reclaim:LRU	LRU list activation,	            During memory pressure
> > rmap walk				deactivation/demotion
> > ==============================================================
> > PTE A bit via		Reclaim:MGLRU	LRU list activation,	            - During memory pressure
> > rmap walk and process			deactivation/demotion               - Continuous sampling (configurable)
> > pgtable walk                                                                for workingset reporting
> > ==============================================================
> > PTE A bit via		DAMON		LRU activation,                     Continuous sampling (configurable)?
> > rmap walk				hot page promotion,                 (I believe SJ is looking into
> > 					demotion etc                         auto-tuning this).
> > ==============================================================
> > Platform hints		NUMAB		NUMAB=1 Locality based
> > (AMD IBS)				balancing and
> > 					NUMAB=2 hot page
> > 					promotion
> > ==============================================================
Based on the CXL one...

> > Device hints		NUMAB		NUMAB=2 hot page       Continuous sampling, frequency controllable.
> > 					promotion                      Subsampling programable.
> > ==============================================================
> > The last two are listed as possibilities.
> > 
> > Feel free to correct/clarify and add more.

The above covers what the use cases require. Maybe we need to do similar
for the controls needed the other way (frequency already covered)

Filtering.
* Process ID
* Address range (PA / VA)
* Access type (read vs write) may matter for migration cost.

Also frequency is more nuanced perhaps:
- How often to give data (timeliness)
- How much data to give (bandwidth)
- When don't I care (threshold)
- How precise do I want it to be (subsampling etc)

The layering is clearly to be complex, so maybe addressing each
use case for what info that needs would be helpful?

The following is probably too simplistic.

==================================================================
Usecase       Nature of data
==================================================================
NUMAB =1      Enough hot pages with remote source.
Balancing
==================================================================
NUMAB =2      Enough hot pages in slow memory
Tiering
Promotion
==================================================================
NUMAB = 2     Enough cold pages in fast memory
Tiering
Demotion
===================================================================
LRU list      Specific pages of interest accessed
activation
===================================================================
LRU list      Enough cold pages?
deactivation
====================================================================


Jonathan
> > 
> > Regards,
> > Bharata.  



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-02-05  6:24     ` Bharata B Rao
  2025-02-05 16:05       ` Johannes Weiner
@ 2025-02-07  9:50       ` Matthew Wilcox
  2025-02-16  7:04       ` Huang, Ying
  2 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2025-02-07  9:50 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc,
	gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes,
	feng.tang, kbusch, Hasan.Maruf, sj, david, k.shutemov, mgorman,
	vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote:
> Here is a compilation of available temperature sources and how the
> hot/access data is consumed by different subsystems:
> 
> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
> 
> I have left the slot blank for those which I am not sure about.
> ==================================================
> Temperature		PA	VA	AA	NA
> source
> ==================================================
> PROT_NONE faults	Y	Y	Y	Y
> --------------------------------------------------
> folio_mark_accessed()	Y		Y	Y
> --------------------------------------------------
> PTE A bit		Y	Y	N	N
> --------------------------------------------------
> Platform hints		Y	Y	Y	Y
> (AMD IBS)
> --------------------------------------------------
> Device hints		Y
> (CXL HMU)
> ==================================================
> 
> And here is an attempt to compile how different subsystems
> use the above data:
> ==============================================================
> Source			Subsystem		Consumption
> ==============================================================
> PROT_NONE faults	NUMAB		NUMAB=1 locality based
> via process pgtable			balancing
> walk					NUMAB=2 hot page
> 					promotion
> ==============================================================
> folio_mark_accessed()	FS/filemap/GUP	LRU list activation
> ==============================================================
> PTE A bit via		Reclaim:LRU	LRU list activation,	
> rmap walk				deactivation/demotion
> ==============================================================
> PTE A bit via		Reclaim:MGLRU	LRU list activation,	
> rmap walk and process			deactivation/demotion
> pgtable walk
> ==============================================================
> PTE A bit via		DAMON		LRU activation,
> rmap walk				hot page promotion,
> 					demotion etc
> ==============================================================
> Platform hints		NUMAB		NUMAB=1 Locality based
> (AMD IBS)				balancing and
> 					NUMAB=2 hot page
> 					promotion
> ==============================================================
> Device hints		NUMAB		NUMAB=2 hot page
> 					promotion
> ==============================================================
> The last two are listed as possibilities.
> 
> Feel free to correct/clarify and add more.

There's PG_young / PG_idle as well.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-02-05  6:24     ` Bharata B Rao
  2025-02-05 16:05       ` Johannes Weiner
  2025-02-07  9:50       ` Matthew Wilcox
@ 2025-02-16  7:04       ` Huang, Ying
  2 siblings, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2025-02-16  7:04 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc,
	gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang,
	kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman,
	vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen, yuanchu

Hi, Bharata,

Bharata B Rao <bharata@amd.com> writes:

> On 31-Jan-25 6:39 PM, Jonathan Cameron wrote:
>> On Fri, 31 Jan 2025 12:28:03 +0000
>> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>> 
>>>> Here is the list of potential discussion points:
>>> ...
>>>
>>>> 2. Possibility of maintaining single source of truth for page hotness that would
>>>> maintain hot page information from multiple sources and let other sub-systems
>>>> use that info.
>>> Hi,
>>>
>>> I was thinking of proposing a separate topic on a single source of hotness,
>>> but this question covers it so I'll add some thoughts here instead.
>>> I think we are very early, but sharing some experience and thoughts in a
>>> session may be useful.
>> Thinking more on this over lunch, I think it is worth calling this
>> out as a
>> potential session topic in it's own right rather than trying to find
>> time within other sessions.  Hence the title change.
>> I think a session would start with a brief listing of the
>> temperature sources
>> we have and those on the horizon to motivate what we are unifying, then
>> discussion to focus on need for such a unification + requirements
>> (maybe with a straw man).
>
> Here is a compilation of available temperature sources and how the
> hot/access data is consumed by different subsystems:

Thanks for your information!

> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
>
> I have left the slot blank for those which I am not sure about.
> ==================================================
> Temperature		PA	VA	AA	NA
> source
> ==================================================
> PROT_NONE faults	Y	Y	Y	Y
> --------------------------------------------------
> folio_mark_accessed()	Y		Y	Y
> --------------------------------------------------
> PTE A bit		Y	Y	N	N

We can get some coarse-grained AA from PTE A bit scanning.  That is, the
page is accessed at least once between two rounds of scanning.  The AA
is less the scanning interval.  IIUC, the similar information is
available in Yuanchu's MGLRU periodic aging series [1].

[1] https://lore.kernel.org/all/20221214225123.2770216-1-yuanchu@google.com/

> --------------------------------------------------
> Platform hints		Y	Y	Y	Y
> (AMD IBS)
> --------------------------------------------------
> Device hints		Y
> (CXL HMU)
> ==================================================
>
> And here is an attempt to compile how different subsystems
> use the above data:
> ==============================================================
> Source			Subsystem		Consumption
> ==============================================================
> PROT_NONE faults	NUMAB		NUMAB=1 locality based
> via process pgtable			balancing
> walk					NUMAB=2 hot page
> 					promotion
> ==============================================================
> folio_mark_accessed()	FS/filemap/GUP	LRU list activation

IIUC, Gregory is working on a patchset to promote unmapped file cache
pages via folio_mark_accessed().

> ==============================================================
> PTE A bit via		Reclaim:LRU	LRU list activation,	
> rmap walk				deactivation/demotion
> ==============================================================
> PTE A bit via		Reclaim:MGLRU	LRU list activation,	
> rmap walk and process			deactivation/demotion
> pgtable walk
> ==============================================================
> PTE A bit via		DAMON		LRU activation,
> rmap walk				hot page promotion,
> 					demotion etc
> ==============================================================
> Platform hints		NUMAB		NUMAB=1 Locality based
> (AMD IBS)				balancing and
> 					NUMAB=2 hot page
> 					promotion
> ==============================================================
> Device hints		NUMAB		NUMAB=2 hot page
> 					promotion
> ==============================================================
> The last two are listed as possibilities.
>
> Feel free to correct/clarify and add more.

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-01-31 13:09   ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron
  2025-02-05  6:24     ` Bharata B Rao
@ 2025-02-16  6:49     ` Huang, Ying
  2025-02-17  4:10       ` Bharata B Rao
  2025-03-14 14:24       ` Jonathan Cameron
  1 sibling, 2 replies; 33+ messages in thread
From: Huang, Ying @ 2025-02-16  6:49 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, bharata, gourry,
	nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch,
	Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

Hi, Jonathan,

Sorry for late reply.

Jonathan Cameron <Jonathan.Cameron@huawei.com> writes:

> On Fri, 31 Jan 2025 12:28:03 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
>> > Here is the list of potential discussion points:  
>> ...
>> 
>> > 2. Possibility of maintaining single source of truth for page hotness that would
>> > maintain hot page information from multiple sources and let other sub-systems
>> > use that info.  
>> Hi,
>> 
>> I was thinking of proposing a separate topic on a single source of hotness,
>> but this question covers it so I'll add some thoughts here instead.
>> I think we are very early, but sharing some experience and thoughts in a
>> session may be useful.
>
> Thinking more on this over lunch, I think it is worth calling this out as a
> potential session topic in it's own right rather than trying to find
> time within other sessions.  Hence the title change.
>
> I think a session would start with a brief listing of the temperature sources
> we have and those on the horizon to motivate what we are unifying, then
> discussion to focus on need for such a unification + requirements 
> (maybe with a straw man).
>
>> 
>> What do the other subsystems that want to use a single source of page hotness
>> want to be able to find out? (subject to filters like memory range, process etc)
>> 
>> A) How hot is page X?  
>> - Is this useful, or too much data? What would use it?
>>   * Application optimization maybe. Very handy for developing algorithms
>>     to do the rest of the options here as an Oracle!
>> - Provides both the cold and hot end of the scale, but maybe measurement
>>   techniques vary and can not be easily combined. Hard in general to combine
>>   multiple sources of truth if aiming for an absolute number.
>> 
>> B) Which pages are super hot?
>> - Probably these that make the most difference if they are in a slower memory tier.
>> 
>> C) Some pages are hot enough to consider moving?
>> - This may be good enough to get the key data into the fast memory over time.
>> - Can combine sources of info as being able to compare precise numbers doesn't matter.
>> 
>> D) Which pages are fairly cold?
>> - Likewise maybe good enough over time.
>> 
>> E) Which pages are very cold?
>> - Ideal case for tiering. Swap these with the super hot ones.
>> - Maybe extra signal for swap / zswap etc
>> 
>> F) Did these hot pages remain hot (and same for cold)
>> - This is needed to know when to back off doing things as we have unstable
>>   hotness (two phase applications are a pain for this), sampling a few
>>   pages may be fine.
>> 
>> Messy corners:
>> 
>> Temporal aspects.
>> - If only providing lists of hottest / coldest in last second, very hard
>>   to find those that are of a stable temperature. We end up moving
>>   very hot data (which is disruptive) and it doesn't stay hot.
>> - Can reduce that affect by long sampling windows on some measurement approaches
>>   (on hardware trackers that can trash accuracy due to resource exhaustion
>>    and other subtle effects).
>> - bistable / phase based applications are a pain but perhaps up to higher
>>   levels to back off.
>> 
>> My main interest is migrating in tiered systems but good to look at what
>> else would use a common layer.
>> 
>> Mostly I want to know something that is useful to move, and assume convergence
>> over the long term with the best things to move so to me the ideal layer has
>> following interface (strawman so shoot holes in it!):
>> 
>> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure
>> of temperature)

Because the hot pages may be available upon page accessing (such PROT_NONE
page fault), the interface may be "push" style instead of "pull" style,
e.g.,

int register_hot_page_handler(void (*handler)(struct page *hot_page, int temperature));

>> 2) Give me X coldish pages a faster tier.
>> 3) I expect to ask again in X seconds so please have some info ready for me!
>> 4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this
>>     is bleeding the tiering application into a shared interface though).

In addition to get a list hot/cold pages, it's also useful to get
hot/cold statistics of a memory device (NUMA node), e.g., something like
below,

Access frequency        percent
   > 1000 HZ            10%
 600-1000 HZ            20%
 200- 600 HZ            50%
   1- 200 HZ            15%
      < 1 HZ             5%

Compared with hot/cold pages list, this may be gotten with lower
overhead and can be useful to tune the promotion/demotion alrogithm.  At
the same time, a sampled (incomplete) list of hot/cold page list may be
available too.

>> If we have multiple subsystems using the data we will need to resolve their
>> conflicting demands to generate good enough data with appropriate overhead.
>> 
>> I'd also like a virtualized solution for case of hardware PA trackers (what
>> I have with CXL Hotness Monitoring Units) and classic memory pool / stranding
>> avoidance case where the VM is the right entity to make migration decisions.
>> Making that interface convey what the kernel is going to use would be an
>> efficient option. I'd like to hide how the sausage was made from the VM.

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-02-16  6:49     ` Huang, Ying
@ 2025-02-17  4:10       ` Bharata B Rao
  2025-02-17  8:06         ` Huang, Ying
  2025-03-14 14:24       ` Jonathan Cameron
  1 sibling, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2025-02-17  4:10 UTC (permalink / raw)
  To: Huang, Ying, Jonathan Cameron
  Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar,
	abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On 16-Feb-25 12:19 PM, Huang, Ying wrote:
>>>
>>> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure
>>> of temperature)
> 
> Because the hot pages may be available upon page accessing (such PROT_NONE
> page fault), the interface may be "push" style instead of "pull" style,
> e.g.,
> 
> int register_hot_page_handler(void (*handler)(struct page *hot_page, int temperature));

Yes, push model appears natural to me given that there are producers who 
are themselves consumers as well.

Let's take an example of access being detected by PTE scan by DAMON 
first and LRU and hot page promotion subsystems have registered handlers 
for hot page info.

Now if hot page promotion handler gets called first and if it promotes 
the page, calling LRU registered handler still makes sense? May be not I 
suppose.

On the other hand if LRU subsystem handler gets first and it 
adjusts/modifies the hot page's list, it would still make sense to 
activate the hot page promotion handler to check for possible promotion.

Is this how you are envisioning the different consumers of hot page 
access info could work/cooperate?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-02-17  4:10       ` Bharata B Rao
@ 2025-02-17  8:06         ` Huang, Ying
  0 siblings, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2025-02-17  8:06 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc,
	gourry, nehagholkar, abhishekd, nphamcs, hannes, kbusch,
	Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen, feng.tang

Bharata B Rao <bharata@amd.com> writes:

> On 16-Feb-25 12:19 PM, Huang, Ying wrote:
>>>>
>>>> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure
>>>> of temperature)
>> Because the hot pages may be available upon page accessing (such
>> PROT_NONE
>> page fault), the interface may be "push" style instead of "pull" style,
>> e.g.,
>> int register_hot_page_handler(void (*handler)(struct page *hot_page,
>> int temperature));
>
> Yes, push model appears natural to me given that there are producers
> who are themselves consumers as well.
>
> Let's take an example of access being detected by PTE scan by DAMON
> first and LRU and hot page promotion subsystems have registered
> handlers for hot page info.
>
> Now if hot page promotion handler gets called first and if it promotes
> the page, calling LRU registered handler still makes sense? May be not
> I suppose.
>
> On the other hand if LRU subsystem handler gets first and it
> adjusts/modifies the hot page's list, it would still make sense to
> activate the hot page promotion handler to check for possible
> promotion.
>
> Is this how you are envisioning the different consumers of hot page
> access info could work/cooperate?

Sorry, I have no idea about what is the right behavior now.  It appears
hard to coordinate different consumers.

In theory, we can promote the hottest pages while activate (in LRU
lists) the warm pages.

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-02-16  6:49     ` Huang, Ying
  2025-02-17  4:10       ` Bharata B Rao
@ 2025-03-14 14:24       ` Jonathan Cameron
  2025-03-17 22:34         ` Davidlohr Bueso
  1 sibling, 1 reply; 33+ messages in thread
From: Jonathan Cameron @ 2025-03-14 14:24 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, bharata, gourry,
	nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch,
	Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka,
	hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On Sun, 16 Feb 2025 14:49:50 +0800
"Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> Hi, Jonathan,
> 
> Sorry for late reply.

Sorry for even later reply!

> 
> Jonathan Cameron <Jonathan.Cameron@huawei.com> writes:
> 
> > On Fri, 31 Jan 2025 12:28:03 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >  
> >> > Here is the list of potential discussion points:    
> >> ...
> >>   
> >> > 2. Possibility of maintaining single source of truth for page hotness that would
> >> > maintain hot page information from multiple sources and let other sub-systems
> >> > use that info.    
> >> Hi,
> >> 
> >> I was thinking of proposing a separate topic on a single source of hotness,
> >> but this question covers it so I'll add some thoughts here instead.
> >> I think we are very early, but sharing some experience and thoughts in a
> >> session may be useful.  
> >
> > Thinking more on this over lunch, I think it is worth calling this out as a
> > potential session topic in it's own right rather than trying to find
> > time within other sessions.  Hence the title change.
> >
> > I think a session would start with a brief listing of the temperature sources
> > we have and those on the horizon to motivate what we are unifying, then
> > discussion to focus on need for such a unification + requirements 
> > (maybe with a straw man).
> >  
> >> 
> >> What do the other subsystems that want to use a single source of page hotness
> >> want to be able to find out? (subject to filters like memory range, process etc)
> >> 
> >> A) How hot is page X?  
> >> - Is this useful, or too much data? What would use it?
> >>   * Application optimization maybe. Very handy for developing algorithms
> >>     to do the rest of the options here as an Oracle!
> >> - Provides both the cold and hot end of the scale, but maybe measurement
> >>   techniques vary and can not be easily combined. Hard in general to combine
> >>   multiple sources of truth if aiming for an absolute number.
> >> 
> >> B) Which pages are super hot?
> >> - Probably these that make the most difference if they are in a slower memory tier.
> >> 
> >> C) Some pages are hot enough to consider moving?
> >> - This may be good enough to get the key data into the fast memory over time.
> >> - Can combine sources of info as being able to compare precise numbers doesn't matter.
> >> 
> >> D) Which pages are fairly cold?
> >> - Likewise maybe good enough over time.
> >> 
> >> E) Which pages are very cold?
> >> - Ideal case for tiering. Swap these with the super hot ones.
> >> - Maybe extra signal for swap / zswap etc
> >> 
> >> F) Did these hot pages remain hot (and same for cold)
> >> - This is needed to know when to back off doing things as we have unstable
> >>   hotness (two phase applications are a pain for this), sampling a few
> >>   pages may be fine.
> >> 
> >> Messy corners:
> >> 
> >> Temporal aspects.
> >> - If only providing lists of hottest / coldest in last second, very hard
> >>   to find those that are of a stable temperature. We end up moving
> >>   very hot data (which is disruptive) and it doesn't stay hot.
> >> - Can reduce that affect by long sampling windows on some measurement approaches
> >>   (on hardware trackers that can trash accuracy due to resource exhaustion
> >>    and other subtle effects).
> >> - bistable / phase based applications are a pain but perhaps up to higher
> >>   levels to back off.
> >> 
> >> My main interest is migrating in tiered systems but good to look at what
> >> else would use a common layer.
> >> 
> >> Mostly I want to know something that is useful to move, and assume convergence
> >> over the long term with the best things to move so to me the ideal layer has
> >> following interface (strawman so shoot holes in it!):
> >> 
> >> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure
> >> of temperature)  
> 
> Because the hot pages may be available upon page accessing (such PROT_NONE
> page fault), the interface may be "push" style instead of "pull" style,
> e.g.,

Absolutely agree that might be the approach, but with some form of back pressure
as for at least some approaches it is much cheaper to find a find a few hot
pages than to find lots of them.  More complex if you want a few of the very hottest
or just hotter than X. 

> 
> int register_hot_page_handler(void (*handler)(struct page *hot_page, int temperature));
> 
> >> 2) Give me X coldish pages a faster tier.
> >> 3) I expect to ask again in X seconds so please have some info ready for me!
> >> 4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this
> >>     is bleeding the tiering application into a shared interface though).  
> 
> In addition to get a list hot/cold pages, it's also useful to get
> hot/cold statistics of a memory device (NUMA node), e.g., something like
> below,
> 
> Access frequency        percent
>    > 1000 HZ            10%  
>  600-1000 HZ            20%
>  200- 600 HZ            50%
>    1- 200 HZ            15%
>       < 1 HZ             5%
> 
> Compared with hot/cold pages list, this may be gotten with lower
> overhead and can be useful to tune the promotion/demotion alrogithm.  At
> the same time, a sampled (incomplete) list of hot/cold page list may be
> available too.

I agree it's useful info and 'might' be cheaper to get.  Depends on the
tracking solution and impacts of sampling approaches.

> 
> >> If we have multiple subsystems using the data we will need to resolve their
> >> conflicting demands to generate good enough data with appropriate overhead.
> >> 
> >> I'd also like a virtualized solution for case of hardware PA trackers (what
> >> I have with CXL Hotness Monitoring Units) and classic memory pool / stranding
> >> avoidance case where the VM is the right entity to make migration decisions.
> >> Making that interface convey what the kernel is going to use would be an
> >> efficient option. I'd like to hide how the sausage was made from the VM.  
> 
> ---
> Best Regards,
> Huang, Ying



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-03-14 14:24       ` Jonathan Cameron
@ 2025-03-17 22:34         ` Davidlohr Bueso
  0 siblings, 0 replies; 33+ messages in thread
From: Davidlohr Bueso @ 2025-03-17 22:34 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Huang, Ying, Raghavendra K T, linux-mm, akpm, lsf-pc, bharata,
	gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang,
	kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman,
	vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo,
	nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil,
	linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel,
	weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen

On Fri, 14 Mar 2025, Jonathan Cameron wrote:

>On Sun, 16 Feb 2025 14:49:50 +0800
>"Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

>> Because the hot pages may be available upon page accessing (such PROT_NONE
>> page fault), the interface may be "push" style instead of "pull" style,
>> e.g.,
>

Right, I was also thinking along those lines. Hot pages could be fed right into
kpromoted (with the appropriate interface for 'phi' of course), then kicked to
do the migration. This already has the frequency, and the destination node so
no guessing as to where the page should be placed.

So this makes me wonder kmmscand vs NUMAB=2... should both co-exist? Doubling
the scanning overhead, so I think not (albeit non mapped page cache pages).
The original data from kmmscand is with a busted nid selection, but now Raghu
has a proposed some heuristics, so I am curious what kind of numbers come up
in terms of accuracy and performance vs a NUMAB=2 migration offload.

>Absolutely agree that might be the approach, but with some form of back pressure
>as for at least some approaches it is much cheaper to find a find a few hot
>pages than to find lots of them.  More complex if you want a few of the very hottest
>or just hotter than X.

Yeah, also cases like different CXL type3 devices with different access latencies
both saying here's what's hot.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-31 12:28 ` Jonathan Cameron
  2025-01-31 13:09   ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron
@ 2025-02-03  2:23   ` Raghavendra K T
  1 sibling, 0 replies; 33+ messages in thread
From: Raghavendra K T @ 2025-02-03  2:23 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd,
	ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj,
	david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg,
	ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel,
	jon.grimm, santosh.shukla, Michael.Day, riel, weixugc,
	leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt,
	dave.hansen



On 1/31/2025 5:58 PM, Jonathan Cameron wrote:
> 
>> Here is the list of potential discussion points:
> ...
> 
>> 2. Possibility of maintaining single source of truth for page hotness that would
>> maintain hot page information from multiple sources and let other sub-systems
>> use that info.
> Hi,
> 
> I was thinking of proposing a separate topic on a single source of hotness,
> but this question covers it so I'll add some thoughts here instead.
> I think we are very early, but sharing some experience and thoughts in a
> session may be useful.
> 
> What do the other subsystems that want to use a single source of page hotness
> want to be able to find out? (subject to filters like memory range, process etc)
> 
> A) How hot is page X?
> - Is this useful, or too much data? What would use it?
>    * Application optimization maybe. Very handy for developing algorithms
>      to do the rest of the options here as an Oracle!
> - Provides both the cold and hot end of the scale, but maybe measurement
>    techniques vary and can not be easily combined. Hard in general to combine
>    multiple sources of truth if aiming for an absolute number.>
> B) Which pages are super hot?
> - Probably these that make the most difference if they are in a slower memory tier.
> 
> C) Some pages are hot enough to consider moving?
> - This may be good enough to get the key data into the fast memory over time.
> - Can combine sources of info as being able to compare precise numbers doesn't matter.
> 
> D) Which pages are fairly cold?
> - Likewise maybe good enough over time.
> 
> E) Which pages are very cold?
> - Ideal case for tiering. Swap these with the super hot ones.
> - Maybe extra signal for swap / zswap etc
> 
> F) Did these hot pages remain hot (and same for cold)
> - This is needed to know when to back off doing things as we have unstable
>    hotness (two phase applications are a pain for this), sampling a few
>    pages may be fine.
> 
> Messy corners:
> 
> Temporal aspects.
> - If only providing lists of hottest / coldest in last second, very hard
>    to find those that are of a stable temperature. We end up moving
>    very hot data (which is disruptive) and it doesn't stay hot.
> - Can reduce that affect by long sampling windows on some measurement approaches
>    (on hardware trackers that can trash accuracy due to resource exhaustion
>     and other subtle effects).
> - bistable / phase based applications are a pain but perhaps up to higher
>    levels to back off.
> 
> My main interest is migrating in tiered systems but good to look at what
> else would use a common layer.
> > Mostly I want to know something that is useful to move, and assume 
convergence
> over the long term with the best things to move so to me the ideal layer has
> following interface (strawman so shoot holes in it!):
> 
> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure
> of temperature)
> 2) Give me X coldish pages a faster tier.
> 3) I expect to ask again in X seconds so please have some info ready for me!> 4) (a path to get an idea of 'unhelpful moves' from earlier 
iterations - this
>      is bleeding the tiering application into a shared interface though).
> 

Hello Jonathan,
Thank you for listing all these points in detail.

Agree with you all the points in general.

Very hard/tricky to find balance. for. e.g. if we are slow in finding
hot pages with more accuracy, we need to store more information, on the
other hand prematurely moving pages may result in ping ponging.

So may be we have to balance with moderately accurate information, but
at the same time optimizing such that we avoid redundant scans from
everybody.

Thinking loud again, apart from slow-tier optimization potentials listed
above, I also hope that we can provide necessary information for NUMAB=1
case, to get it to know more about hot VMA's (Mel had pointed that 
identifying hot VMA's helps scanning long back).

> If we have multiple subsystems using the data we will need to resolve their
> conflicting demands to generate good enough data with appropriate overhead.
> > I'd also like a virtualized solution for case of hardware PA trackers 
(what
> I have with CXL Hotness Monitoring Units) and classic memory pool / stranding
> avoidance case where the VM is the right entity to make migration decisions.
> Making that interface convey what the kernel is going to use would be an
> efficient option. I'd like to hide how the sausage was made from the VM.
> 

Thanks and Regards
- Raghu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
  2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
                   ` (3 preceding siblings ...)
  2025-01-31 12:28 ` Jonathan Cameron
@ 2025-04-07  3:13 ` Bharata B Rao
  4 siblings, 0 replies; 33+ messages in thread
From: Bharata B Rao @ 2025-04-07  3:13 UTC (permalink / raw)
  To: Raghavendra K T, linux-mm, akpm, lsf-pc
  Cc: gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes,
	feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov,
	mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett,
	peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
	AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
	santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
	leillc, kmanaouil.dev, rppt, dave.hansen

On 23-Jan-25 4:27 PM, Raghavendra K T wrote:
> Bharata and I would like to propose the following topic for LSFMM.
> 
> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.

Slides that were used during LSFMM discussion -

https://docs.google.com/presentation/d/1zLyGriEyky_HLJPrrdKdhAS7h5oiGf4tIdGuhGX3fJ8/edit?usp=sharing

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-04-07  3:14 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
2025-01-23 18:20 ` SeongJae Park
2025-01-24  8:54   ` Raghavendra K T
2025-01-24 18:05     ` Jonathan Cameron
2025-01-24  5:53 ` Hyeonggon Yoo
2025-01-24  9:02   ` Raghavendra K T
2025-01-27  7:01     ` David Rientjes
2025-01-27  7:11       ` Raghavendra K T
2025-02-06  3:14   ` Yuanchu Xie
2025-01-26  2:27 ` Huang, Ying
2025-01-27  5:11   ` Bharata B Rao
2025-01-27 18:34     ` SeongJae Park
2025-02-07  8:10       ` Huang, Ying
2025-02-07  9:06         ` Gregory Price
2025-02-07 19:52         ` SeongJae Park
2025-02-07 19:06   ` Davidlohr Bueso
2025-03-14  1:56     ` Raghavendra K T
2025-03-14  2:12       ` Raghavendra K T
2025-01-31 12:28 ` Jonathan Cameron
2025-01-31 13:09   ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron
2025-02-05  6:24     ` Bharata B Rao
2025-02-05 16:05       ` Johannes Weiner
2025-02-06  6:46         ` SeongJae Park
2025-02-06 15:30         ` Jonathan Cameron
2025-02-07  9:50       ` Matthew Wilcox
2025-02-16  7:04       ` Huang, Ying
2025-02-16  6:49     ` Huang, Ying
2025-02-17  4:10       ` Bharata B Rao
2025-02-17  8:06         ` Huang, Ying
2025-03-14 14:24       ` Jonathan Cameron
2025-03-17 22:34         ` Davidlohr Bueso
2025-02-03  2:23   ` [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
2025-04-07  3:13 ` Bharata B Rao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox