From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D0135C36010
	for <linux-mm@archiver.kernel.org>; Fri,  4 Apr 2025 10:39:22 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 61A8B6B000C; Fri,  4 Apr 2025 06:39:20 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 5C6E96B000D; Fri,  4 Apr 2025 06:39:20 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 4B5E46B000E; Fri,  4 Apr 2025 06:39:20 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id 2E2B36B000C
	for <linux-mm@kvack.org>; Fri,  4 Apr 2025 06:39:20 -0400 (EDT)
Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay08.hostedemail.com (Postfix) with ESMTP id 934FE142576
	for <linux-mm@kvack.org>; Fri,  4 Apr 2025 10:39:21 +0000 (UTC)
X-FDA: 83296014522.11.254EFDC
Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56])
	by imf29.hostedemail.com (Postfix) with ESMTP id 33F50120011
	for <linux-mm@kvack.org>; Fri,  4 Apr 2025 10:39:18 +0000 (UTC)
Authentication-Results: imf29.hostedemail.com;
	dkim=none;
	spf=pass (imf29.hostedemail.com: domain of jonathan.cameron@huawei.com designates 185.176.79.56 as permitted sender) smtp.mailfrom=jonathan.cameron@huawei.com;
	dmarc=pass (policy=quarantine) header.from=huawei.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1743763159;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=QsYzn13vyOrX0iCdcUce/8PsEZzPd2UqusBXp38ZHBw=;
	b=4dcl/KbW1iI+Yr5q9ARqbIssNlK3HZRiu0duklIolBKFyUnWp3hKh5TY2fQIJrStGvj066
	1rbjnM+FixChTMxLKvsjWzEpjz2R+D6ojNJSvEJ6uHE+5RmvwN6jRZ6X/XpUn8wwCD5QfV
	4CjO15SgQDExVloh3ENBr4iSQjDlahQ=
ARC-Authentication-Results: i=1;
	imf29.hostedemail.com;
	dkim=none;
	spf=pass (imf29.hostedemail.com: domain of jonathan.cameron@huawei.com designates 185.176.79.56 as permitted sender) smtp.mailfrom=jonathan.cameron@huawei.com;
	dmarc=pass (policy=quarantine) header.from=huawei.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1743763159; a=rsa-sha256;
	cv=none;
	b=vvcb8Z9GUhnxb3buzC/ksolskp3Arg002tIF2wByKKns58Qs0AWODErpfe80eWarbeDXZA
	UA7IL8Ei7/+sdVtneAnHMX1u920+EASJxi3TDKx7JoeH0a1gjwUMGKegUluefUi/H929me
	N4DAu8ugyKpzXIyisaiRkc2G96l59Ss=
Received: from mail.maildlp.com (unknown [172.18.186.31])
	by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4ZTZkY4CG1z67HSr;
	Fri,  4 Apr 2025 18:35:33 +0800 (CST)
Received: from frapeml500008.china.huawei.com (unknown [7.182.85.71])
	by mail.maildlp.com (Postfix) with ESMTPS id 3FB6A1400D3;
	Fri,  4 Apr 2025 18:39:15 +0800 (CST)
Received: from localhost (10.203.177.66) by frapeml500008.china.huawei.com
 (7.182.85.71) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Fri, 4 Apr
 2025 12:39:14 +0200
Date: Fri, 4 Apr 2025 11:39:12 +0100
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Raghavendra K T <raghavendra.kt@amd.com>, Bharata B Rao <bharata@amd.com>,
	SeongJae Park <sj@kernel.org>, <lsf-pc@lists.linux-foundation.org>,
	<linux-mm@kvack.org>
CC: Michal Hocko <mhocko@suse.com>, Dan Williams <dan.j.williams@intel.com>,
	Matthew Wilcox <willy@infradead.org>, Johannes Weiner <hannes@cmpxchg.org>,
	Gregory Price <gourry@gourry.net>
Subject: Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature
 information - what info is actually wanted?
Message-ID: <20250404113912.00002606@huawei.com>
In-Reply-To: <20250319124552.0000344a@huawei.com>
References: <20250319124552.0000344a@huawei.com>
X-Mailer: Claws Mail 4.3.0 (GTK 3.24.42; x86_64-w64-mingw32)
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.203.177.66]
X-ClientProxiedBy: lhrpeml500009.china.huawei.com (7.191.174.84) To
 frapeml500008.china.huawei.com (7.182.85.71)
X-Rspamd-Queue-Id: 33F50120011
X-Rspam-User: 
X-Rspamd-Server: rspam02
X-Stat-Signature: mpfbsdttjb8sbmk7bqoha1p86ii3swpi
X-HE-Tag: 1743763158-54383
X-Bogosity: Ham, tests=bogofilter, spamicity=0.001037, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Wed, 19 Mar 2025 12:47:53 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:


https://drive.google.com/file/d/1o9g-Bggg7jJwrkLa90ZyLEW6xPdp2D2a/view?usp=drivesdk

Slides as presented at LSF-MM.

> Prior to LSFMM, this is an update on where the discussion has gone on list
> since the original proposal back in January (which was buried in the
> thread for Ragha's proposal focused on PTE A bit scanning)
> 
> v1: https://lore.kernel.org/all/20250131130901.00000dd1@huawei.com/
> 
> Note that this is combining comments and discussion from many people and I may
> well have summarized things badly + missed key details. If time allows
> I'll update with a v3 when people have ripped up this straw man.
> 
> Bharata has posted code for one approach and discussion is ongoing:
> https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
> This proposal overlaps with parts of several other proposals (DAMON, access
> bit tracking etc.) but the focus is intended to be more general.
> 
> Abstract:
> 
> We have:
> 1) A range of different technologies tracking what may be loosely defined
> as the hotness of regions of memory.
> 2) A set of use cases that care about this data.
> 
> Question:
> 
> Is it useful or feasible to aggregate the data from the sources (1) to some
> layer before providing answers to (2)?  What should that layer look like?
> What services and abstractions should it provide? Is there commonality in
> what those use cases need?
> 
> By aggregate I'm not necessarily implying multiple techniques in use at
> once, but more that we want one interface driven by whatever solution
> is the right balance on a particular system. That balance can be affected
> by hardware availability or characteristics of the system or workload.
> 
> Note that many of the hotness driven actions are painful (e.g. migration
> of hot pages) and for those we need to be very sure it is a good idea
> to do anything at all!
> 
> My assumption is that in at least some cases the problem will be too hard
> to solve in kernel but let's consider what we can do.
> 
> On to the details:
> ------------------
> 
> Note: I'm ignoring the low level implementation details of each method
> and how they avoid resource exhaustion, tune sampling timing (epoch length)
> and what is sampled (scanning random etc) as in at least some cases that's
> a problem for the lowest technique specific level.
> 
> Enumerating the cases (thanks to Bharata, Johannes, SJ and others for inputs
> on this!)  Much of this is direct quotes from this thread:
> https://lore.kernel.org/all/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
> (particularly Bharata's reply to my original questions)
> 
> Here is a compilation of available temperature sources and how the 
> hot/access data is consumed by different subsystems:
> 
> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
> 
> ==================================================
> Temperature		PA	VA	AA	NA
> source
> ==================================================
> PROT_NONE faults	Y	Y	Y	Y
> --------------------------------------------------
> folio_mark_accessed()	Y		Y	Y
> --------------------------------------------------
> PTE A bit		Y	Y	N*	N
> --------------------------------------------------
> Platform hints		Y	Y	Y	Y
> (AMD IBS)
> --------------------------------------------------
> Device hints		Y	N	N	N
> (CXL HMU)
> ==================================================
> * Some information available from scanning timing.
>   In all cases other methods can be applied to fill in the missing data
>   (rmap etc)
> 
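
The source table above can be read as a per-source capability map over one
common event record. A minimal Python sketch of that reading follows. All
names here (HotnessEvent, SOURCE_CAPS, missing_fields) are hypothetical and
are not taken from any posted patch. The blank VA cell for
folio_mark_accessed() is kept as unknown (None) rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

# Capability flags per source, copied from the table above.
# True = Y, False = N, None = cell left blank in the table.
# Keys: pa (physical address), va (virtual address),
#       aa (access time), na (accessing node).
# Note: the table marks PTE A bit access time as "N*", i.e. some
# information can be recovered from scan timing; it is False here.
SOURCE_CAPS = {
    "prot_none_fault":     {"pa": True, "va": True,  "aa": True,  "na": True},
    "folio_mark_accessed": {"pa": True, "va": None,  "aa": True,  "na": True},
    "pte_a_bit":           {"pa": True, "va": True,  "aa": False, "na": False},
    "platform_hint":       {"pa": True, "va": True,  "aa": True,  "na": True},
    "device_hint":         {"pa": True, "va": False, "aa": False, "na": False},
}


@dataclass
class HotnessEvent:
    """Hypothetical common record; absent data stays None."""
    source: str
    pa: Optional[int] = None
    va: Optional[int] = None
    access_time: Optional[int] = None
    node: Optional[int] = None


def missing_fields(source):
    """Return the fields a source is not known to provide (N or blank)."""
    caps = SOURCE_CAPS[source]
    return sorted(name for name, have in caps.items() if have is not True)
```

A consumer could use missing_fields("device_hint") to learn that VA, access
time and node would have to be filled in by other means (rmap etc.), as the
table footnote suggests.
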
> And here is an attempt to compile how different subsystems
> use the above data:
> ==========================================================================================
> Source			Subsystem	Consumption		Activation/Frequency
> ==========================================================================================
> PROT_NONE faults	NUMAB		NUMAB=1 locality based	While task is running,
> via process pgtable			balancing		rate varies on observed
> walk					NUMAB=2 hot page	locality and sysctl knobs.
> 					promotion
> ==========================================================================================
> folio_mark_accessed()	FS/filemap/GUP	LRU list activation	On cache access and unmap
> ==========================================================================================
> PTE A bit via		Reclaim:LRU	LRU list activation,	During memory pressure
> rmap walk				deactivation/demotion
> ==========================================================================================
> PTE A bit via		Reclaim:MGLRU	LRU list activation,	- During memory pressure
> rmap walk and process			deactivation/demotion	- Continuous sampling (configurable)
> pgtable walk							  for workingset reporting
> ==========================================================================================
> PTE A bit via		DAMON		LRU activation,
> rmap walk				hot page promotion,
> 					demotion etc
> ==========================================================================================
> Platform hints		NUMAB		NUMAB=1 Locality based
> (e.g. AMD IBS)				balancing and
> 					NUMAB=2 hot page
> 					promotion
> ==========================================================================================
> Device hints		NUMAB		NUMAB=2 hot page
> (e.g. CXL HMU)				promotion
> ==========================================================================================
> PG_young / PG_idle ?
> ==========================================================================================
> 
> Technique trade offs:
> 
> Why not just use one method?
> 
> - Cost of capture, cost of use.
>   * Run all the time - aggregate data for stability of hotness.
>   * Run occasionally to minimize cost.
> 
> - Different availability. e.g. IBS might be needed for other things,
>   hardware monitors may not be available.
> 
> Straw man (based in part on IBS proposal linked above)
> ------------------------------------------------------
> 
> Multiple sources become similar at different levels.
> 
> Taking just tiering promotion as an example and keeping in mind the golden
> rule of tiered memory: Put data in the right place to start with if you
> can.  So this is about when you can't: application unaware, changing memory
> pressure and workload mix etc.
> 
>    _____________________     __________________
>   | Sampling techniques |   | Hardware units  |
>   | - Access counter,   |   | CXL HMU etc     |
>   | - Trace based       |   |_________________|
>   |_____________________|           |
>              |                  Hot page
>            Events                   |
>              |                      |
>    __________v___________           |
>   |  Events to counts    |          |
>   |  - hashtable, sketch |          |
>   |    etc               |          |
>   |______________________|          |
>              |                      |
>           Hot page                  |
>              |                      |
>   ___________V______________________V_________
>  |  Hot list - responsible for stability?     |
>  |____________________________________________|
>              |
>         Timely hotlist data        
>              |               Additional data (process newness, stack location...?)
>    __________v__________________|___
>   |  Promotion Daemon               |
>   |_________________________________|
> 
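
For the "Events to counts" box in the diagram above, one possible sketch
structure is a count-min sketch. The Python below is only an illustration
under assumptions: the class name, the width/depth defaults and the use of
page frame numbers as keys are all hypothetical, and nothing here reflects
what any of the posted series actually implement.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter; it can overcount, never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, pfn):
        # One independent-looking hash per row, derived by salting with the
        # row number. blake2b is used only because it is in the stdlib.
        digest = hashlib.blake2b(
            pfn.to_bytes(8, "little"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, pfn, count=1):
        """Record `count` access events for page frame `pfn`."""
        for row in range(self.depth):
            self.rows[row][self._index(row, pfn)] += count

    def estimate(self, pfn):
        """Upper-bound estimate of events seen for `pfn`."""
        return min(
            self.rows[row][self._index(row, pfn)] for row in range(self.depth)
        )


def hot_pages(sketch, candidates, threshold):
    """Return the candidates whose estimated count reaches the threshold."""
    return [pfn for pfn in candidates if sketch.estimate(pfn) >= threshold]
```

The trade-off versus a hashtable is bounded memory in exchange for possible
overestimates, which may matter given that acting on a false hot page
(migration) is expensive.
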
> For all paths where data is flowing down we probably need control parameters
> flowing back the other way + if we have multiple users of the datastream
> we need to satisfy each of their constraints.
> 
> SJ has proposed perhaps extending DAMON as a possible interface layer. I am
> yet to understand how that works in cases where regions do not provide
> a compact representation due to lack of contiguity in the hotness.
> An example use case is a hypervisor wanting to migrate data under unaware,
> cheap VMs.  After a system has been running for a while (particularly with hot
> pages being migrated, swap etc.) the hotness map looks much like noise.
> 
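
The contiguity concern above can be illustrated with a toy model. This is
not DAMON's actual region logic. It only counts how many address ranges are
needed to describe the same number of hot pages exactly, under two
hypothetical layouts chosen for illustration.

```python
def count_regions(hot):
    """Count maximal runs of consecutive hot pages in a per-page bool list."""
    regions = 0
    prev = False
    for is_hot in hot:
        if is_hot and not prev:
            regions += 1  # a new run of hot pages starts here
        prev = is_hot
    return regions


n = 1024
# Both layouts have 256 hot pages out of 1024.
clustered = [i < n // 4 for i in range(n)]   # one contiguous hot range
scattered = [i % 4 == 0 for i in range(n)]   # every fourth page is hot

clustered_regions = count_regions(clustered)
scattered_regions = count_regions(scattered)
```

With the clustered layout one range describes all hot pages; with the
scattered layout every hot page needs its own range, so a range-based
summary is no more compact than a per-page list.
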
> Now for the "there be monsters" bit...
> ---------------------------------------
> 
> - Stability of hotness matters and is hard to establish.
>   Predict a page will remain hot - various heuristics.
> 	a) It is hot, probably stays so? (super hot!)
> 	   Sometimes enough to be detected as hot once,
> 	   often not.
> 	b) It has been hot a while, probably stays so.
> 	   Check this hot list against previous hot list,
> 	   entries in both needed to promote.
> 	   This has a problem if hotlist is small compared to
> 	   total count of hot pages.  Say list is 1%, 20% actually
> 	   hot, low chance of repeats even in hot pages.
> 	c) It is hot, let's monitor a while before doing anything.
> 	   Measurement technique may change. Maybe cheaper
> 	   to monitor 'candidate' pages than all pages
> 	   e.g. CXL HMU gives 1000 pages, then we use access bit
> 	        sampling to check they are at least accessed N times
> 		in next second.
> 	d) It was hot, we moved it. Did it stay hot?
> 	   More useful to identify when we are thrashing and should
> 	   just stop doing anything.  Too late to fix this one!
> - Some data should be considered hot even when not in use (e.g. stack)
> - Use cases interfere. So it can't just be a broadcast mode
>   where hotness information is sent to all users.
> - When to stop, start migration / tracking?
> 	a) Detecting bad decisions. Enough bad decisions, better to
> 	   do nothing?
>  	b) Metadata beyond the counts is useful
> 	   https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> 	   Promotion algorithms can need aggregate statistics for a memory 
> 	   device to decide how much to move.
> 
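
Heuristic (b) above, and the small-hotlist problem it runs into, can be
sketched as follows. The function names are hypothetical. The repeat
estimate ASSUMES each epoch's hot list is an independent, uniform draw from
the truly hot pages, which is a simplification and not something the
proposal states.

```python
def promotable(current_hot, previous_hot):
    """Heuristic (b): only pages present on both hot lists qualify."""
    return set(current_hot) & set(previous_hot)


def expected_repeat_fraction(list_fraction, hot_fraction):
    """Expected share of this epoch's list that was also on the last one.

    ASSUMPTION: each list is an independent uniform sample of the truly
    hot pages. list_fraction and hot_fraction are fractions of all pages.
    """
    return min(1.0, list_fraction / hot_fraction)


# Numbers from the text: the list holds 1% of pages, 20% are actually hot.
repeat = expected_repeat_fraction(0.01, 0.20)
```

Under that assumption only about 1 in 20 list entries would repeat between
epochs, so requiring presence in both lists would reject most genuinely hot
pages.
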
> As noted above, this may well overlap with other sessions.
> One outcome of the discussion so far is to highlight what I think many
> already knew.  This is hard!
> 
> Jonathan
>