Date: Fri, 4 Apr 2025 11:39:12 +0100
From: Jonathan Cameron
To: Raghavendra K T, Bharata B Rao, SeongJae Park
CC: Michal Hocko, Dan Williams, Matthew Wilcox, Johannes Weiner, Gregory Price
Subject: Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
Message-ID: <20250404113912.00002606@huawei.com>
In-Reply-To: <20250319124552.0000344a@huawei.com>
References: <20250319124552.0000344a@huawei.com>

On Wed, 19 Mar 2025 12:47:53 +0000
Jonathan Cameron wrote:

https://drive.google.com/file/d/1o9g-Bggg7jJwrkLa90ZyLEW6xPdp2D2a/view?usp=drivesdk

Slides as presented at LSF-MM.

> Prior to LSFMM, this is an update on where the discussion has gone on list
> since the original proposal back in January (which was buried in the
> thread for Ragha's proposal focused on PTE A bit scanning)
>
> v1: https://lore.kernel.org/all/20250131130901.00000dd1@huawei.com/
>
> Note that this is combining comments and discussion from many people and I may
> well have summarized things badly + missed key details.  If time allows
> I'll update with a v3 when people have ripped up this straw man.
>
> Bharata has posted code for one approach and discussion is ongoing:
> https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
> This proposal overlaps with parts of several other proposals (DAMON, access
> bit tracking etc) but the focus is intended to be more general.
>
> Abstract:
>
> We have:
> 1) A range of different technologies tracking what may be loosely defined
>    as the hotness of regions of memory.
> 2) A set of use cases that care about this data.
>
> Question:
>
> Is it useful or feasible to aggregate the data from the sources (1) to some
> layer before providing answers to (2)?  What should that layer look like?
> What services and abstractions should it provide?  Is there commonality in
> what those use cases need?
>
> By aggregate I'm not necessarily implying multiple techniques in use at
> once, but more that we want one interface driven by whatever solution
> is the right balance on a particular system.  That balance can be affected
> by hardware availability or characteristics of the system or workload.
>
> Note that many of the hotness driven actions are painful (e.g. migration
> of hot pages) and for those we need to be very sure it is a good idea
> to do anything at all!
>
> My assumption is that in at least some cases the problem will be too hard
> to solve in kernel but let's consider what we can do.
>
> On to the details:
> ------------------
>
> Note: I'm ignoring the low-level implementation details of each method
> and how they avoid resource exhaustion, tune sampling timing (epoch length)
> and what is sampled (scanning random etc) as in at least some cases that's
> a problem for the lowest, technique-specific level.
>
> Enumerating the cases (thanks to Bharata, Johannes, SJ and others for inputs
> on this!). Much of this is direct quotes from this thread:
> https://lore.kernel.org/all/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
> (particularly Bharata's reply to my original questions)
>
> Here is a compilation of available temperature sources and how the
> hot/access data is consumed by different subsystems:
>
> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
>
> ==================================================
> Temperature             PA      VA      AA      NA
> source
> ==================================================
> PROT_NONE faults        Y       Y       Y       Y
> --------------------------------------------------
> folio_mark_accessed()   Y               Y       Y
> --------------------------------------------------
> PTE A bit               Y       Y       N*      N
> --------------------------------------------------
> Platform hints          Y       Y       Y       Y
> (AMD IBS)
> --------------------------------------------------
> Device hints            Y       N       N       N
> (CXL HMU)
> ==================================================
> * Some information available from scanning timing.
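
One way to read the table above is as a per-source capability mask, which makes "what would have to be filled in by other means" a simple lookup. The following is only a sketch in Python for illustration: the names Info, SOURCES and missing() are hypothetical, the blank cell for folio_mark_accessed() is assumed to be the VA column, and "N*" for the PTE A bit is treated as a plain N.

```python
from enum import Flag, auto


class Info(Flag):
    """What a temperature source reports, using the table's column names."""
    PA = auto()  # physical address
    VA = auto()  # virtual address
    AA = auto()  # access time
    NA = auto()  # accessing node


ALL = Info.PA | Info.VA | Info.AA | Info.NA

# Transcribed from the table. Assumptions: the blank cell for
# folio_mark_accessed() is the VA column, and "N*" is treated as N.
SOURCES = {
    "PROT_NONE faults": ALL,
    "folio_mark_accessed()": Info.PA | Info.AA | Info.NA,
    "PTE A bit": Info.PA | Info.VA,
    "Platform hints (AMD IBS)": ALL,
    "Device hints (CXL HMU)": Info.PA,
}


def missing(source: str) -> Info:
    """Fields a consumer would have to obtain some other way (rmap etc)."""
    return ALL & ~SOURCES[source]
```

For example, missing("Device hints (CXL HMU)") comes back as VA, AA and NA combined, while for PROT_NONE faults nothing is missing.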
> In all cases other methods can be applied to fill in the missing data
> (rmap etc)
>
> And here is an attempt to compile how different subsystems
> use the above data:
> ====================================================================================================
> Source                  Subsystem       Consumption             Activation/Frequency
> ====================================================================================================
> PROT_NONE faults        NUMAB           NUMAB=1 locality based  While task is running,
> via process pgtable                     balancing               rate varies on observed
> walk                                    NUMAB=2 hot page        locality and sysctl knobs.
>                                         promotion
> ====================================================================================================
> folio_mark_accessed()   FS/filemap/GUP  LRU list activation     On cache access and unmap
> ====================================================================================================
> PTE A bit via           Reclaim:LRU     LRU list activation,    During memory pressure
> rmap walk                               deactivation/demotion
> ====================================================================================================
> PTE A bit via           Reclaim:MGLRU   LRU list activation,    - During memory pressure
> rmap walk and process                   deactivation/demotion   - Continuous sampling (configurable)
> pgtable walk                                                      for workingset reporting
> ====================================================================================================
> PTE A bit via           DAMON           LRU activation,
> rmap walk                               hot page promotion,
>                                         demotion etc
> ====================================================================================================
> Platform hints          NUMAB           NUMAB=1 Locality based
> (e.g. AMD IBS)                          balancing and
>                                         NUMAB=2 hot page
>                                         promotion
> ====================================================================================================
> Device hints            NUMAB           NUMAB=2 hot page
> (e.g. CXL HMU)                          promotion
> ====================================================================================================
> PG_young / PG_idle      ?
> ====================================================================================================
>
> Technique trade-offs:
>
> Why not just use one method?
>
> - Cost of capture, cost of use.
>   * Run all the time - aggregate data for stability of hotness.
>   * Run occasionally to minimize cost.
>
> - Different availability. e.g. IBS might be needed for other things,
>   hardware monitors may not be available.
>
> Straw man (based in part on IBS proposal linked above)
> ------------------------------------------------------
>
> Multiple sources become similar at different levels.
>
> Taking just tiering promotion as an example and keeping in mind the golden
> rule of tiered memory: Put data in the right place to start with if you
> can.  So this is about when you can't: application unaware, changing memory
> pressure and workload mix etc.
>
>     _____________________    __________________
>    | Sampling techniques |  | Hardware units   |
>    | - Access counter,   |  | CXL HMU etc      |
>    | - Trace based       |  |__________________|
>    |_____________________|           |
>               |                  Hot page
>            Events                    |
>               |                      |
>     __________v___________           |
>    | Events to counts     |          |
>    | - hashtable, sketch  |          |
>    |   etc                |          |
>    |______________________|          |
>               |                      |
>           Hot page                   |
>               |                      |
>    ___________V______________________V_________
>   | Hot list - responsible for stability?      |
>   |____________________________________________|
>                         |
>                Timely hotlist data
>                         |     Additional data (process newness, stack location...?)
>               __________v__________________|___
>              | Promotion Daemon                |
>              |_________________________________|
>
> For all paths where data is flowing down we probably need control parameters
> flowing back the other way + if we have multiple users of the datastream
> we need to satisfy each of their constraints.
>
> SJ has proposed perhaps extending DAMON as a possible interface layer.  I am
> yet to understand how that works in cases where regions do not provide
> a compact representation due to lack of contiguity in the hotness.
> An example use case is a hypervisor wanting to migrate data under unaware,
> cheap VMs.  After a system has been running for a while (particularly with hot
> pages being migrated, swap etc) the hotness map looks much like noise.
>
> Now for the "there be monsters" bit...
> --------------------------------------
>
> - Stability of hotness matters and is hard to establish.
>   Predict a page will remain hot - various heuristics.
>   a) It is hot, probably stays so? (super hot!)
>      Sometimes enough to be detected as hot once,
>      often not.
>   b) It has been hot a while, probably stays so.
>      Check this hot list against previous hot list,
>      entries in both needed to promote.
>      This has a problem if the hot list is small compared to the
>      total count of hot pages.  Say the list is 1% of pages and 20%
>      are actually hot: low chance of repeats even among hot pages.
>   c) It is hot, let's monitor a while before doing anything.
>      Measurement technique may change.  Maybe cheaper
>      to monitor 'candidate' pages than all pages,
>      e.g. CXL HMU gives 1000 pages, then we use access bit
>      sampling to check they are at least accessed N times
>      in next second.
>   d) It was hot, we moved it.  Did it stay hot?
>      More useful to identify when we are thrashing and should
>      just stop doing anything.  Too late to fix this one!
> - Some data should be considered hot even when not in use (e.g. stack)
> - Use cases interfere.  So it can't just be a broadcast mode
>   where hotness information is sent to all users.
> - When to stop, start migration / tracking?
>   a) Detecting bad decisions.  Enough bad decisions, better to
>      do nothing?
>   b) Metadata beyond the counts is useful
>      https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
>      Promotion algorithms can need aggregate statistics for a memory
>      device to decide how much to move.
>
> As noted above, this may well overlap with other sessions.
> One outcome of the discussion so far is to highlight what I think many
> already knew.  This is hard!
>
> Jonathan
>
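
On heuristic (b) in the stability list, a rough worked example of why repeats are rare. It rests on an assumption that is not in the thread: each epoch's hot list is an independent, uniform random sample of the truly hot pages, with no false positives. Under that model a hot page is listed with probability 0.01 / 0.20 = 5% per epoch, so only about 5% of one list's entries show up again in the next. The function names below are made up for illustration (Python, purely as a sketch).

```python
import random


def expected_repeat_fraction(list_frac: float, hot_frac: float) -> float:
    """Expected fraction of one hot list that reappears in the next one.

    Assumes each epoch's list is an independent uniform sample of the hot
    set, so any hot page is listed with probability list_frac / hot_frac.
    """
    if not 0 < list_frac <= hot_frac <= 1:
        raise ValueError("need 0 < list_frac <= hot_frac <= 1")
    return list_frac / hot_frac


def simulate_repeat_fraction(total_pages: int, list_frac: float,
                             hot_frac: float, epochs: int,
                             seed: int = 0) -> float:
    """Monte Carlo check of the same quantity using random hot lists."""
    rng = random.Random(seed)
    hot_pages = range(round(total_pages * hot_frac))  # pages 0..H-1 are hot
    list_len = round(total_pages * list_frac)
    previous = set(rng.sample(hot_pages, list_len))
    repeats = 0
    for _ in range(epochs):
        current = set(rng.sample(hot_pages, list_len))
        repeats += len(current & previous)  # entries present in both lists
        previous = current
    return repeats / (epochs * list_len)


if __name__ == "__main__":
    print(expected_repeat_fraction(0.01, 0.20))
    print(simulate_repeat_fraction(100_000, 0.01, 0.20, epochs=200))
```

If that model is anywhere near right, requiring an entry to be in two consecutive lists rejects roughly 95% of the genuinely hot pages that were reported in any given epoch. Real sources are not uniform samplers, so this is an illustration of the scale of the problem rather than a prediction.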