Date: Wed, 19 Mar 2025 12:47:53 +0000
From: Jonathan Cameron
To: Raghavendra K T, Bharata B Rao, SeongJae Park
CC: Michal Hocko, Dan Williams, Matthew Wilcox, Johannes Weiner, Gregory Price
Subject: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
Message-ID: <20250319124552.0000344a@huawei.com>

Prior to LSFMM, this is an update on where the discussion has gone on list
since the original proposal back in January (which was buried in the
thread for Ragha's proposal focused on PTE A bit scanning).

v1: https://lore.kernel.org/all/20250131130901.00000dd1@huawei.com/

Note that this is combining comments and discussion from many people and
I may well have summarized things badly + missed key details. If time
allows I'll update with a v3 when people have ripped up this straw man.

Bharata has posted code for one approach and discussion is ongoing:
https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

This proposal overlaps with parts of several other proposals (DAMON,
access bit tracking etc) but the focus is intended to be more general.

Abstract:

We have:
1) A range of different technologies tracking what may be loosely defined
   as the hotness of regions of memory.
2) A set of use cases that care about this data.

Question:

Is it useful or feasible to aggregate the data from the sources (1) into
some layer before providing answers to (2)?
What should that layer look like?
What services and abstractions should it provide?
Is there commonality in what those use cases need?

By aggregate I'm not necessarily implying multiple techniques in use at
once, but more that we want one interface driven by whatever solution is
the right balance on a particular system. That balance can be affected by
hardware availability or characteristics of the system or workload.

Note that many of the hotness driven actions are painful (e.g. migration
of hot pages) and for those we need to be very sure it is a good idea to
do anything at all!

My assumption is that in at least some cases the problem will be too hard
to solve in kernel, but let's consider what we can do.

On to the details:
------------------

Note: I'm ignoring the low level implementation details of each method
and how they avoid resource exhaustion, tune sampling timing (epoch
length) and what is sampled (scanning random etc) as in at least some
cases that's a problem for the lowest technique specific level.

Enumerating the cases (thanks to Bharata, Johannes, SJ and others for
inputs on this!)
Much of this is direct quotes from this thread:
https://lore.kernel.org/all/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
(particularly Bharata's reply to my original questions)

Here is a compilation of available temperature sources and how the
hot/access data is consumed by different subsystems:

PA - Physical address available
VA - Virtual address available
AA - Access time available
NA - Accessing node info available

==================================================
Temperature             PA      VA      AA      NA
source
==================================================
PROT_NONE faults        Y       Y       Y       Y
--------------------------------------------------
folio_mark_accessed()   Y               Y       Y
--------------------------------------------------
PTE A bit               Y       Y       N*      N
--------------------------------------------------
Platform hints          Y       Y       Y       Y
(AMD IBS)
--------------------------------------------------
Device hints            Y       N       N       N
(CXL HMU)
==================================================
* Some information available from scanning timing.

In all cases other methods can be applied to fill in the missing data
(rmap etc)

And here is an attempt to compile how different subsystems use the above
data:

==========================================================================================
Source                 Subsystem       Consumption              Activation/Frequency
==========================================================================================
PROT_NONE faults       NUMAB           NUMAB=1 locality based   While task is running,
via process pgtable                    balancing                rate varies on observed
walk                                   NUMAB=2 hot page         locality and sysctl knobs.
                                       promotion
==========================================================================================
folio_mark_accessed()  FS/filemap/GUP  LRU list activation      On cache access and unmap
==========================================================================================
PTE A bit via          Reclaim:LRU     LRU list activation,     During memory pressure
rmap walk                              deactivation/demotion
==========================================================================================
PTE A bit via          Reclaim:MGLRU   LRU list activation,     - During memory pressure
rmap walk and process                  deactivation/demotion    - Continuous sampling
pgtable walk                                                      (configurable) for
                                                                  workingset reporting
==========================================================================================
PTE A bit via          DAMON           LRU activation,
rmap walk                              hot page promotion,
                                       demotion etc
==========================================================================================
Platform hints         NUMAB           NUMAB=1 Locality based
(e.g. AMD IBS)                         balancing and
                                       NUMAB=2 hot page
                                       promotion
==========================================================================================
Device hints           NUMAB           NUMAB=2 hot page
(e.g. CXL HMU)                         promotion
==========================================================================================
PG_young / PG_idle     ?
==========================================================================================

Technique trade offs:

Why not just use one method?
- Cost of capture, cost of use.
  * Run all the time - aggregate data for stability of hotness.
  * Run occasionally to minimize cost.
- Different availability. e.g. IBS might be needed for other things,
  hardware monitors may not be available.

Straw man (based in part on the IBS proposal linked above)
----------------------------------------------------------

Multiple sources become similar at different levels.

Taking just tiering promotion as an example and keeping in mind the
golden rule of tiered memory: Put data in the right place to start with
if you can.
So this is about when you can't: application unaware, changing memory
pressure and workload mix etc.

 _____________________      __________________
| Sampling techniques |   | Hardware units   |
| - Access counter,   |   | CXL HMU etc      |
| - Trace based       |   |__________________|
|_____________________|            |
           |                       |
    Hot page Events                |
           |                       |
 __________v___________            |
| Events to counts     |           |
| - hashtable, sketch  |           |
|   etc                |           |
|______________________|           |
           |                       |
       Hot page                    |
           |                       |
 __________V_______________________V_________
| Hot list - responsible for stability?      |
|____________________________________________|
           |
  Timely hotlist data         | Additional data (process newness, stack location...?)
 __________v__________________|___
| Promotion Daemon                |
|_________________________________|

For all paths where data is flowing down we probably need control
parameters flowing back the other way + if we have multiple users of the
datastream we need to satisfy each of their constraints.

SJ has proposed perhaps extending DAMON as a possible interface layer. I
am yet to understand how that works in cases where regions do not provide
a compact representation due to lack of contiguity in the hotness.
An example usecase is a hypervisor wanting to migrate data under unaware,
cheap VMs. After a system has been running for a while (particularly with
hot pages being migrated, swap etc) the hotness map looks much like noise.

Now for the "there be monsters" bit...
--------------------------------------

- Stability of hotness matters and is hard to establish.
  Predict a page will remain hot - various heuristics.
  a) It is hot, probably stays so? (super hot!)
     Sometimes enough to be detected as hot once, often not.
  b) It has been hot a while, probably stays so.
     Check this hot list against previous hot list, entries in both
     needed to promote.
     This has a problem if hotlist is small compared to total count of
     hot pages. Say list is 1%, 20% actually hot, low chance of repeats
     even in hot pages.
  c) It is hot, let's monitor a while before doing anything.
     Measurement technique may change. Maybe cheaper to monitor
     'candidate' pages than all pages, e.g. CXL HMU gives 1000 pages,
     then we use access bit sampling to check they are accessed at
     least N times in the next second.
  d) It was hot, we moved it. Did it stay hot?
     More useful to identify when we are thrashing and should just stop
     doing anything. Too late to fix this one!

- Some data should be considered hot even when not in use (e.g. stack)

- Usecases interfere. So it can't just be a broadcast mode where hotness
  information is sent to all users.

- When to stop, start migration / tracking?
  a) Detecting bad decisions. Enough bad decisions, better to do nothing?
  b) Metadata beyond the counts is useful
     https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
     Promotion algorithms can need aggregate statistics for a memory
     device to decide how much to move.

As noted above, this may well overlap with other sessions.
One outcome of the discussion so far is to highlight what I think many
already knew. This is hard!

Jonathan
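
As a footnote to heuristic b) in the stability list above, here is a
minimal sketch in C that puts rough numbers on the "low chance of
repeats" problem. It is illustrative only and rests on assumptions that
are not part of the proposal: each epoch's hot list is a sorted array of
PFNs, and each list is a uniform random sample of the truly hot pages
(real sources will be biased, so treat the estimate as a rough guide).
hotlist_intersect() and expected_repeats() are hypothetical names, not
existing kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Heuristic b): a page qualifies for promotion only if it appears in
 * both the previous and the current hot list.
 *
 * Both lists are assumed (hypothetically) to be sorted arrays of unique
 * PFNs. Returns the number of PFNs present in both lists and, if out is
 * non-NULL, stores them there (out needs room for min(nprev, ncur)).
 */
size_t hotlist_intersect(const uint64_t *prev, size_t nprev,
                         const uint64_t *cur, size_t ncur,
                         uint64_t *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < nprev && j < ncur) {
        if (prev[i] < cur[j]) {
            i++;
        } else if (prev[i] > cur[j]) {
            j++;
        } else {
            if (out)
                out[n] = cur[j];
            n++;
            i++;
            j++;
        }
    }
    return n;
}

/*
 * Expected number of entries shared by two hot lists of list_len pages
 * each, ASSUMING every list is a uniform random sample drawn from
 * hot_pages truly hot pages. Each entry of the second list is also in
 * the first with probability list_len / hot_pages, so the expectation
 * is list_len * list_len / hot_pages (integer division, rounds down).
 * list_len * list_len must fit in 64 bits.
 */
uint64_t expected_repeats(uint64_t list_len, uint64_t hot_pages)
{
    assert(hot_pages > 0 && list_len <= hot_pages);
    return list_len * list_len / hot_pages;
}
```

Taking the figures from b) and assuming 1,000,000 pages in total: 20%
hot is 200,000 pages and a 1% list is 10,000 entries, so
expected_repeats(10000, 200000) gives 500. Under the uniform sampling
assumption only about 5% of each list would survive the "in both lists"
check, even though every entry is genuinely hot.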