Date: Fri, 21 Mar 2025 15:30:44 +0000
From: Jonathan Cameron
To: SeongJae Park
CC: Raghavendra K T, Bharata B Rao, Michal Hocko, Dan Williams, Matthew Wilcox, Johannes Weiner, Gregory Price
Subject: Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
Message-ID: <20250321153044.000017aa@huawei.com>
In-Reply-To: <20250319235029.54378-1-sj@kernel.org>
References: <20250319124552.0000344a@huawei.com> <20250319235029.54378-1-sj@kernel.org>

> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==========================================================================================
> > Source                 Subsystem       Consumption             Activation/Frequency
> > ==========================================================================================
> > PROT_NONE faults       NUMAB           NUMAB=1 locality based  While task is running,
> > via process pgtable                    balancing               rate varies on observed
> > walk                                   NUMAB=2 hot page        locality and sysctl knobs.
> >                                        promotion
> > ==========================================================================================
> > folio_mark_accessed()  FS/filemap/GUP  LRU list activation     On cache access and unmap
> > ==========================================================================================
> > PTE A bit via          Reclaim:LRU     LRU list activation,    During memory pressure
> > rmap walk                              deactivation/demotion
> > ==========================================================================================
> > PTE A bit via          Reclaim:MGLRU   LRU list activation,    - During memory pressure
> > rmap walk and process                  deactivation/demotion   - Continuous sampling (configurable)
> > pgtable walk                                                     for workingset reporting
> > ==========================================================================================
> > PTE A bit via          DAMON           LRU activation,
> > rmap walk                              hot page promotion,
> >                                        demotion etc
>
> For virtual address spaces monitoring mode, DAMON uses PTE A bit via pgtable
> walk.
>
> Its activation and frequency is basically set as user requests.  Activation
> can be set to be reactive to memory pressure like events (using watermarks).
> Frequency can be auto-tuned for pursuing access events per snapshot ratio.

Thanks. I've added that (in very brief form) to the table in my slides.

> > SJ has proposed perhaps extending DAMON as a possible interface layer.
> > I am yet to understand how that works in cases where regions do not provide
> > a compact representation due to lack of contiguity in the hotness.
> > An example usecase is hypervisor wanting to migrate data under unaware,
> > cheap VMs. After a system has been running for a while (particularly with hot
> > pages being migrated, swap etc) the hotness map looks much like noise.
>
> Similar concerns for DAMON's region abstraction were raised for physical
> address space monitoring, because there is no cautious effort for making hot
> pages gathered together (or, locality).
>
> I'd argue there is no cautious effort to make temperature be spread, though.
> As a result, we can expect a level of uncautious bias, and that matches with my
> experiences from DAMON use cases on products environments so far.

Whilst I'm not in a position to share the data, as it's not mine :( I've
seen graphs that show that for at least some use cases, even if we have
some contiguity of hotness in the VA space, it looks like noise in PA.
So I think this is a case of 'mileage may vary'.  DAMON works great
sometimes, but sometimes the spread of access statistics happens to be wrong.

>
> Also, in practice, DAMON regions are used in combination with other
> information.  For example, DAMON-based reclaim checks the PTE A bit of each page
> in a DAMON-suggested cold memory region to make the final decision about whether
> to reclaim it or not, like MADV_PAGEOUT does.

Makes sense. The MADV_PAGEOUT case was one of the motivators for the mixing
methods suggestion. Here it's kind of DAMON + dense A bit checking (on
candidate pages).

>
> That is, yes, I agree DAMON's region abstraction is maybe not a good way to
> find perfect answer to some questions such as finding N-th hottest single page.
> And it has many rooms to improve.
> Nevertheless, even DAMON of today can give
> good enough best-effort answers for questions that are practical for some cases,
> such as finding regions that may contain the N most hot/cold pages, while
> keeping the monitoring overhead fixed as users ask.
>
> Also, please note that there is no reason to restrict DAMON to always use
> regions abstraction.  For different use-cases and situation, DAMON will be open
> to be extended to use new abstractions.  DAMON aims not to be a subsystem for
> DAMON regions concept but data access monitoring for practical efficiency, and
> continue random evolution for given environments.

Absolutely understood.  In my current thinking DAMON sits at a particular
layer in the stack and there may be one more abstraction on top of it
(e.g. a list of hot / cold pages). Equally possible that the layers may fuse
and it becomes an aspect of DAMON.

>
> >
> > Now for the "there be monsters bit"...
> > ---------------------------------------
> >
> > - Stability of hotness matters and is hard to establish.
> >   Predict a page will remain hot - various heuristics.
> >   a) It is hot, probably stays so? (super hot!)
> >      Sometimes enough to be detected as hot once,
> >      often not.
> >   b) It has been hot a while, probably stays so.
> >      Check this hot list against previous hot list,
> >      entries in both needed to promote.
> >      This has a problem if hotlist is small compared to
> >      total count of hot pages.  Say list is 1%, 20% actually
> >      hot, low chance of repeats even in hot pages.
> >   c) It is hot, let's monitor a while before doing anything.
> >      Measurement technique may change. Maybe cheaper
> >      to monitor 'candidate' pages than all pages
> >      e.g. CXL HMU gives 1000 pages, then we use access bit
> >      sampling to check they are at least accessed N times
> >      in next second.
> >   d) It was hot, We moved it.  Did it stay hot?
> >      More useful to identify when we are thrashing and should
> >      just stop doing anything.  Too late to fix this one!
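
To put rough numbers on the 1% / 20% example in heuristic b), here is a
back-of-envelope sketch. It assumes (my assumption, nothing in the thread
says so) that each hot list is a uniform random sample of the truly hot
pages and that consecutive lists are independent. The function name and
parameters are made up for illustration.

```python
def repeat_probability(list_fraction, hot_fraction):
    """Return (p_single, p_both) for one truly hot page.

    p_single: chance the page appears in one hot list.
    p_both:   chance it appears in two consecutive, independent lists.
    Assumes each list is a uniform sample of the hot set (illustrative only).
    """
    if not 0 < list_fraction <= hot_fraction <= 1:
        raise ValueError("need 0 < list_fraction <= hot_fraction <= 1")
    p_single = list_fraction / hot_fraction
    return p_single, p_single * p_single


# Numbers from the example: the list covers 1% of pages, 20% are actually hot.
p_single, p_both = repeat_probability(0.01, 0.20)
print(f"in one list: {p_single:.2%}, in two consecutive lists: {p_both:.2%}")
```

Under those assumptions a given hot page lands in one list with probability
5% and in two consecutive lists with probability 0.25%, so requiring a
repeat would promote very few of the pages that are genuinely hot. Real
lists are presumably biased towards the hottest pages, which would improve
on this.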
>
> DAMON is providing a sort of b) approach, aka DAMON regions' age, for finding
> both hot and cold regions.
>
> > - Some data should be considered hot even when not in use (e.g. stack)
>
> DAMOS filters is for this kind of exceptions, and DAMON kernel API is flexible
> enough to let callers directly manipulate the regions information based on
> their special knowledges.  We can further optimize the interface for easier
> uses, of course.

Nice.

>
> > - Usecases interfere.  So it can't just be a broadcast mode
> >   where hotness information is sent to all users.
> > - When to stop, start migration / tracking?
> >   a) Detecting bad decisions.  Enough bad decisions, better to
> >      do nothing?
> >   b) Metadata beyond the counts is useful
> >      https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> >      Promotion algorithms can need aggregate statistics for a memory
> >      device to decide how much to move.
>
> DAMOS quotas goal feature is a sort of a feature for this question.  It allows
> users to set target metric and value, and tune the aggressiveness.  For
> promotions and demotions, I suggested using upper tier utilization and free
> ratio as such possible goal metric, and gonna post an implementation for that
> soon.

Those are certainly good metrics to consider, but I think we definitely
also need a metric around how beneficial the moves being made are.
That matters more on the promotion path, because that interrupts access
to hot data and so will cause a temporary drop in performance / latency
spike.

>
> >
> > As noted above, this may well overlap with other sessions.
> > One outcome of the discussion so far is to highlight what I think many
> > already knew.  This is hard!
>
> Indeed.  Keeping more people on the same page is important and difficult.
> Thank you for your effort again, and looking forward to discuss in more depth!
>

I'm not sure we'll succeed.
This may well be a wild west situation for a while yet,
but hopefully we can slowly converge or at least build some common parts.

Jonathan

p.s. Heathrow disruption means I'm crossing my fingers on actually getting
to Montreal.

>
> Thanks,
> SJ
>
> >
> > Jonathan
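
A rough sketch of the "candidates first, then confirm" idea that shows up
both in heuristic c) and in the DAMON + A bit checking case above: take a
candidate set from a cheap or coarse source, then sample an accessed bit
for only those pages over a few rounds and keep the ones seen often enough.
This is a toy user-space model, not kernel code. sample_accessed() is a
hypothetical callback standing in for "test and clear the accessed bit",
and the names, round count and threshold are all invented for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List


def confirm_candidates(
    candidates: Iterable[int],
    sample_accessed: Callable[[int], bool],
    rounds: int,
    min_hits: int,
) -> List[int]:
    """Keep candidates whose accessed bit was seen set in >= min_hits rounds.

    sample_accessed(pfn) is a hypothetical stand-in for testing and clearing
    the accessed bit; it is not a real kernel or DAMON interface.
    """
    pages = list(candidates)
    hits = Counter()
    for _ in range(rounds):
        for pfn in pages:
            if sample_accessed(pfn):
                hits[pfn] += 1
    # Preserve candidate order; pages never seen accessed count as 0 hits.
    return [pfn for pfn in pages if hits[pfn] >= min_hits]


# Toy data: pages 1 and 3 are touched every round, page 2 never is.
always_hot = {1, 3}
confirmed = confirm_candidates(
    [1, 2, 3], lambda pfn: pfn in always_hot, rounds=10, min_hits=5
)
print(confirmed)
```

The cost scales with the number of candidates times the number of rounds
rather than with total memory, which is the attraction of only checking
densely where something else has already pointed.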