Date: Wed, 1 Jan 2025 20:44:39 -0800 (PST)
From: David Rientjes
To: Raghavendra K T
cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
    Matthew Wilcox, Mel Gorman, "Rao, Bharata Bhasker", Rik van Riel,
    RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, "Shukla, Santosh",
    "Grimm, Jon", sj@kernel.org, shy828301@gmail.com, Zi Yan,
    Liam Howlett, Gregory Price, linux-mm@kvack.org
Subject: Re: Slow-tier Page Promotion discussion recap and open questions
In-Reply-To: <32730696-5fc6-4d47-a623-74e951f704ec@amd.com>

On Fri, 20 Dec 2024, Raghavendra K T wrote:

> > I asked if this was
> > really done single threaded, which was confirmed. If
> > only a single process has pages on a slow memory tier, for example, then
> > flexible tuning of the scan period and size ensures we do not scan
> > needlessly. The scan period can be tuned to be more responsive (down to
> > 400ms in this proposal) depending on how many accesses we have on the
> > last scan; similarly, it can be much less responsive (up to 5s) if memory
> > is not found to be accessed.
> >
> > I also asked if scanning can be disabled entirely, Raghu clarified that
> > it cannot be.
> >
>
> We have a sysfs tunable (kmmscand/scan_enabled) to enable/disable the
> whole scanning at a global level but not at process level granularity.
>

Thanks Raghu for the clarification. I think there was a preference during
discussion to make this multi-threaded so we don't rely on a single
kmmscand thread; perhaps this would be (at minimum) one kmmscand thread
per NUMA node?

> > Wei Xu asked if the scan period should be interpreted as the minimal
> > interval between scans because kmmscand is single threaded and there are
> > many processes. Raghu confirmed this is correct, the minimal delay.
> > Even if the scan period is 400ms, in reality it could be multiple seconds
> > based on load.
> >
> > Liam Howlett asked how we could have two scans colliding in a time
> > segment. Raghu noted if we are able to complete the last scan in less
> > time than 400ms, then we have this delay to avoid continuously scanning
> > that results in increased cpu overhead. Liam further asked if processes
> > opt into a scan or out of the scan, Raghu noted we always scan every
> > process. John Hubbard suggested that we have per-process control.
>
> +1 for prctl()
>
> Also I want to add that, I will get data on:
>
> what is the min and max time required to finish the entire scan for the
> current micro-benchmark and one of the real workloads (such as Redis/
> Rocksdb...), so that we can check if we are meeting the deadline of
> scanning with a single kthread.
>

Do we want more fine-grained per-process control other than just the
ability to opt out entire processes? There may be situations where we
want to always serve latency tolerant jobs from CXL extended memory and
don't care to ever promote their memory, but I also think there will be
processes that are between the two extremes (latency critical and latency
tolerant).

I think careful consideration needs to be given to how we handle
per-process policy for multi-tenant systems that have different levels of
latency sensitivity. If kmmscand becomes the standard way of doing page
promotion in the kernel, the userspace API to inform it of these policy
decisions is going to be key. There have been approaches where this was
primarily driven by BPF, which have to solve the same challenge.

> > Wei noted an important point about separating hot page detection and
> > promotion, which don't actually need to be coupled at all. This uses
> > page table scanning while future support may not need to leverage this at
> > all. We'd very much like to avoid multiple promotion solutions for
> > different ways to track page hotness.
> >
> > I strongly supported this because I believe for CXL, at least within the
> > next three years, that memory hotness will likely not be derived from
> > page table Accessed bit scanning. Zi Yan agreed.
> >
> > The promotion path may also want to be much less aggressive than on first
> > access. Raghu showed many improvements, including handling short lived
> > processes, more accurate hot page detection using timestamp, etc.
>
> Some of these TODOs can be implemented in next version.
>

Thanks!
Are you planning on sending out another RFC patch series soon or are you
interested in publishing this on git.kernel.org or github? There may be
an opportunity for others to send you pull requests into the series of
patches while we discuss.

> > ----->o-----
> > I followed up on a discussion point early in the talk about whether this
> > should be virtual address scanning like the current approach, walking
> > mm_struct's, or the alternative approach which would be physical address
> > scanning.
> >
> > Raghu sees this as a fully alternative approach such as what DAMON uses
> > that is based on rmap. The only advantage appears to be avoiding
> > scanning on top tier memory completely.
>
> Having clarity here would help. Both approaches have their own pros
> and cons.
>
> Need to also explore using/reusing DAMON/MGLRU to the extent possible
> based on the approach.
>

Yeah, I definitely think this is a key point to discuss early on.
Gregory had indicated that unmapped file cache is one of the key downsides
to using only virtual memory scanning.

While things like the CHMU are still on the way, I think there's benefit
to making incremental progress from what we currently have available (NUMA
Balancing) before we get there.
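
To make the scan period tuning from the recap concrete (more responsive,
down to 400ms, when the last scan found accesses; less responsive, up to
5s, when it did not), here is a minimal userspace sketch. Only the 400ms
and 5s bounds come from the discussion. The halve/double policy and the
names clamp_ms() and next_scan_period_ms() are assumptions made for
illustration; they are not taken from the kmmscand patches.

```c
#include <assert.h>

/* Bounds quoted in the discussion: 400ms minimum, 5s maximum. */
#define SCAN_PERIOD_MIN_MS 400u
#define SCAN_PERIOD_MAX_MS 5000u

/* Hypothetical helper: clamp v into the range [lo, hi]. */
static unsigned int clamp_ms(unsigned int v, unsigned int lo, unsigned int hi)
{
	assert(lo <= hi);
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Hypothetical policy (an assumption, not the proposal's actual logic):
 * halve the period if the last scan saw any accessed pages, otherwise
 * double it, and keep the result within the 400ms..5s window.
 */
static unsigned int next_scan_period_ms(unsigned int cur_ms,
					unsigned long accessed_pages)
{
	unsigned int next;

	/* Clamp the input first so doubling cannot overflow. */
	cur_ms = clamp_ms(cur_ms, SCAN_PERIOD_MIN_MS, SCAN_PERIOD_MAX_MS);
	next = accessed_pages ? cur_ms / 2 : cur_ms * 2;

	return clamp_ms(next, SCAN_PERIOD_MIN_MS, SCAN_PERIOD_MAX_MS);
}
```

With this policy, a task that keeps touching slow-tier memory converges to
the 400ms floor, while an idle one backs off to the 5s ceiling. As noted
in the recap, with a single kmmscand thread the period is only a minimal
delay between scans, so the effective interval may be longer under load.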