From: Gregory Price
Date: Wed, 18 Dec 2024 19:56:19 -0500
To: David Rientjes
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
    Matthew Wilcox, Mel Gorman, "Rao, Bharata Bhasker", Rik van Riel,
    RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, "Shukla, Santosh",
    "Grimm, Jon", sj@kernel.org, shy828301@gmail.com, Zi Yan,
    Liam Howlett, Gregory Price, linux-mm@kvack.org
Subject: Re: Slow-tier Page Promotion discussion recap and open questions
References: <6d582bb6-3ba5-1768-92f2-6025340a3cd4@google.com>
In-Reply-To: <6d582bb6-3ba5-1768-92f2-6025340a3cd4@google.com>

On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> ----->o-----
> Raghu noted the current promotion destination is node 0 by default. Wei
> noted we could get some page owner information to determine things like
> mempolicies or compute the distance between nodes and, if multiple nodes
> have the same distance, choose one of them just as we do for demotions.
>
> Gregory Price noted some downsides to using mempolicies for this based on
> per-task, per-vma, and cross socket policies, so using the kernel's
> memory tiering policies is probably the best way to go about it.
>

Slightly elaborating here:

- In an async context, associating a page with a specific task is not
  presently possible (that I know of). The most we know is the last
  accessing CPU - maybe - in the page/folio struct. Right now this
  is disabled in favor of a timestamp when tiering is enabled.

  A process with 2 tasks which have access to the page may not run
  on the same socket, so we run the risk of migrating to a bad target.
  Best effort here would suggest either socket is fine - since they're
  both "fast nodes" - but this requires that we record the last
  accessing CPU for a page at identification time.

- Even if you could associate a page with a particular task, the task
  and/or cgroup are not guaranteed to have a socket affinity. Though
  obviously if it does, that can be used (it just doesn't satisfy the
  default behavior). Basically just saying we shouldn't depend on this.

- Per-vma mempolicies are a potential solution, but they're not very
  common in the wild - software would have to become NUMA-aware and
  utilize mbind() on particular memory regions. Likewise, we shouldn't
  depend on this either.

- This holds for future mechanisms like CHMU, whose access data is
  even more abstract (no concept of accessing task / cpu / owner at all).

More generally - in an async scanning context it's presently not
possible to identify the optimal promotion node - and it likely is
not possible without userland hints.
So probably we should just leverage static configuration data (HMAT)
and some basic math to put together a promotion target in a similar
way to how we calculate a demotion target.

Long-winded way of saying I don't think an optimal solution is possible,
so let's start with suboptimal and get data.

> ----->o-----
> My takeaways:
>
> - there is a definite need to separate hot page detection and the
>   promotion path since hot pages may be derived from multiple sources,
>   including hardware assists in the future
>
> - for the hot page tracking itself, a common abstraction to be used that
>   can effectively describe hotness regardless of the backend it is
>   deriving its information from would likely be quite useful
>

In a synchronous context (Accessing Task), something like:

    target_node = numa_node_id();  /* node of the cpu we're currently on */
    promote_pagevec(vec, target_node, PROMOTE_DEFER);

where the promotion logic then does something like:

    promote_batch(pagevec, target)

In an asynchronous context (Scanning Task), something like:

    promote_pagevec(vec, -1, PROMOTE_DEFER);

where the promotion logic then does something like:

    for page in pagevec:
        target = memory_tiers_promotion_target(page_to_nid(page))
        promote(page, target)

Plumbing-wise this can be optimized to group similarly located pages
into a sub-pagevec and use promote_batch() semantics.

My gut says this is the best we're going to get, since async contexts
can't identify accessor locations easily (especially CHMU).

> - I think virtual memory scanning is likely the only viable approach for

Hard disagree. Virtual memory scanning misses an entire class of memory:
unmapped file cache.

https://lore.kernel.org/linux-mm/20241210213744.2968-1-gourry@gourry.net/

>   this purpose and we could store state in the underlying struct page,

This is contentious. Look at folio->_last_cpupid for context, we're
already overloading fields in subtle ways to steal a 32 bit area.
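
To make the "static configuration data plus some basic math" idea above
a bit more concrete, here is a rough userspace sketch in C of the async
path: look up a promotion target per source node from a distance table,
then group pages by target so each group could go through something
like promote_batch(). Everything here is an assumption for illustration:
the distance values, the node_is_fast[] table, and the names
promotion_target(), build_batches(), struct hot_page and struct
promo_batch are hypothetical and not existing kernel interfaces. Ties
are broken by lowest node id for simplicity, rather than picking among
equally distant nodes the way demotion does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NR_NODES  4
#define NO_NODE   (-1)
#define BATCH_MAX 16

/* Hypothetical SLIT/HMAT-style distances; the values are made up.
 * Nodes 0 and 1 are fast (DRAM) nodes, nodes 2 and 3 are slow-tier. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 14, 24 },
	{ 21, 10, 24, 14 },
	{ 14, 24, 10, 26 },
	{ 24, 14, 26, 10 },
};
static const int node_is_fast[NR_NODES] = { 1, 1, 0, 0 };

/* Return the nearest fast node to src, or NO_NODE if src is already
 * a fast node. Ties go to the lowest node id (a simplification). */
static int promotion_target(int src)
{
	int best = NO_NODE;
	int best_dist = INT_MAX;

	assert(src >= 0 && src < NR_NODES);
	if (node_is_fast[src])
		return NO_NODE;

	for (int n = 0; n < NR_NODES; n++) {
		if (!node_is_fast[n])
			continue;
		if (node_distance[src][n] < best_dist) {
			best_dist = node_distance[src][n];
			best = n;
		}
	}
	return best;
}

/* A hot page as reported by some detection backend (hypothetical). */
struct hot_page {
	unsigned long pfn;
	int nid;	/* node the page currently lives on */
};

/* One batch of pages that share a promotion target. */
struct promo_batch {
	size_t count;
	unsigned long pfns[BATCH_MAX];
};

/* Sort pages into one batch per target node. Pages that are already
 * on a fast node, or that do not fit in a full batch, are skipped.
 * Returns the number of pages queued for promotion. */
static size_t build_batches(const struct hot_page *pages, size_t nr,
			    struct promo_batch batches[NR_NODES])
{
	size_t queued = 0;

	for (int n = 0; n < NR_NODES; n++)
		batches[n].count = 0;

	for (size_t i = 0; i < nr; i++) {
		int target = promotion_target(pages[i].nid);
		struct promo_batch *b;

		if (target == NO_NODE)
			continue;
		b = &batches[target];
		if (b->count == BATCH_MAX)
			continue;	/* a real implementation would flush here */
		b->pfns[b->count++] = pages[i].pfn;
		queued++;
	}
	return queued;
}
```

Calling build_batches() on a list of hot pages from a small main() fills
one batch per fast node. The point of the split is that target selection
only needs the page's current node, so it works the same whether the hot
page came from a page table scan or from a device-side tracker.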
>   similar to NUMA Balancing, but that all scanning should be driven by
>   walking the mm_struct's to harvest the Accessed bit
>

If the goal is to do multi-tenant tiering (i.e. many mm_struct's), then
this scales poorly by design.

Elsewhere, folks agreed that CXL-memory will have HMU-driven hotness data
as the primary mechanism. This is a physical-memory hotness tracking
mechanism that avoids scanning page tables or page structs.

If we think that's the direction it's going, then we shouldn't bother
investing a ton of effort into a virtual-memory driven design as the
primary user. (Sure, support it, but don't dive too much further in)

> - if there is any general pushback on leveraging a kthread for this,
>   this would be very good feedback to have early
>

I think for the promotion system, having one or more kthreads based on
promotion pressure is a good idea.

I'm not sure how well this will scale for many-process, high-memory
systems (1TB+ on a scanning interval of 256MB is very low accuracy).

Need more data!

~Gregory