From: Rakie Kim
To: Gregory Price
Cc: joshua.hahnjy@gmail.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, dan.j.williams@intel.com, ying.huang@linux.alibaba.com, kernel_team@skhynix.com, honggyu.kim@sk.com, yunjeong.mun@sk.com, Rakie Kim
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
Date: Mon, 12 May 2025 17:22:50 +0900
Message-ID: <20250512082257.263-1-rakie.kim@sk.com>

On Fri, 9 May 2025 01:49:59 -0400 Gregory Price wrote:
> On Fri, May 09, 2025 at 11:30:26AM +0900, Rakie Kim wrote:
> >
> > Scenario 1: Adapt weighting based on the task's execution node
> > A task prefers only the DRAM and locally attached CXL memory of the
> > socket on which it is running, in order to avoid cross-socket access and
> > optimize bandwidth.
> > - A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
> > - A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
> ... snip ...
> >
> > However, Scenario 1 does not depend on such information. Rather, it is
> > a locality-preserving optimization where we isolate memory access to
> > each socket's DRAM and CXL nodes. I believe this use case is implementable
> > today and worth considering independently from interconnect performance
> > awareness.
> >
>
> There's nothing to implement - all the controls exist:
>
> 1) --cpunodebind=0
> 2) --weighted-interleave=0,2
> 3) cpuset.mems
> 4) cpuset.cpus

Thank you again for your thoughtful response and the detailed suggestions.

As you pointed out, it is indeed possible to construct node-local memory
allocation behaviors using the existing interfaces such as --cpunodebind,
--weighted-interleave, cpuset.mems, and cpuset.cpus. I appreciate you
highlighting that path.

However, what I am proposing in Scenario 1 (Adapt weighting based on the
task's execution node) is slightly different in intent.

The idea is to allow tasks to dynamically prefer the DRAM and CXL nodes
attached to the socket on which they are executing without requiring a
fixed execution node or manual nodemask configuration. For instance, if
a task is running on node0, it would prefer node0 and node2; if running on
node1, it would prefer node1 and node3.

This differs from the current model, which relies on statically binding
both the CPU and memory nodes. My proposal aims to express this behavior as
a policy-level abstraction that dynamically adapts based on execution
locality.

So rather than being a combination of manual configuration and execution
constraints, the intent is to incorporate locality-awareness into the
memory policy itself.

>
> You might consider maybe something like "--local-tier" (akin to
> --localalloc) that sets an explicitly fallback set based on the local
> node. You'd end up doing something like
>
> current_nid = memtier_next_local_node(socket_nid, current_nid)
>
> Where this interface returns the preferred fallback ordering but doesn't
> allow cross-socket fallback.
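
To make the suggestion above concrete, here is a minimal userspace sketch
of what a memtier_next_local_node()-style helper might do. No such function
exists in the kernel today. The name next_local_node(), the node_socket[]
table, and the four-node layout (node0/node2 on socket 0, node1/node3 on
socket 1) are assumptions taken from the example in this thread, and the
wrap-around ordering is only one possible choice.

```c
#include <assert.h>

#define NR_NODES 4
#define NO_NODE  (-1)

/*
 * Hypothetical topology table: the socket each memory node is attached to.
 * node0 = DRAM0, node1 = DRAM1, node2 = CXL0, node3 = CXL1 (assumed layout).
 */
static const int node_socket[NR_NODES] = { 0, 1, 0, 1 };

/*
 * Userspace model of the suggested interface: return the next node after
 * current_nid that is attached to socket_nid, wrapping around the node
 * range. A node on a different socket is never returned. Returns NO_NODE
 * if the socket has no memory nodes at all.
 */
static int next_local_node(int socket_nid, int current_nid)
{
	assert(current_nid >= 0 && current_nid < NR_NODES);

	for (int step = 1; step <= NR_NODES; step++) {
		int nid = (current_nid + step) % NR_NODES;

		if (node_socket[nid] == socket_nid)
			return nid;
	}
	return NO_NODE;
}
```

With this table, a task on socket 0 would cycle node0 -> node2 -> node0,
and a task on socket 1 would cycle node1 -> node3 -> node1, with no
cross-socket fallback.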
>
> That might be useful, i suppose, in letting a user do:
>
> --cpunodebind=0 --weighted-interleave --local-tier
>
> without having to know anything about the local memory tier structure.

That said, I believe your suggestion for a "--local-tier" option is a very
good one. It could provide a concise, user-friendly way to activate such
locality-aware fallback behavior, even if the underlying mechanism requires
some policy extension.

In this regard, I fully agree that such an interface could greatly help
users express their intent without requiring them to understand the details
of the memory tier topology.

>
> > > At the same time we were discussing this, we were also discussing how to
> > > do external task-mempolicy modifications - which seemed significantly
> > > more useful, but ultimately more complex and without sufficient
> > > interested parties / users.
> >
> > I'd like to learn more about that thread. If you happen to have a pointer
> > to that discussion, it would be really helpful.
> >
>
> https://lore.kernel.org/all/20231122211200.31620-1-gregory.price@memverge.com/
> https://lore.kernel.org/all/ZV5zGROLefrsEcHJ@r13-u19.micron.com/
> https://lore.kernel.org/linux-mm/ZWYsth2CtC4Ilvoz@memverge.com/
> https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/
> There are locking issues with these that aren't easy to fix.
>
> I think the bytedance method uses a task_work queueing to defer a
> mempolicy update to the task itself the next time it makes a kernel/user
> transition. That's probably the best overall approach i've seen.
>
> https://lore.kernel.org/linux-mm/ZWezcQk+BYEq%2FWiI@memverge.com/
> More notes gathered prior to implementing weighted interleave.

Thank you for sharing the earlier links to related discussions and patches.
They were very helpful, and I will review them carefully to gather more
ideas and refine my thoughts further.

I look forward to any further feedback you may have on this topic.
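
For reference, the Scenario 1 behavior could be modeled in userspace
roughly as below. This is only a sketch under stated assumptions: the
four-node layout (node0/node2 on socket 0, node1/node3 on socket 1) and
the weights (DRAM w=3, CXL w=1) come from the example in this thread, the
function name local_weighted_nid() is hypothetical, and the simple modulo
walk is a simplification that makes no claim about how the kernel's
weighted interleave is implemented.

```c
#include <assert.h>

#define NR_NODES 4

/* Assumed layout: node0 = DRAM0, node1 = DRAM1, node2 = CXL0, node3 = CXL1. */
static const int node_socket[NR_NODES] = { 0, 1, 0, 1 };

/* Weights from the Scenario 1 example: DRAM w=3, CXL w=1. */
static const unsigned int node_weight[NR_NODES] = { 3, 3, 1, 1 };

/*
 * Pick the node for the page_idx-th allocation of a task that is currently
 * executing on `socket`. Only nodes attached to that socket are eligible,
 * so the choice follows the task if it migrates to another socket.
 * Returns -1 if the socket has no weighted nodes.
 */
static int local_weighted_nid(int socket, unsigned int page_idx)
{
	unsigned int total = 0;
	unsigned int target;

	assert(socket >= 0);

	/* Sum the weights of the nodes local to this socket. */
	for (int nid = 0; nid < NR_NODES; nid++)
		if (node_socket[nid] == socket)
			total += node_weight[nid];
	if (total == 0)
		return -1;

	/* Walk the local nodes until the target falls inside a node's share. */
	target = page_idx % total;
	for (int nid = 0; nid < NR_NODES; nid++) {
		if (node_socket[nid] != socket)
			continue;
		if (target < node_weight[nid])
			return nid;
		target -= node_weight[nid];
	}
	return -1; /* not reached when total > 0 */
}
```

Under these assumptions, the first four pages of a task on socket 0 land
on node0, node0, node0, node2, and the same task after moving to socket 1
would get node1, node1, node1, node3, without any change to its nodemask.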

Best regards,
Rakie

>
> ~Gregory
>