From: "Huang, Ying" <ying.huang@intel.com>
To: "Yasunori Gotou (Fujitsu)"
Cc: "Zhijian Li (Fujitsu)", Andrew Morton, Greg Kroah-Hartman,
	rafael@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface
In-Reply-To: (Yasunori Gotou's message of "Wed, 31 Jan 2024 06:23:12 +0000")
References: <20231102025648.1285477-1-lizhijian@fujitsu.com>
	<20231102025648.1285477-2-lizhijian@fujitsu.com>
	<878r7g3ktj.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87fryegv9c.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 31 Jan 2024 14:52:15 +0800
Message-ID: <8734uegfkw.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Yasunori Gotou (Fujitsu)" writes:

> Hello,
>
>> Li Zhijian writes:
>>
>> > Hi Ying,
>> >
>> > I need to pick up this thread/patch again.
>> >
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already. A node in a higher tier can demote to any node in the
>> >> lower tiers. What more needs to be displayed in
>> >> nodeX/demotion_nodes?
>> >
>> > Yes, /sys/devices/virtual/memory_tiering/memory_tierN/nodelist is
>> > intended to show the nodes in memory_tierN. But IMHO, it's not
>> > enough, especially for the preferred demotion node(s).
>> >
>> > Currently, when a demotion occurs, it will prioritize selecting a
>> > node from the preferred nodes as the destination node for the
>> > demotion. If the preferred nodes do not meet the requirements, it
>> > will try all the lower memory tier nodes until it finds a suitable
>> > demotion destination node, or ultimately fails.
>> >
>> > However, the kernel currently only lists the nodes of each tier. If
>> > administrators want to know all the possible demotion destinations
>> > for a given node, they need to calculate it themselves:
>> > Step 1, find the memory tier where the given node is located.
>> > Step 2, list all nodes under all its lower tiers.
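>> > As a rough sketch, that manual calculation could look like this in
>> > shell (it assumes the current sysfs layout, and that a larger
>> > memory_tier device ID means a larger abstract distance, i.e. a
>> > lower tier):
>> >
>> > #!/bin/sh
>> > # Print all possible demotion destinations of node $1 by hand:
>> > # find its tier, then dump the nodelist of every lower tier.
>> > node=$1
>> > base=/sys/devices/virtual/memory_tiering
>> >
>> > expand() {  # expand a nodelist like "0-1,5" into one node per line
>> >     echo "$1" | tr ',' '\n' | while IFS=- read -r a b; do
>> >         seq "$a" "${b:-$a}"
>> >     done
>> > }
>> >
>> > # Step 1: find the memory tier that contains $node.
>> > for t in "$base"/memory_tier*; do
>> >     expand "$(cat "$t/nodelist")" | grep -qx "$node" && tier=${t##*memory_tier}
>> > done
>> >
>> > # Step 2: list the nodes of all lower tiers (larger device ID).
>> > for t in "$base"/memory_tier*; do
>> >     [ "${t##*memory_tier}" -gt "$tier" ] && cat "$t/nodelist"
>> > done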
>> >
>> > It will be even more difficult to know the preferred nodes, which
>> > depend on more factors, such as distance. For the following example,
>> > we may have 6 nodes split into three memory tiers.
>> >
>> > For an emulated HMAT NUMA topology, for example:
>> >> $ numactl -H
>> >> available: 6 nodes (0-5)
>> >> node 0 cpus: 0
>> >> node 0 size: 1974 MB
>> >> node 0 free: 1767 MB
>> >> node 1 cpus: 1
>> >> node 1 size: 1694 MB
>> >> node 1 free: 1454 MB
>> >> node 2 cpus:
>> >> node 2 size: 896 MB
>> >> node 2 free: 896 MB
>> >> node 3 cpus:
>> >> node 3 size: 896 MB
>> >> node 3 free: 896 MB
>> >> node 4 cpus:
>> >> node 4 size: 896 MB
>> >> node 4 free: 896 MB
>> >> node 5 cpus:
>> >> node 5 size: 896 MB
>> >> node 5 free: 896 MB
>> >> node distances:
>> >> node   0   1   2   3   4   5
>> >>   0:  10  31  21  41  21  41
>> >>   1:  31  10  41  21  41  21
>> >>   2:  21  41  10  51  21  51
>> >>   3:  31  21  51  10  51  21
>> >>   4:  21  41  21  51  10  51
>> >>   5:  31  21  51  21  51  10
>> >> $ cat memory_tier4/nodelist
>> >> 0-1
>> >> $ cat memory_tier12/nodelist
>> >> 2,5
>> >> $ cat memory_tier54/nodelist
>> >> 3-4
>> >
>> > For the above topology, memory-tier will build the demotion path for
>> > each node like this:
>> > node[0].preferred = 2
>> > node[0].demotion_targets = 2-5
>> > node[1].preferred = 5
>> > node[1].demotion_targets = 2-5
>> > node[2].preferred = 4
>> > node[2].demotion_targets = 3-4
>> > node[3].preferred =
>> > node[3].demotion_targets =
>> > node[4].preferred =
>> > node[4].demotion_targets =
>> > node[5].preferred = 3
>> > node[5].demotion_targets = 3-4
>> >
>> > But this demotion path is not explicitly known to the administrator.
>> > And from the feedback of our customers, they also think it is
>> > helpful to know the demotion path built by the kernel, to understand
>> > the demotion behaviors.
>> >
>> > So I think we should have 2 new interfaces for each node:
>> >
>> > /sys/devices/system/node/nodeN/demotion_allowed_nodes
>> > /sys/devices/system/node/nodeN/demotion_preferred_nodes
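>> >
>> > For the example topology above, reading them could look like this
>> > (hypothetical output, mirroring the demotion path listed above):
>> >
>> > $ cat /sys/devices/system/node/node0/demotion_preferred_nodes
>> > 2
>> > $ cat /sys/devices/system/node/node0/demotion_allowed_nodes
>> > 2-5
>> > $ cat /sys/devices/system/node/node3/demotion_allowed_nodes
>> >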
>> >
>> > I value your opinion, and I'd like to know what you think about it...
>>
>> Per my understanding, we will not expose everything inside the kernel
>> to user space. For page placement in a tiered memory system, demotion
>> is just a part of the story. For example, if the DRAM of a system
>> becomes full, new page allocation will fall back to the CXL memory.
>> Have we exposed the default page allocation fallback order to user
>> space?
>
> In extreme terms, users want to analyze all the memory behaviors of
> memory management while executing their workload, and want to trace
> ALL of them if possible. Of course, that is impossible due to the
> heavy load, so users want to have other ways as a compromise. Our
> request, the demotion target information, is just one of them.
>
> In my impression, users worry about the impact of the CXL memory
> device on their workload, and want to have a way to understand that
> impact. If they know there is no information to remove their anxiety,
> they may avoid buying CXL memory.
>
> In addition, our support team also needs to have clues to solve users'
> performance problems. Even if new page allocation will fall back to
> the CXL memory, we need to explain why it would happen, for
> accountability.

I guess /proc/<pid>/numa_maps and
/sys/fs/cgroup/<cgroup>/memory.numa_stat may help to understand system
behavior.
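For example (a rough sketch; <pid> and <cgroup> are placeholders, and
memory.numa_stat assumes cgroup v2):

$ grep 'N3=' /proc/<pid>/numa_maps
    # mappings of the process that have pages on node 3
$ grep '^anon ' /sys/fs/cgroup/<cgroup>/memory.numa_stat
    # anonymous memory usage of the cgroup, broken down per node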
--
Best Regards,
Huang, Ying

>> All in all, in my opinion, we should only expose as little as
>> possible to user space, because we need to maintain the ABI forever.
>
> I can understand there is a compatibility problem with our proposal,
> and the kernel may change its logic in the future. This is a
> tug-of-war between kernel developers and users or support engineers.
> I suppose it often occurs in many places...
>
> Hmm... I hope there is a new idea to solve this situation, even if our
> proposal is rejected... Anyone?
>
> Thanks,
> ----
> Yasunori Goto
>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> >
>> > On 02/11/2023 11:17, Huang, Ying wrote:
>> >> Li Zhijian writes:
>> >>
>> >>> It shows the demotion target nodes of a node. Export this
>> >>> information to the user directly.
>> >>>
>> >>> Below is an example where node0 and node1 are DRAM, and node3 is
>> >>> a PMEM node.
>> >>> - Before PMEM is online, there are no demotion_nodes for node0
>> >>>   and node1.
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>>
>> >>> - After node3 is online as kmem
>> >>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && \
>> >>>   daxctl online-memory dax0.0
>> >>> [
>> >>>   {
>> >>>     "chardev":"dax0.0",
>> >>>     "size":1054867456,
>> >>>     "target_node":3,
>> >>>     "align":2097152,
>> >>>     "mode":"system-ram",
>> >>>     "online_memblocks":0,
>> >>>     "total_memblocks":7
>> >>>   }
>> >>> ]
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node1/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node3/demotion_nodes
>> >>>
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already. A node in a higher tier can demote to any node in the
>> >> lower tiers. What more needs to be displayed in
>> >> nodeX/demotion_nodes?
>> >>
>> >> --
>> >> Best Regards,
>> >> Huang, Ying
>> >>
>> >>> Signed-off-by: Li Zhijian
>> >>> ---
>> >>>  drivers/base/node.c          | 13 +++++++++++++
>> >>>  include/linux/memory-tiers.h |  6 ++++++
>> >>>  mm/memory-tiers.c            |  8 ++++++++
>> >>>  3 files changed, 27 insertions(+)
>> >>>
>> >>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> >>> index 493d533f8375..27e8502548a7 100644
>> >>> --- a/drivers/base/node.c
>> >>> +++ b/drivers/base/node.c
>> >>> @@ -7,6 +7,7 @@
>> >>>  #include
>> >>>  #include
>> >>>  #include
>> >>> +#include <linux/memory-tiers.h>
>> >>>  #include
>> >>>  #include
>> >>>  #include
>> >>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>> >>>  }
>> >>>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>> >>>
>> >>> +static ssize_t demotion_nodes_show(struct device *dev,
>> >>> +				   struct device_attribute *attr, char *buf)
>> >>> +{
>> >>> +	int ret;
>> >>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
>> >>> +
>> >>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> >>> +	return ret;
>> >>> +}
>> >>> +static DEVICE_ATTR_RO(demotion_nodes);
>> >>> +
>> >>>  static struct attribute *node_dev_attrs[] = {
>> >>>  	&dev_attr_meminfo.attr,
>> >>>  	&dev_attr_numastat.attr,
>> >>>  	&dev_attr_distance.attr,
>> >>>  	&dev_attr_vmstat.attr,
>> >>> +	&dev_attr_demotion_nodes.attr,
>> >>>  	NULL
>> >>>  };
>> >>>
>> >>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> >>> index 437441cdf78f..8eb04923f965 100644
>> >>> --- a/include/linux/memory-tiers.h
>> >>> +++ b/include/linux/memory-tiers.h
>> >>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>> >>>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>> >>>  #ifdef CONFIG_MIGRATION
>> >>>  int next_demotion_node(int node);
>> >>> +nodemask_t next_demotion_nodes(int node);
>> >>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> >>>  bool node_is_toptier(int node);
>> >>>  #else
>> >>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>> >>>  	return NUMA_NO_NODE;
>> >>>  }
>> >>>
>> >>> +static inline nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> +	return NODE_MASK_NONE;
>> >>> +}
>> >>> +
>> >>>  static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>>  {
>> >>>  	*targets = NODE_MASK_NONE;
>> >>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> >>> index 37a4f59d9585..90047f37d98a 100644
>> >>> --- a/mm/memory-tiers.c
>> >>> +++ b/mm/memory-tiers.c
>> >>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>>  	rcu_read_unlock();
>> >>>  }
>> >>>
>> >>> +nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> +	if (!node_demotion)
>> >>> +		return NODE_MASK_NONE;
>> >>> +
>> >>> +	return node_demotion[node].preferred;
>> >>> +}
>> >>> +
>> >>>  /**
>> >>>   * next_demotion_node() - Get the next node in the demotion path
>> >>>   * @node: The starting node to lookup the next node