From: "Huang, Ying"
To: "Zhijian Li (Fujitsu)"
Cc: Andrew Morton, Greg Kroah-Hartman, "rafael@kernel.org",
    "linux-mm@kvack.org", "Yasunori Gotou (Fujitsu)",
    "linux-kernel@vger.kernel.org"
Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
In-Reply-To: (Zhijian Li's message of "Thu, 2 Nov 2023 05:54:47 +0000")
References: <20231102025648.1285477-1-lizhijian@fujitsu.com>
    <20231102025648.1285477-2-lizhijian@fujitsu.com>
    <878r7g3ktj.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87zfzw20nd.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 02 Nov 2023 13:58:36 +0800
Message-ID: <87msvw1ysz.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Zhijian Li (Fujitsu)" writes:

> On 02/11/2023 13:18, Huang, Ying wrote:
>> "Zhijian Li (Fujitsu)" writes:
>>
>>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>>> already.  A node in a higher tier can demote to any node in the lower
>>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>
>>> IIRC, they are not the same.
>>> memory_tier[number], where the number is shared by
>>> the memory using the same memory driver(dax/kmem etc). Not reflect the actual distance
>>> across nodes(different distance will be grouped into the same memory_tier).
>>> But demotion will only select the nearest nodelist to demote.
>>
>> In the following patchset, we will use the performance information from
>> HMAT to place nodes using the same memory driver into different memory
>> tiers.
>>
>> https://lore.kernel.org/all/20230926060628.265989-1-ying.huang@intel.com/
>
> Thanks for your reminder. It seems like I've fallen behind the world by months.
> I will rebase on it later if this patch is still needed.
>
>>
>> The patch is in mm-stable tree.
>>
>>> Below is an example, node0 node1 are DRAM, node2 node3 are PMEM, but distance to DRAM nodes
>>> are different.
>>>
>>> # numactl -H
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0
>>> node 0 size: 964 MB
>>> node 0 free: 746 MB
>>> node 1 cpus: 1
>>> node 1 size: 685 MB
>>> node 1 free: 455 MB
>>> node 2 cpus:
>>> node 2 size: 896 MB
>>> node 2 free: 897 MB
>>> node 3 cpus:
>>> node 3 size: 896 MB
>>> node 3 free: 896 MB
>>> node distances:
>>> node   0   1   2   3
>>>   0:  10  20  20  25
>>>   1:  20  10  25  20
>>>   2:  20  25  10  20
>>>   3:  25  20  20  10
>>> # cat /sys/devices/system/node/node0/demotion_nodes
>>> 2
>>
>> node 2 is only the preferred demotion target.  In fact, memory in node 0
>> can be demoted to node 2,3.  Please check demote_folio_list() for
>> details.
>
> Have I missed something, at least the on master tree, nd->preferred only include the
> nearest ones(by specific algorithms), so in above numa topology, nd->preferred of
> node0 is node2 only. node0 distance to node3 is 25 greater than to node2(20).
>
>> 1657         int target_nid = next_demotion_node(pgdat->node_id);
>
> So target_nid cannot be node3 IIUC.
>
> (I cooked this patches weeks ago, maybe something has changed, i will also take a deep look later.)
>
> 1650 /*
> 1651  * Take folios on @demote_folios and attempt to demote them to another node.
> 1652  * Folios which are not demoted are left on @demote_folios.
> 1653  */
> 1654 static unsigned int demote_folio_list(struct list_head *demote_folios,
> 1655                                       struct pglist_data *pgdat)
> 1656 {
> 1657         int target_nid = next_demotion_node(pgdat->node_id);
> 1658         unsigned int nr_succeeded;
> 1659         nodemask_t allowed_mask;
> 1660
> 1661         struct migration_target_control mtc = {
> 1662                 /*
> 1663                  * Allocate from 'node', or fail quickly and quietly.
> 1664                  * When this happens, 'page' will likely just be discarded
> 1665                  * instead of migrated.
> 1666                  */
> 1667                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> 1668                         __GFP_NOMEMALLOC | GFP_NOWAIT,
> 1669                 .nid = target_nid,
> 1670                 .nmask = &allowed_mask
> 1671         };
> 1672
> 1673         if (list_empty(demote_folios))
> 1674                 return 0;
> 1675
> 1676         if (target_nid == NUMA_NO_NODE)
> 1677                 return 0;
> 1678
> 1679         node_get_allowed_targets(pgdat, &allowed_mask);
> 1680
> 1681         /* Demotion ignores all cpuset and mempolicy settings */
> 1682         migrate_pages(demote_folios, alloc_demote_folio, NULL,
> 1683                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> 1684                       &nr_succeeded);
>

In alloc_demote_folio(), target_nid is tried first.  Then, if that
allocation fails, any node in allowed_mask will be tried.

--
Best Regards,
Huang, Ying

>>
>>> # cat /sys/devices/system/node/node1/demotion_nodes
>>> 3
>>> # cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
>>> 2-3
>>>
>>> Thanks
>>> Zhijian
>>>
>>> (I hate the outlook native reply composition format.)
>>> ________________________________________
>>> From: Huang, Ying
>>> Sent: Thursday, November 2, 2023 11:17
>>> To: Li, Zhijian/李 智坚
>>> Cc: Andrew Morton; Greg Kroah-Hartman; rafael@kernel.org; linux-mm@kvack.org; Gotou, Yasunori/五島 康文; linux-kernel@vger.kernel.org
>>> Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
>>>
>>> Li Zhijian writes:
>>>
>>>> It shows the demotion target nodes of a node. Export this information to
>>>> user directly.
>>>>
>>>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>>>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>>
>>>> - After node3 is online as kmem
>>>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>>>> [
>>>>   {
>>>>     "chardev":"dax0.0",
>>>>     "size":1054867456,
>>>>     "target_node":3,
>>>>     "align":2097152,
>>>>     "mode":"system-ram",
>>>>     "online_memblocks":0,
>>>>     "total_memblocks":7
>>>>   }
>>>> ]
>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>> 3
>>>> $ cat /sys/devices/system/node/node1/demotion_nodes
>>>> 3
>>>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>>>
>>>
>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>> already.  A node in a higher tier can demote to any node in the lower
>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>
>>> --
>>> Best Regards,
>>> Huang, Ying
>>>
>>>> Signed-off-by: Li Zhijian
>>>> ---
>>>>  drivers/base/node.c          | 13 +++++++++++++
>>>>  include/linux/memory-tiers.h |  6 ++++++
>>>>  mm/memory-tiers.c            |  8 ++++++++
>>>>  3 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index 493d533f8375..27e8502548a7 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -7,6 +7,7 @@
>>>>  #include
>>>>  #include
>>>>  #include
>>>> +#include
>>>>  #include
>>>>  #include
>>>>  #include
>>>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>>>  }
>>>>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>>>
>>>> +static ssize_t demotion_nodes_show(struct device *dev,
>>>> +                                   struct device_attribute *attr, char *buf)
>>>> +{
>>>> +        int ret;
>>>> +        nodemask_t nmask = next_demotion_nodes(dev->id);
>>>> +
>>>> +        ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>>>> +        return ret;
>>>> +}
>>>> +static DEVICE_ATTR_RO(demotion_nodes);
>>>> +
>>>>  static struct attribute *node_dev_attrs[] = {
>>>>          &dev_attr_meminfo.attr,
>>>>          &dev_attr_numastat.attr,
>>>>          &dev_attr_distance.attr,
>>>>          &dev_attr_vmstat.attr,
>>>> +        &dev_attr_demotion_nodes.attr,
>>>>          NULL
>>>>  };
>>>>
>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>> index 437441cdf78f..8eb04923f965 100644
>>>> --- a/include/linux/memory-tiers.h
>>>> +++ b/include/linux/memory-tiers.h
>>>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>>>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>>>  #ifdef CONFIG_MIGRATION
>>>>  int next_demotion_node(int node);
>>>> +nodemask_t next_demotion_nodes(int node);
>>>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>>  bool node_is_toptier(int node);
>>>>  #else
>>>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>>>          return NUMA_NO_NODE;
>>>>  }
>>>>
>>>> +static inline nodemask_t next_demotion_nodes(int node)
>>>> +{
>>>> +        return NODE_MASK_NONE;
>>>> +}
>>>> +
>>>>  static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>  {
>>>>          *targets = NODE_MASK_NONE;
>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>> index 37a4f59d9585..90047f37d98a 100644
>>>> --- a/mm/memory-tiers.c
>>>> +++ b/mm/memory-tiers.c
>>>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>          rcu_read_unlock();
>>>>  }
>>>>
>>>> +nodemask_t next_demotion_nodes(int node)
>>>> +{
>>>> +        if (!node_demotion)
>>>> +                return NODE_MASK_NONE;
>>>> +
>>>> +        return node_demotion[node].preferred;
>>>> +}
>>>> +
>>>>  /**
>>>>   * next_demotion_node() - Get the next node in the demotion path
>>>>   * @node: The starting node to lookup the next node