From: "Huang, Ying" <ying.huang@intel.com>
To: "Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com>
Cc: "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com>,
Andrew Morton <akpm@linux-foundation.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
"rafael@kernel.org" <rafael@kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
Date: Wed, 31 Jan 2024 14:52:15 +0800
Message-ID: <8734uegfkw.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <TYWPR01MB100828CBEE3E032C191C6D08E907C2@TYWPR01MB10082.jpnprd01.prod.outlook.com> (Yasunori Gotou's message of "Wed, 31 Jan 2024 06:23:12 +0000")
"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com> writes:
> Hello,
>
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>>
>> > Hi Ying
>> >
>> > I need to pick up this thread/patch again.
>> >
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already. A node in a higher tier can demote to any node in the lower
>> >> tiers. What more needs to be displayed in nodeX/demotion_nodes?
>> >>
>> >
>> > Yes, I believe
>> > /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
>> > is intended to show the nodes in memory_tierN. But IMHO, that is not
>> > enough, especially for the preferred demotion node(s).
>> >
>> > Currently, when a demotion occurs, the kernel first tries to select
>> > the destination node from the preferred nodes. If the preferred
>> > nodes do not meet the requirements, it tries all the lower memory
>> > tier nodes until it finds a suitable demotion destination node or
>> > ultimately fails.
>> >
>> > However, the kernel currently only lists the nodes of each tier. If
>> > administrators want to know all the possible demotion destinations for
>> > a given node, they need to calculate it themselves:
>> > Step 1, find the memory tier where the given node is located
>> > Step 2, list all nodes under all its lower tiers
>> >
>> > It is even more difficult to work out the preferred nodes, which
>> > depend on more factors such as distance. In the following example,
>> > we have 6 nodes split into three memory tiers.
>> >
>> > Example with an emulated HMAT NUMA topology:
>> >> $ numactl -H
>> >> available: 6 nodes (0-5)
>> >> node 0 cpus: 0
>> >> node 0 size: 1974 MB
>> >> node 0 free: 1767 MB
>> >> node 1 cpus: 1
>> >> node 1 size: 1694 MB
>> >> node 1 free: 1454 MB
>> >> node 2 cpus:
>> >> node 2 size: 896 MB
>> >> node 2 free: 896 MB
>> >> node 3 cpus:
>> >> node 3 size: 896 MB
>> >> node 3 free: 896 MB
>> >> node 4 cpus:
>> >> node 4 size: 896 MB
>> >> node 4 free: 896 MB
>> >> node 5 cpus:
>> >> node 5 size: 896 MB
>> >> node 5 free: 896 MB
>> >> node distances:
>> >> node   0   1   2   3   4   5
>> >>   0:  10  31  21  41  21  41
>> >>   1:  31  10  41  21  41  21
>> >>   2:  21  41  10  51  21  51
>> >>   3:  31  21  51  10  51  21
>> >>   4:  21  41  21  51  10  51
>> >>   5:  31  21  51  21  51  10
>> >> $ cat memory_tier4/nodelist
>> >> 0-1
>> >> $ cat memory_tier12/nodelist
>> >> 2,5
>> >> $ cat memory_tier54/nodelist
>> >> 3-4
>> >
>> > For the above topology, memory-tier will build the demotion path for
>> > each node like this:
>> > node[0].preferred = 2
>> > node[0].demotion_targets = 2-5
>> > node[1].preferred = 5
>> > node[1].demotion_targets = 2-5
>> > node[2].preferred = 4
>> > node[2].demotion_targets = 3-4
>> > node[3].preferred = <empty>
>> > node[3].demotion_targets = <empty>
>> > node[4].preferred = <empty>
>> > node[4].demotion_targets = <empty>
>> > node[5].preferred = 3
>> > node[5].demotion_targets = 3-4
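For reference, the two-step calculation and the listing above can be
reproduced from user space with a rough sketch like the one below. It
assumes that a smaller memory_tierN number means a higher (faster) tier,
and that the preferred target is simply the nearest node (by numactl
distance) in the next lower tier. That second point is my simplified
reading of the kernel logic, not a guarantee, and the helper names
parse_nodelist() and demotion_targets() are made up for illustration.

```python
# Sketch: derive demotion targets from the tier nodelists and the
# distance matrix quoted above. Assumption: smaller tier ID = higher tier.


def parse_nodelist(text):
    """Parse a sysfs node list such as "0-1" or "2,5" into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        nodes.update(range(int(lo), int(hi or lo) + 1))
    return nodes


def demotion_targets(tiers, distance, node):
    """Return (preferred, allowed) demotion target sets for a node.

    tiers:    dict mapping tier ID -> set of node IDs
    distance: distance[a][b] as printed by `numactl -H`
    """
    # Step 1: find the tier that contains the node.
    own = next(t for t, nodes in tiers.items() if node in nodes)
    # Step 2: every node in every lower tier is an allowed target.
    lower = sorted(t for t in tiers if t > own)
    if not lower:
        return set(), set()
    allowed = set().union(*(tiers[t] for t in lower))
    # Simplification: preferred = nearest node(s) in the next lower tier.
    next_tier = tiers[lower[0]]
    best = min(distance[node][n] for n in next_tier)
    preferred = {n for n in next_tier if distance[node][n] == best}
    return preferred, allowed


# Data copied from the numactl and nodelist output quoted above.
TIERS = {
    4: parse_nodelist("0-1"),
    12: parse_nodelist("2,5"),
    54: parse_nodelist("3-4"),
}
DISTANCE = [
    [10, 31, 21, 41, 21, 41],
    [31, 10, 41, 21, 41, 21],
    [21, 41, 10, 51, 21, 51],
    [31, 21, 51, 10, 51, 21],
    [21, 41, 21, 51, 10, 51],
    [31, 21, 51, 21, 51, 10],
]

for node in range(6):
    preferred, allowed = demotion_targets(TIERS, DISTANCE, node)
    print(f"node[{node}].preferred = {sorted(preferred)}")
    print(f"node[{node}].demotion_targets = {sorted(allowed)}")
```

With the data from this thread, the sketch gives the same preferred and
demotion_targets sets as the listing quoted above. On a real system the
tier data would come from the memory_tier*/nodelist files instead.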
>> >
>> > But this demotion path is not explicitly visible to the administrator.
>> > According to feedback from our customers, they also think it would be
>> > helpful to know the demotion path built by the kernel in order to
>> > understand the demotion behavior.
>> >
>> > So I think we should have 2 new interfaces for each node:
>> >
>> > /sys/devices/system/node/nodeN/demotion_allowed_nodes
>> > /sys/devices/system/node/nodeN/demotion_preferred_nodes
>> >
>> > I value your opinion, and I'd like to know what you think about...
>>
>> Per my understanding, we will not expose everything inside kernel to user
>> space. For page placement in a tiered memory system, demotion is just a part
>> of the story. For example, if the DRAM of a system becomes full, new page
>> allocation will fall back to the CXL memory. Have we exposed the default page
>> allocation fallback order to user space?
>
> In extreme terms, users want to analyze all the memory management behavior
> while executing their workload, and want to trace ALL of it if possible.
> Of course, that is impossible due to the heavy load, so users want other ways as
> a compromise. Our request, the demotion target information, is just one of them.
>
> My impression is that users worry about the impact of the CXL memory device on their
> workload, and want a way to understand that impact.
> If they find there is no information to relieve their anxiety, they may avoid buying CXL memory.
>
> In addition, our support team also needs clues to solve users' performance problems.
> Even if new page allocation falls back to the CXL memory, we need to be able to explain
> why it happens, as a matter of accountability.

I guess

  /proc/<PID>/numa_maps
  /sys/fs/cgroup/<CGNAME>/memory.numa_stat

may help to understand system behavior.
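
As a rough illustration of how /proc/<PID>/numa_maps could be used for
this, the sketch below sums the per-node page counts (the N<node>=<pages>
fields) over all mappings of a process. The function names and the sample
text are made up for the example; only the N<node>=<pages> field format
is taken from the numa_maps documentation.

```python
import re
from collections import Counter

# A numa_maps field of the form N<node>=<pages>, e.g. "N0=100".
_NODE_FIELD = re.compile(r"N(\d+)=(\d+)")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all mappings.

    Note: each count is in units of that mapping's own page size
    (see kernelpagesize_kB), so huge page mappings are not comparable
    one-to-one with base page mappings.
    """
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for field in line.split():
            match = _NODE_FIELD.fullmatch(field)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


def read_numa_maps(pid):
    """Read and summarize /proc/<pid>/numa_maps for a live process."""
    with open(f"/proc/{pid}/numa_maps") as f:
        return pages_per_node(f.read())


# Hypothetical sample, only to show the input shape.
SAMPLE = """\
00400000 default file=/usr/bin/app mapped=10 N0=10 kernelpagesize_kB=4
7f0000000000 default anon=300 dirty=300 N0=100 N2=200 kernelpagesize_kB=4
"""

print(pages_per_node(SAMPLE))
```

Sampling this periodically for a workload should show whether its pages
are drifting from the DRAM nodes to the CXL nodes over time.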
--
Best Regards,
Huang, Ying
>>
>> All in all, in my opinion, we expose as little as possible to user space
>> because we need to maintain the ABI forever.
>
> I understand that our proposal creates a compatibility problem, and the kernel may
> change its logic in the future. This is a tug-of-war between kernel developers
> and users or support engineers. I suppose it occurs in many places...
>
> Hmm... I hope there is a new idea to solve this situation even if our proposal is rejected.
> Anyone?
>
> Thanks,
> ----
> Yasunori Goto
>
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> >
>> > On 02/11/2023 11:17, Huang, Ying wrote:
>> >> Li Zhijian <lizhijian@fujitsu.com> writes:
>> >>
>> >>> It shows the demotion target nodes of a node. Export this
>> >>> information to user directly.
>> >>>
>> >>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>> >>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>> <show nothing>
>> >>> - After node3 is online as kmem
>> >>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 &&
>> >>> daxctl online-memory dax0.0
>> >>> [
>> >>> {
>> >>> "chardev":"dax0.0",
>> >>> "size":1054867456,
>> >>> "target_node":3,
>> >>> "align":2097152,
>> >>> "mode":"system-ram",
>> >>> "online_memblocks":0,
>> >>> "total_memblocks":7
>> >>> }
>> >>> ]
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node1/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node3/demotion_nodes
>> >>> <show nothing>
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already. A node in a higher tier can demote to any node in the lower
>> >> tiers. What more needs to be displayed in nodeX/demotion_nodes?
>> >> --
>> >> Best Regards,
>> >> Huang, Ying
>> >>
>> >>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> >>> ---
>> >>> drivers/base/node.c | 13 +++++++++++++
>> >>> include/linux/memory-tiers.h | 6 ++++++
>> >>> mm/memory-tiers.c | 8 ++++++++
>> >>> 3 files changed, 27 insertions(+)
>> >>>
>> >>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> >>> index 493d533f8375..27e8502548a7 100644
>> >>> --- a/drivers/base/node.c
>> >>> +++ b/drivers/base/node.c
>> >>> @@ -7,6 +7,7 @@
>> >>> #include <linux/init.h>
>> >>> #include <linux/mm.h>
>> >>> #include <linux/memory.h>
>> >>> +#include <linux/memory-tiers.h>
>> >>> #include <linux/vmstat.h>
>> >>> #include <linux/notifier.h>
>> >>> #include <linux/node.h>
>> >>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>> >>> }
>> >>> static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>> >>> +static ssize_t demotion_nodes_show(struct device *dev,
>> >>> + struct device_attribute *attr, char *buf)
>> >>> +{
>> >>> + int ret;
>> >>> + nodemask_t nmask = next_demotion_nodes(dev->id);
>> >>> +
>> >>> + ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> >>> + return ret;
>> >>> +}
>> >>> +static DEVICE_ATTR_RO(demotion_nodes);
>> >>> +
>> >>> static struct attribute *node_dev_attrs[] = {
>> >>> &dev_attr_meminfo.attr,
>> >>> &dev_attr_numastat.attr,
>> >>> &dev_attr_distance.attr,
>> >>> &dev_attr_vmstat.attr,
>> >>> + &dev_attr_demotion_nodes.attr,
>> >>> NULL
>> >>> };
>> >>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> >>> index 437441cdf78f..8eb04923f965 100644
>> >>> --- a/include/linux/memory-tiers.h
>> >>> +++ b/include/linux/memory-tiers.h
>> >>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>> >>> void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>> >>> #ifdef CONFIG_MIGRATION
>> >>> int next_demotion_node(int node);
>> >>> +nodemask_t next_demotion_nodes(int node);
>> >>> void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> >>> bool node_is_toptier(int node);
>> >>> #else
>> >>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>> >>> return NUMA_NO_NODE;
>> >>> }
>> >>> +static inline nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> + return NODE_MASK_NONE;
>> >>> +}
>> >>> +
>> >>> static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>> {
>> >>> *targets = NODE_MASK_NONE;
>> >>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> >>> index 37a4f59d9585..90047f37d98a 100644
>> >>> --- a/mm/memory-tiers.c
>> >>> +++ b/mm/memory-tiers.c
>> >>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>> rcu_read_unlock();
>> >>> }
>> >>> +nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> + if (!node_demotion)
>> >>> + return NODE_MASK_NONE;
>> >>> +
>> >>> + return node_demotion[node].preferred;
>> >>> +}
>> >>> +
>> >>> /**
>> >>> * next_demotion_node() - Get the next node in the demotion path
>> >>> * @node: The starting node to lookup the next node