Re: [PATCH v6 3/3] mm/mempolicy: Support memory hotplug in weighted interleave

linux-mm.kvack.org archive mirror

From: Rakie Kim <rakie.kim@sk.com>
To: David Hildenbrand <david@redhat.com>
Cc: gourry@gourry.net, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	joshua.hahnjy@gmail.com, dan.j.williams@intel.com,
	ying.huang@linux.alibaba.com, Jonathan.Cameron@huawei.com,
	osalvador@suse.de, kernel_team@skhynix.com, honggyu.kim@sk.com,
	yunjeong.mun@sk.com, Rakie Kim <rakie.kim@sk.com>,
	akpm@linux-foundation.org
Subject: Re: [PATCH v6 3/3] mm/mempolicy: Support memory hotplug in weighted interleave
Date: Mon,  7 Apr 2025 18:39:19 +0900
Message-ID: <20250407093926.450-1-rakie.kim@sk.com>
In-Reply-To: <198f2cbe-b1cb-4239-833e-9aac33d978fa@redhat.com>

On Fri, 4 Apr 2025 22:45:00 +0200 David Hildenbrand <david@redhat.com> wrote:
> On 04.04.25 09:46, Rakie Kim wrote:
> > The weighted interleave policy distributes page allocations across multiple
> > NUMA nodes based on their performance weight, thereby improving memory
> > bandwidth utilization. The weight values for each node are configured
> > through sysfs.
> > 
> > Previously, sysfs entries for configuring weighted interleave were created
> > for all possible nodes (N_POSSIBLE) at initialization, including nodes that
> > might not have memory. However, not all nodes in N_POSSIBLE are usable at
> > runtime, as some may remain memoryless or offline.
> > This led to sysfs entries being created for unusable nodes, causing
> > potential misconfiguration issues.
> > 
> > To address this issue, this patch modifies the sysfs creation logic to:
> > 1) Limit sysfs entries to nodes that are online and have memory, avoiding
> >     the creation of sysfs entries for nodes that cannot be used.
> > 2) Support memory hotplug by dynamically adding and removing sysfs entries
> >     based on whether a node transitions into or out of the N_MEMORY state.
> > 
> > Additionally, the patch ensures that sysfs attributes are properly managed
> > when nodes go offline, preventing stale or redundant entries from persisting
> > in the system.
> > 
> > By making these changes, the weighted interleave policy now manages its
> > sysfs entries more efficiently, ensuring that only relevant nodes are
> > considered for interleaving, and dynamically adapting to memory hotplug
> > events.
> > 
> > Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> > Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> > Signed-off-by: Yunjeong Mun <yunjeong.mun@sk.com>
> > ---
> >   mm/mempolicy.c | 109 ++++++++++++++++++++++++++++++++++++++-----------
> >   1 file changed, 86 insertions(+), 23 deletions(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 73a9405ff352..f25c2c7f8fcf 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -113,6 +113,7 @@
> >   #include <asm/tlbflush.h>
> >   #include <asm/tlb.h>
> >   #include <linux/uaccess.h>
> > +#include <linux/memory.h>
> >   
> >   #include "internal.h"
> >   
> > @@ -3390,6 +3391,7 @@ struct iw_node_attr {
> >   
> >   struct sysfs_wi_group {
> >   	struct kobject wi_kobj;
> > +	struct mutex kobj_lock;
> >   	struct iw_node_attr *nattrs[];
> >   };
> >   
> > @@ -3439,13 +3441,24 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> >   
> >   static void sysfs_wi_node_delete(int nid)
> >   {
> > -	if (!wi_group->nattrs[nid])
> > +	struct iw_node_attr *attr;
> > +
> > +	if (nid < 0 || nid >= nr_node_ids)
> > +		return;
> > +
> > +	mutex_lock(&wi_group->kobj_lock);
> > +	attr = wi_group->nattrs[nid];
> > +	if (!attr) {
> > +		mutex_unlock(&wi_group->kobj_lock);
> >   		return;
> > +	}
> > +
> > +	wi_group->nattrs[nid] = NULL;
> > +	mutex_unlock(&wi_group->kobj_lock);
> >   
> > -	sysfs_remove_file(&wi_group->wi_kobj,
> > -			  &wi_group->nattrs[nid]->kobj_attr.attr);
> > -	kfree(wi_group->nattrs[nid]->kobj_attr.attr.name);
> > -	kfree(wi_group->nattrs[nid]);
> > +	sysfs_remove_file(&wi_group->wi_kobj, &attr->kobj_attr.attr);
> > +	kfree(attr->kobj_attr.attr.name);
> > +	kfree(attr);
> >   }
> >   
> >   static void sysfs_wi_release(struct kobject *wi_kobj)
> > @@ -3464,35 +3477,80 @@ static const struct kobj_type wi_ktype = {
> >   
> >   static int sysfs_wi_node_add(int nid)
> >   {
> > -	struct iw_node_attr *node_attr;
> > +	int ret = 0;
> >   	char *name;
> > +	struct iw_node_attr *new_attr = NULL;
> >   
> > -	node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
> > -	if (!node_attr)
> > +	if (nid < 0 || nid >= nr_node_ids) {
> > +		pr_err("Invalid node id: %d\n", nid);
> > +		return -EINVAL;
> > +	}
> > +
> > +	new_attr = kzalloc(sizeof(struct iw_node_attr), GFP_KERNEL);
> > +	if (!new_attr)
> >   		return -ENOMEM;
> >   
> >   	name = kasprintf(GFP_KERNEL, "node%d", nid);
> >   	if (!name) {
> > -		kfree(node_attr);
> > +		kfree(new_attr);
> >   		return -ENOMEM;
> >   	}
> >   
> > -	sysfs_attr_init(&node_attr->kobj_attr.attr);
> > -	node_attr->kobj_attr.attr.name = name;
> > -	node_attr->kobj_attr.attr.mode = 0644;
> > -	node_attr->kobj_attr.show = node_show;
> > -	node_attr->kobj_attr.store = node_store;
> > -	node_attr->nid = nid;
> > +	mutex_lock(&wi_group->kobj_lock);
> > +	if (wi_group->nattrs[nid]) {
> > +		mutex_unlock(&wi_group->kobj_lock);
> > +		pr_info("Node [%d] already exists\n", nid);
> > +		kfree(new_attr);
> > +		kfree(name);
> > +		return 0;
> > +	}
> > +	wi_group->nattrs[nid] = new_attr;
> >   
> > -	if (sysfs_create_file(&wi_group->wi_kobj, &node_attr->kobj_attr.attr)) {
> > -		kfree(node_attr->kobj_attr.attr.name);
> > -		kfree(node_attr);
> > -		pr_err("failed to add attribute to weighted_interleave\n");
> > -		return -ENOMEM;
> > +	sysfs_attr_init(&wi_group->nattrs[nid]->kobj_attr.attr);
> > +	wi_group->nattrs[nid]->kobj_attr.attr.name = name;
> > +	wi_group->nattrs[nid]->kobj_attr.attr.mode = 0644;
> > +	wi_group->nattrs[nid]->kobj_attr.show = node_show;
> > +	wi_group->nattrs[nid]->kobj_attr.store = node_store;
> > +	wi_group->nattrs[nid]->nid = nid;
> > +
> > +	ret = sysfs_create_file(&wi_group->wi_kobj,
> > +				&wi_group->nattrs[nid]->kobj_attr.attr);
> > +	if (ret) {
> > +		kfree(wi_group->nattrs[nid]->kobj_attr.attr.name);
> > +		kfree(wi_group->nattrs[nid]);
> > +		wi_group->nattrs[nid] = NULL;
> > +		pr_err("Failed to add attribute to weighted_interleave: %d\n", ret);
> >   	}
> > +	mutex_unlock(&wi_group->kobj_lock);
> >   
> > -	wi_group->nattrs[nid] = node_attr;
> > -	return 0;
> > +	return ret;
> > +}
> > +
> > +static int wi_node_notifier(struct notifier_block *nb,
> > +			       unsigned long action, void *data)
> > +{
> > +	int err;
> > +	struct memory_notify *arg = data;
> > +	int nid = arg->status_change_nid;
> > +
> > +	if (nid < 0)
> > +		goto notifier_end;
> > +
> > +	switch(action) {
> > +	case MEM_ONLINE:
> 
> MEM_ONLINE is too late, we cannot fail hotplug at that point.
> 
> Would MEM_GOING_ONLINE / MEM_CANCEL_ONLINE be better?

Hi David,

Thank you for raising these points. I would appreciate your clarification
on the following:

Issue1: I want to invoke sysfs_wi_node_add() only after a node with memory
has fully transitioned to the online state. If the notifier handles
MEM_GOING_ONLINE (with MEM_CANCEL_ONLINE for rollback) instead of
MEM_ONLINE, is the node already considered online and usable at that
point?
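
To make the question concrete, here is how I imagine the
MEM_GOING_ONLINE / MEM_CANCEL_ONLINE variant, written as a small userspace
model rather than kernel code. The MEM_* and NOTIFY_* values are copied from
include/linux/memory.h and include/linux/notifier.h as I understand them.
MAX_NODES, wi_entry[], fail_next_add, and the stub bodies are hypothetical
stand-ins for the sysfs code in the patch, and the notifier takes the node id
directly instead of a struct memory_notify.

```c
#include <errno.h>
#include <stdbool.h>

/* Action and return codes, mirroring the kernel headers (assumed values). */
#define MEM_OFFLINE        (1 << 2)
#define MEM_GOING_ONLINE   (1 << 3)
#define MEM_CANCEL_ONLINE  (1 << 4)
#define NOTIFY_OK          0x0001
#define NOTIFY_STOP_MASK   0x8000
#define NOTIFY_BAD         (NOTIFY_STOP_MASK | 0x0002)

#define MAX_NODES 8                 /* hypothetical stand-in for nr_node_ids */

static bool wi_entry[MAX_NODES];    /* stand-in for wi_group->nattrs[] */
static bool fail_next_add;          /* test knob: simulate an -ENOMEM failure */

/* Stub: record that a sysfs entry exists for this node. */
static int sysfs_wi_node_add(int nid)
{
	if (nid < 0 || nid >= MAX_NODES)
		return -EINVAL;
	if (fail_next_add) {
		fail_next_add = false;
		return -ENOMEM;
	}
	wi_entry[nid] = true;       /* adding an existing entry is harmless */
	return 0;
}

/* Stub: drop the entry; deleting a missing entry is a no-op. */
static void sysfs_wi_node_delete(int nid)
{
	if (nid < 0 || nid >= MAX_NODES)
		return;
	wi_entry[nid] = false;
}

/* Add early, where a failure can still veto onlining; undo on cancel. */
static int wi_node_notifier(unsigned long action, int nid)
{
	if (nid < 0)
		return NOTIFY_OK;

	switch (action) {
	case MEM_GOING_ONLINE:
		if (sysfs_wi_node_add(nid))
			return NOTIFY_BAD;
		break;
	case MEM_CANCEL_ONLINE:
	case MEM_OFFLINE:
		sysfs_wi_node_delete(nid);
		break;
	}
	return NOTIFY_OK;
}
```

In this model the entry is created before onlining completes and removed
again if onlining is cancelled. What the model cannot show is the point of my
question: whether the node is already in the N_MEMORY state when
MEM_GOING_ONLINE is delivered.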

> 
> 
> > +		err = sysfs_wi_node_add(nid);
> > +		if (err) {
> > +			pr_err("failed to add sysfs [node%d]\n", nid);
> > +			return NOTIFY_BAD;
> 
> Note that NOTIFY_BAD includes NOTIFY_STOP_MASK. So you wouldn't call 
> other notifiers, but the overall memory onlining would succeed, which is 
> bad.
> 
> If we don't care about the error (not prevent hotplug) we could only 
> pr_warn() and continue. Maybe this (unlikely) case is not a good reason 
> to stop memory from getting onlined. OTOH, it will barely ever trigger 
> in practice ...
> 

Issue2: Regarding your note about NOTIFY_BAD: are you saying that if
the notifier returns NOTIFY_BAD, the NOTIFY_STOP_MASK bit it carries
prevents the remaining notifiers from running, while the memory
hotplug operation itself still completes?
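
To check my reading, I wrote a simplified userspace model of the chain walk.
The NOTIFY_* values and notifier_to_errno() follow include/linux/notifier.h
as I understand it. call_chain(), failing_notifier(), later_notifier(), and
later_calls are hypothetical names for this sketch only, and the real chain
walk does more than this loop.

```c
/* Return codes, mirroring include/linux/notifier.h (assumed values). */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_BAD       (NOTIFY_STOP_MASK | 0x0002)

typedef int (*notifier_fn)(unsigned long action);

static int later_calls;     /* how often the second notifier ran */

/* Hypothetical notifier that fails, like the v6 MEM_ONLINE error path. */
static int failing_notifier(unsigned long action)
{
	(void)action;
	return NOTIFY_BAD;
}

/* Hypothetical notifier registered after the failing one. */
static int later_notifier(unsigned long action)
{
	(void)action;
	later_calls++;
	return NOTIFY_OK;
}

/* Simplified chain walk: stop as soon as a callback sets the stop bit. */
static int call_chain(notifier_fn *chain, int n, unsigned long action)
{
	int ret = NOTIFY_DONE;

	for (int i = 0; i < n; i++) {
		ret = chain[i](action);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

/* Same conversion the kernel helper performs on a chain result. */
static int notifier_to_errno(int ret)
{
	ret &= ~NOTIFY_STOP_MASK;
	return ret > NOTIFY_OK ? NOTIFY_OK - ret : 0;
}
```

With the chain { failing_notifier, later_notifier }, the second callback is
never invoked and the chain result converts to a negative errno. That errno
only matters if the caller checks it. As far as I can tell, the hotplug core
checks it for MEM_GOING_ONLINE but not for MEM_ONLINE, so at MEM_ONLINE the
stop bit only cuts the chain short and the onlining still succeeds.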

If so, then I am thinking of resolving both issues as follows:
- For Issue1: keep using MEM_ONLINE, assuming it is safe and
  sufficient to ensure the node is fully online.
- For Issue2: do not return NOTIFY_BAD from the notifier.
  Instead, log the error with pr_err() and continue the operation.

This would result in the following code:

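	static int wi_node_notifier(struct notifier_block *nb,
				    unsigned long action, void *data)
	{
		int err;
		struct memory_notify *arg = data;
		int nid = arg->status_change_nid;

		/* No node changes its memory state in this event. */
		if (nid < 0)
			return NOTIFY_OK;

		switch (action) {
		case MEM_ONLINE:	/* Issue1: keeping this unchanged */
			err = sysfs_wi_node_add(nid);
			if (err) {
				pr_err("failed to add sysfs [node%d]\n", nid);
				/* Issue2: do not return NOTIFY_BAD */
			}
			break;
		case MEM_OFFLINE:
			sysfs_wi_node_delete(nid);
			break;
		}

		/* Always return NOTIFY_OK */
		return NOTIFY_OK;
	}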

Please let me know if this approach is acceptable.

Rakie

> -- 
> Cheers,
> 
> David / dhildenb
>
