[PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
@ 2025-02-26 21:35 ` Joshua Hahn
  2025-02-26 21:35   ` [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes Joshua Hahn
                     ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Joshua Hahn @ 2025-02-26 21:35 UTC (permalink / raw)
  To: gourry, harry.yoo, ying.huang
  Cc: honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team

On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across nodes according to user-set ratios.

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

Previously, weighted interleave's default weights were just 1s -- which
would be equivalent to the (unweighted) interleave mempolicy, which goes
through the nodes in a round-robin fashion, ignoring bandwidth information.

This patch has two main goals:
First, it makes weighted interleave easier to use for users who wish to
relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
By providing a set of "real" default weights that just work out of the
box, users who might not have the capability (or wish to) perform
experimentation to find the most optimal weights for their system can
still take advantage of bandwidth-informed weighted interleave.

Second, it allows for weighted interleave to dynamically adjust to
hotplugged memory with new bandwidth information. Instead of manually
updating node weights every time new bandwidth information is reported
or taken off, weighted interleave adjusts and provides a new set of
default weights for weighted interleave to use when there is a change
in bandwidth information.

To meet these goals, this patch introduces an auto-configuration mode
for the interleave weights that provides a reasonable set of default
weights, calculated using bandwidth data reported by the system. In auto
mode, weights are dynamically adjusted based on whatever the current
bandwidth information reports (and responds to hotplug events).

This patch still supports users manually writing weights into the nodeN
sysfs interface by entering into manual mode. When a user enters manual
mode, the system stops dynamically updating any of the node weights,
even during hotplug events that shift the optimal weight distribution.

A new sysfs interface "auto" is introduced, which allows users to switch
between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
system also automatically enters manual mode when a nodeN interface is
manually written to.

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface was said to reset the weight to the system default. Before
this patch, the default for all weights were 1, which meant that writing
0 and 1 were functionally equivalent.

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
Changelog
v6:
- iw_weights and mode_auto are combined into one rcu-protected struct.
- Protection against memoryless nodes, as suggested by Oscar Salvador
- Wordsmithing (documentation, commit message and comments), as suggested
  by Andrew Morton.
- Removed unnecessary #include statement in hmat.c, as pointed out by
  Harry (Hyeonggon) Yoo and Ying Huang.
- Bandwidth values changed from u64_t to unsigned int, as pointed out by
  Ying Huang and Dan Carpenter.
- RCU optimizations, as suggested by Ying Huang.
- A second patch is included to fix unintended behavior that creates a
  weight knob for memoryless nodes as well.
- Sysfs show/store functions use str_true_false & kstrtobool.
- Fix a build error in 32-bit systems, which are unable to perform
  64-bit division by casting 64-bit values to 32-bit, if under the range.

v5:
- I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
  this patch. Added to the subject now!
- Wordsmithing, correcting typos, and re-naming variables for clarity.
- No functional changes.

v4:
- Renamed the mode interface to the "auto" interface, which now only
  emits either 'Y' or 'N'. Users can now interact with it by
  writing 'Y', '1', 'N', or '0' to it.
- Added additional documentation to the nodeN sysfs interface.
- Makes sure iw_table locks are properly held.
- Removed unlikely() call in reduce_interleave_weights.
- Wordsmithing

v3:
- Weightiness (max_node_weight) is now fixed to 32.
- Instead, the sysfs interface now exposes a "mode" parameter, which
  can either be "auto" or "manual".
  - Thank you Hyeonggon and Honggyu for the feedback.
- Documentation updated to reflect new sysfs interface, explicitly
  specifies that 0 is invalid.
  - Thank you Gregory and Ying for the discussion on how best to
    handle the 0 case.
- Re-worked nodeN sysfs store to handle auto --> manual shifts
- mempolicy_set_node_perf internally handles the auto / manual
  case differently now. bw is always updated, iw updates depend on
  what mode the user is in.
- Wordsmithing comments for clarity.
- Removed RFC tag.

v2:
- Name of the interface is changed: "max_node_weight" --> "weightiness"
- Default interleave weight table no longer exists. Rather, the
  interleave weight table is initialized with the defaults, if bandwidth
  information is available.
  - In addition, all sections that handle iw_table have been changed
    to reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
  support for both 32-bit and 64-bit machines
- sysfs initialization cleanup
- Documentation has been rewritten to explicitly outline expected
  behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability
- Thank you Gregory and Hyeonggon for your review & feedback!

 ...fs-kernel-mm-mempolicy-weighted-interleave |  34 +-
 drivers/base/node.c                           |   9 +
 include/linux/mempolicy.h                     |   9 +
 mm/mempolicy.c                                | 323 +++++++++++++++---
 4 files changed, 322 insertions(+), 53 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..862b19943a85 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -20,6 +20,34 @@ Description:	Weight configuration interface for nodeN
 		Minimum weight: 1
 		Maximum weight: 255
 
-		Writing an empty string or `0` will reset the weight to the
-		system default. The system default may be set by the kernel
-		or drivers at boot or during hotplug events.
+		Writing invalid values (i.e. any values not in [1,255],
+		empty string, ...) will return -EINVAL.
+
+		Changing the weight to a valid value will automatically
+		update the system to manual mode as well.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
+Date:		February 2025
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Auto-weighting configuration interface
+
+		Configuration mode for weighted interleave. A 'Y' indicates
+		that the system is in auto mode, and a 'N' indicates that
+		the system is in manual mode. All other values are invalid.
+
+		In auto mode, all node weights are re-calculated and overwritten
+		(visible via the nodeN interfaces) whenever new bandwidth data
+		is made available during either boot or hotplug events.
+
+		In manual mode, node weights can only be updated by the user.
+		Note that nodes that are onlined with previously set weights
+		will inherit those weights. If they were not previously set or
+		are onlined with missing bandwidth data, the weights will use
+		a default weight of 1.
+
+		Writing Y or 1 to the interface will enable auto mode, while
+		writing N or 0 will enable manual mode. All other strings will
+		be ignored, and -EINVAL will be returned.
+
+		Writing a new weight to a node directly via the nodeN interface
+		will also automatically update the system to manual mode.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..f3c01fb90db1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord)) {
+			pr_info("failed to set mempolicy attrs for node %d\n",
+				nid);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ce9885e0178a..78f1299bdd42 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
+#include <linux/node.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
@@ -56,6 +57,11 @@ struct mempolicy {
 	} w;
 };
 
+struct weighted_interleave_state {
+	bool mode_auto;
+	u8 iw_table[]; /* A null iw_table is interpreted as an array of 1s. */
+};
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
@@ -178,6 +184,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
 
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else
 
 struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bbaadbeeb291..4cc04ff8f12c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/gcd.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -139,31 +140,151 @@ static struct mempolicy default_policy = {
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
 /*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
- *
- * iw_table is RCU protected
+ * weightiness balances the tradeoff between small weights (cycles through nodes
+ * faster, more fair/even distribution) and large weights (smaller errors
+ * between actual bandwidth ratios and weight ratios). 32 is a number that has
+ * been found to perform at a reasonable compromise between the two goals.
+ */
+static const int weightiness = 32;
+
+/* wi_state is RCU protected */
+static struct weighted_interleave_state __rcu *wi_state;
+static unsigned int *node_bw_table;
+
+/*
+ * iw_table_lock protects both wi_state and node_bw_table.
+ * node_bw_table is only used by writers to update wi_state.
  */
-static u8 __rcu *iw_table;
 static DEFINE_MUTEX(iw_table_lock);
 
 static u8 get_il_weight(int node)
 {
-	u8 *table;
-	u8 weight;
+	u8 weight = 1;
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
-	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
+	if (rcu_access_pointer(wi_state))
+		weight = rcu_dereference(wi_state)->iw_table[node];
 	rcu_read_unlock();
+
 	return weight;
 }
 
+/*
+ * Convert bandwidth values into weighted interleave weights.
+ * Call with iw_table_lock.
+ */
+static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
+{
+	u64 sum_bw = 0;
+	unsigned int cast_sum_bw, sum_iw = 0;
+	unsigned int scaling_factor = 1, iw_gcd = 1;
+	int nid;
+
+	/* Recalculate the bandwidth distribution given the new info */
+	for_each_node_state(nid, N_MEMORY)
+		sum_bw += bw[nid];
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		/* Set memoryless nodes' weights to 1 to prevent div/0 later */
+		if (!node_state(nid, N_MEMORY)) {
+			new_iw[nid] = 1;
+			continue;
+		}
+
+		scaling_factor = 100 * bw[nid];
+
+		/*
+		 * Try not to perform 64-bit division.
+		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
+		 * If sum_bw > scaling_factor, then bw[nid] is less than
+		 * 1% of the total bandwidth. Round up to 1%.
+		 */
+		if (bw[nid] && sum_bw < scaling_factor) {
+			cast_sum_bw = (unsigned int)sum_bw;
+			new_iw[nid] = scaling_factor / cast_sum_bw;
+		} else {
+			new_iw[nid] = 1;
+		}
+		sum_iw += new_iw[nid];
+	}
+
+	/*
+	 * Scale each node's share of the total bandwidth from percentages
+	 * to whole numbers in the range [1, weightiness]
+	 */
+	for_each_node_state(nid, N_MEMORY) {
+		scaling_factor = weightiness * new_iw[nid];
+		new_iw[nid] = max(scaling_factor / sum_iw, 1);
+		if (nid == 0)
+			iw_gcd = new_iw[0];
+		iw_gcd = gcd(iw_gcd, new_iw[nid]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
+	for_each_node_state(nid, N_MEMORY)
+		new_iw[nid] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
+	unsigned int *old_bw, *new_bw;
+	unsigned int bw_val;
+
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+
+	/*
+	 * Update bandwidth info, even in manual mode. That way, when switching
+	 * to auto mode in the future, iw_table can be overwritten using
+	 * accurate bw data.
+	 */
+	mutex_lock(&iw_table_lock);
+
+	old_bw = node_bw_table;
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids * sizeof(unsigned int));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	/* wi_state not initialized yet; assume auto == true */
+	if (!rcu_access_pointer(wi_state))
+		goto reduce;
+
+	old_wi_state = rcu_dereference_protected(wi_state,
+					lockdep_is_held(&iw_table_lock));
+	if (old_wi_state->mode_auto)
+		goto reduce;
+
+	mutex_unlock(&iw_table_lock);
+	kfree(new_wi_state);
+	kfree(old_bw);
+	return 0;
+
+reduce:
+	new_wi_state->mode_auto = true;
+	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
+
+	rcu_assign_pointer(wi_state, new_wi_state);
+	mutex_unlock(&iw_table_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
+	kfree(old_bw);
+
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -1988,34 +2109,33 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	u8 *table;
 	unsigned int weight_total = 0;
 	u8 weight;
-	int nid;
+	int nid = 0;
 
 	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
 	if (!nr_nodes)
 		return numa_node_id();
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
+	if (!rcu_access_pointer(wi_state))
+		goto out;
+
+	table = rcu_dereference(wi_state)->iw_table;
 	/* calculate the total weight */
-	for_each_node_mask(nid, nodemask) {
-		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
-		weight_total += weight;
-	}
+	for_each_node_mask(nid, nodemask)
+		weight_total += table ? table[nid] : 1;
 
 	/* Calculate the node offset based on totals */
 	target = ilx % weight_total;
 	nid = first_node(nodemask);
 	while (target) {
 		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
+		weight = table[nid];
 		if (target < weight)
 			break;
 		target -= weight;
 		nid = next_node_in(nid, nodemask);
 	}
+out:
 	rcu_read_unlock();
 	return nid;
 }
@@ -2411,13 +2531,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
 {
+	struct weighted_interleave_state *state;
 	struct task_struct *me = current;
 	unsigned int cpuset_mems_cookie;
 	unsigned long total_allocated = 0;
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ -2467,17 +2588,19 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 		return total_allocated;
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
+	if (rcu_access_pointer(wi_state)) {
+		state = rcu_dereference(wi_state);
+		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
+		rcu_read_unlock();
+	} else {
+		rcu_read_unlock();
+		for (i = 0; i < nr_node_ids; i++)
+			weights[i] = 1;
+	}
 
 	/* calculate total, detect system default usage */
-	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+	for_each_node_mask(node, nodes)
 		weight_total += weights[node];
-	}
 
 	/*
 	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
@@ -3402,36 +3525,113 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
 static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 			  const char *buf, size_t count)
 {
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
 	struct iw_node_attr *node_attr;
-	u8 *new;
-	u8 *old;
 	u8 weight = 0;
+	int i;
 
 	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
 	if (count == 0 || sysfs_streq(buf, ""))
 		weight = 0;
-	else if (kstrtou8(buf, 0, &weight))
+	else if (kstrtou8(buf, 0, &weight) || weight == 0)
 		return -EINVAL;
 
-	new = kzalloc(nr_node_ids, GFP_KERNEL);
-	if (!new)
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state)
 		return -ENOMEM;
 
 	mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
+	if (rcu_access_pointer(wi_state)) {
+		old_wi_state = rcu_dereference_protected(wi_state,
 					lockdep_is_held(&iw_table_lock));
-	if (old)
-		memcpy(new, old, nr_node_ids);
-	new[node_attr->nid] = weight;
-	rcu_assign_pointer(iw_table, new);
+		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+					nr_node_ids * sizeof(u8));
+	} else {
+		for (i = 0; i < nr_node_ids; i++)
+			new_wi_state->iw_table[i] = 1;
+	}
+	new_wi_state->iw_table[node_attr->nid] = weight;
+	new_wi_state->mode_auto = false;
+
+	rcu_assign_pointer(wi_state, new_wi_state);
 	mutex_unlock(&iw_table_lock);
-	synchronize_rcu();
-	kfree(old);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
 	return count;
 }
 
 static struct iw_node_attr **node_attrs;
 
+static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	bool wi_auto = true;
+
+	rcu_read_lock();
+	if (rcu_access_pointer(wi_state))
+		wi_auto = rcu_dereference(wi_state)->mode_auto;
+	rcu_read_unlock();
+
+	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
+}
+
+static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
+	unsigned int *bw;
+	bool input;
+	int i;
+
+	if (kstrtobool(buf, &input))
+		return -EINVAL;
+
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state)
+		return -ENOMEM;
+	mutex_lock(&iw_table_lock);
+
+	if (!input) {
+		if (rcu_access_pointer(wi_state)) {
+			old_wi_state = rcu_dereference_protected(wi_state,
+					lockdep_is_held(&iw_table_lock));
+			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+					nr_node_ids * sizeof(u8));
+		} else {
+			for (i = 0; i < nr_node_ids; i++)
+				new_wi_state->iw_table[i] = 1;
+		}
+		goto update_wi_state;
+	}
+
+	bw = node_bw_table;
+	if (!bw) {
+		mutex_unlock(&iw_table_lock);
+		kfree(new_wi_state);
+		return -ENODEV;
+	}
+
+	new_wi_state->mode_auto = true;
+	reduce_interleave_weights(bw, new_wi_state->iw_table);
+
+update_wi_state:
+	rcu_assign_pointer(wi_state, new_wi_state);
+	mutex_unlock(&iw_table_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
+	return count;
+}
+
+static struct kobj_attribute wi_attr =
+	__ATTR(auto, 0664, weighted_interleave_auto_show,
+			   weighted_interleave_auto_store);
+
 static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
 				  struct kobject *parent)
 {
@@ -3489,6 +3689,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
 	return 0;
 }
 
+static struct attribute *wi_default_attrs[] = {
+	&wi_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wi_attr_group = {
+	.attrs = wi_default_attrs,
+};
+
 static int add_weighted_interleave_group(struct kobject *root_kobj)
 {
 	struct kobject *wi_kobj;
@@ -3505,6 +3714,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
+	err = sysfs_create_group(wi_kobj, &wi_attr_group);
+	if (err) {
+		pr_err("failed to add sysfs [auto]\n");
+		kobject_put(wi_kobj);
+		return err;
+	}
+
 	for_each_node_state(nid, N_POSSIBLE) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {
@@ -3519,15 +3735,22 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
 
 static void mempolicy_kobj_release(struct kobject *kobj)
 {
-	u8 *old;
+	struct weighted_interleave_state *old_wi_state;
 
 	mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
-					lockdep_is_held(&iw_table_lock));
-	rcu_assign_pointer(iw_table, NULL);
+	if (!rcu_access_pointer(wi_state)) {
+		mutex_unlock(&iw_table_lock);
+		goto out;
+	}
+
+	old_wi_state = rcu_dereference_protected(wi_state,
+			lockdep_is_held(&iw_table_lock));
+
+	rcu_assign_pointer(wi_state, NULL);
 	mutex_unlock(&iw_table_lock);
 	synchronize_rcu();
-	kfree(old);
+	kfree(old_wi_state);
+out:
 	kfree(node_attrs);
 	kfree(kobj);
 }
-- 
2.43.5


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-02-26 21:35 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
@ 2025-02-26 21:35   ` Joshua Hahn
  2025-02-27  2:32     ` Honggyu Kim
  2025-02-28  0:16   ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning yunjeong.mun
  2025-02-28  6:39   ` Yunjeong Mun
  2 siblings, 1 reply; 34+ messages in thread
From: Joshua Hahn @ 2025-02-26 21:35 UTC (permalink / raw)
  To: gourry, harry.yoo, ying.huang
  Cc: honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team

We should never try to allocate memory from a memoryless node. Creating a
sysfs knob to control its weighted interleave weight does not make sense,
and can be unsafe.

Only create weighted interleave weight knobs for nodes with memory.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/mempolicy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4cc04ff8f12c..50cbb7c047fa 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
-	for_each_node_state(nid, N_POSSIBLE) {
+	for_each_node_state(nid, N_MEMORY) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {
 			pr_err("failed to add sysfs [node%d]\n", nid);
-- 
2.43.5


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-02-26 21:35   ` [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes Joshua Hahn
@ 2025-02-27  2:32     ` Honggyu Kim
  2025-02-27  3:20       ` Honggyu Kim
                         ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Honggyu Kim @ 2025-02-27  2:32 UTC (permalink / raw)
  To: Joshua Hahn, gourry, harry.yoo, ying.huang
  Cc: kernel_team, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	yunjeong.mun

Hi Joshua,

On 2/27/2025 6:35 AM, Joshua Hahn wrote:
> We should never try to allocate memory from a memoryless node. Creating a
> sysfs knob to control its weighted interleave weight does not make sense,
> and can be unsafe.
> 
> Only create weighted interleave weight knobs for nodes with memory.
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
>   mm/mempolicy.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4cc04ff8f12c..50cbb7c047fa 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>   		return err;
>   	}
>   
> -	for_each_node_state(nid, N_POSSIBLE) {

Actually, we're aware of this issue and currently trying to fix this.
In our system, we've attached 4ch of CXL memory for each socket as
follows.

         node0             node1
       +-------+   UPI   +-------+
       | CPU 0 |-+-----+-| CPU 1 |
       +-------+         +-------+
       | DRAM0 |         | DRAM1 |
       +---+---+         +---+---+
           |                 |
       +---+---+         +---+---+
       | CXL 0 |         | CXL 4 |
       +---+---+         +---+---+
       | CXL 1 |         | CXL 5 |
       +---+---+         +---+---+
       | CXL 2 |         | CXL 6 |
       +---+---+         +---+---+
       | CXL 3 |         | CXL 7 |
       +---+---+         +---+---+
         node2             node3

The 4ch of CXL memory are detected as a single NUMA node in each socket,
but it shows as follows with the current N_POSSIBLE loop.

$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
node0 node1 node2 node3 node4 node5
node6 node7 node8 node9 node10 node11

> +	for_each_node_state(nid, N_MEMORY) {

But using N_MEMORY doesn't fix this problem and it hides the entire CXL
memory nodes in our system because the CXL memory isn't detected at this
point of creating node*.  Maybe there is some difference when multiple
CXL memory is detected as a single node.

We have to create more nodes when CXL memory is detected later.  In 
addition, this part can be changed to "for_each_online_node(nid)"
although N_MEMORY is also fine here.

We've internally fixed it using a memory hotpluging callback so we can
upload another working version later.

Do you mind if we continue fixing this work?

Thanks,
Honggyu

>   		err = add_weight_node(nid, wi_kobj);
>   		if (err) {
>   			pr_err("failed to add sysfs [node%d]\n", nid);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-02-27  2:32     ` Honggyu Kim
@ 2025-02-27  3:20       ` Honggyu Kim
  2025-03-03 21:56         ` Joshua Hahn
  2025-03-03 16:19       ` Gregory Price
  2025-03-04 16:29       ` Gregory Price
  2 siblings, 1 reply; 34+ messages in thread
From: Honggyu Kim @ 2025-02-27  3:20 UTC (permalink / raw)
  To: Joshua Hahn, gourry, harry.yoo, ying.huang
  Cc: kernel_team, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	yunjeong.mun



On 2/27/2025 11:32 AM, Honggyu Kim wrote:
> Hi Joshua,
> 
> On 2/27/2025 6:35 AM, Joshua Hahn wrote:
>> We should never try to allocate memory from a memoryless node. Creating a
>> sysfs knob to control its weighted interleave weight does not make sense,
>> and can be unsafe.
>>
>> Only create weighted interleave weight knobs for nodes with memory.
>>
>> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>> ---
>>   mm/mempolicy.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 4cc04ff8f12c..50cbb7c047fa 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct 
>> kobject *root_kobj)
>>           return err;
>>       }
>> -    for_each_node_state(nid, N_POSSIBLE) {
> 
> Actually, we're aware of this issue and currently trying to fix this.
> In our system, we've attached 4ch of CXL memory for each socket as
> follows.
> 
>          node0             node1
>        +-------+   UPI   +-------+
>        | CPU 0 |-+-----+-| CPU 1 |
>        +-------+         +-------+
>        | DRAM0 |         | DRAM1 |
>        +---+---+         +---+---+
>            |                 |
>        +---+---+         +---+---+
>        | CXL 0 |         | CXL 4 |
>        +---+---+         +---+---+
>        | CXL 1 |         | CXL 5 |
>        +---+---+         +---+---+
>        | CXL 2 |         | CXL 6 |
>        +---+---+         +---+---+
>        | CXL 3 |         | CXL 7 |
>        +---+---+         +---+---+
>          node2             node3
> 
> The 4ch of CXL memory are detected as a single NUMA node in each socket,
> but it shows as follows with the current N_POSSIBLE loop.
> 
> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
> node0 node1 node2 node3 node4 node5
> node6 node7 node8 node9 node10 node11
> 
>> +    for_each_node_state(nid, N_MEMORY) {

Thinking it again, we can leave it as a separate patch but add our patch 
on top of it.

The only concern I have is having only N_MEMORY patch hides weight
setting knobs for CXL memory and it makes there is no way to set weight 
values to CXL memory in my system.

IMHO, this and our patch is better to be submitted together.

Thanks,
Honggyu

> 
> But using N_MEMORY doesn't fix this problem and it hides the entire CXL
> memory nodes in our system because the CXL memory isn't detected at this
> point of creating node*.  Maybe there is some difference when multiple
> CXL memory is detected as a single node.
> 
> We have to create more nodes when CXL memory is detected later.  In 
> addition, this part can be changed to "for_each_online_node(nid)"
> although N_MEMORY is also fine here.
> 
> We've internally fixed it using a memory hotpluging callback so we can
> upload another working version later.
> 
> Do you mind if we continue fixing this work?
> 
> Thanks,
> Honggyu
> 
>>           err = add_weight_node(nid, wi_kobj);
>>           if (err) {
>>               pr_err("failed to add sysfs [node%d]\n", nid);
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-02-26 21:35 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
  2025-02-26 21:35   ` [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes Joshua Hahn
@ 2025-02-28  0:16   ` yunjeong.mun
  2025-02-28  6:39   ` Yunjeong Mun
  2 siblings, 0 replies; 34+ messages in thread
From: yunjeong.mun @ 2025-02-28  0:16 UTC (permalink / raw)
  To: yunjeong.mun, Joshua Hahn
  Cc: honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	kernel_team

On Wed, 26 Feb 2025 13:35:17 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.
> 
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
> 
> Previously, weighted interleave's default weights were just 1s -- which
> would be equivalent to the (unweighted) interleave mempolicy, which goes
> through the nodes in a round-robin fashion, ignoring bandwidth information.
> 
> This patch has two main goals:
> First, it makes weighted interleave easier to use for users who wish to
> relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
> By providing a set of "real" default weights that just work out of the
> box, users who might not have the capability (or wish to) perform
> experimentation to find the most optimal weights for their system can
> still take advantage of bandwidth-informed weighted interleave.
> 
> Second, it allows for weighted interleave to dynamically adjust to
> hotplugged memory with new bandwidth information. Instead of manually
> updating node weights every time new bandwidth information is reported
> or taken off, weighted interleave adjusts and provides a new set of
> default weights for weighted interleave to use when there is a change
> in bandwidth information.
> 
> To meet these goals, this patch introduces an auto-configuration mode
> for the interleave weights that provides a reasonable set of default
> weights, calculated using bandwidth data reported by the system. In auto
> mode, weights are dynamically adjusted based on whatever the current
> bandwidth information reports (and responds to hotplug events).
> 
> This patch still supports users manually writing weights into the nodeN
> sysfs interface by entering into manual mode. When a user enters manual
> mode, the system stops dynamically updating any of the node weights,
> even during hotplug events that shift the optimal weight distribution.
> 
> A new sysfs interface "auto" is introduced, which allows users to switch
> between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
> system also automatically enters manual mode when a nodeN interface is
> manually written to.
> 
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights were 1, which meant that writing
> 0 and 1 were functionally equivalent.
> 
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---
> Changelog
> v6:
> - iw_weights and mode_auto are combined into one rcu-protected struct.
> - Protection against memoryless nodes, as suggested by Oscar Salvador
> - Wordsmithing (documentation, commit message and comments), as suggested
>   by Andrew Morton.
> - Removed unnecessary #include statement in hmat.c, as pointed out by
>   Harry (Hyeonggon) Yoo and Ying Huang.
> - Bandwidth values changed from u64_t to unsigned int, as pointed out by
>   Ying Huang and Dan Carpenter.
> - RCU optimizations, as suggested by Ying Huang.
> - A second patch is included to fix unintended behavior that creates a
>   weight knob for memoryless nodes as well.
> - Sysfs show/store functions use str_true_false & kstrtobool.
> - Fix a build error in 32-bit systems, which are unable to perform
>   64-bit division by casting 64-bit values to 32-bit, if under the range.
> 
> v5:
> - I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
>   this patch. Added to the subject now!
> - Wordsmithing, correcting typos, and re-naming variables for clarity.
> - No functional changes.
> 
> v4:
> - Renamed the mode interface to the "auto" interface, which now only
>   emits either 'Y' or 'N'. Users can now interact with it by
>   writing 'Y', '1', 'N', or '0' to it.
> - Added additional documentation to the nodeN sysfs interface.
> - Makes sure iw_table locks are properly held.
> - Removed unlikely() call in reduce_interleave_weights.
> - Wordsmithing
> 
> v3:
> - Weightiness (max_node_weight) is now fixed to 32.
> - Instead, the sysfs interface now exposes a "mode" parameter, which
>   can either be "auto" or "manual".
>   - Thank you Hyeonggon and Honggyu for the feedback.
> - Documentation updated to reflect new sysfs interface, explicitly
>   specifies that 0 is invalid.
>   - Thank you Gregory and Ying for the discussion on how best to
>     handle the 0 case.
> - Re-worked nodeN sysfs store to handle auto --> manual shifts
> - mempolicy_set_node_perf internally handles the auto / manual
>   case differently now. bw is always updated, iw updates depend on
>   what mode the user is in.
> - Wordsmithing comments for clarity.
> - Removed RFC tag.
> 
> v2:
> - Name of the interface is changed: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>   interleave weight table is initialized with the defaults, if bandwidth
>   information is available.
>   - In addition, all sections that handle iw_table have been changed
>     to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>   support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>   behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
> 
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  34 +-
>  drivers/base/node.c                           |   9 +
>  include/linux/mempolicy.h                     |   9 +
>  mm/mempolicy.c                                | 323 +++++++++++++++---
>  4 files changed, 322 insertions(+), 53 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..862b19943a85 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,34 @@ Description:	Weight configuration interface for nodeN
>  		Minimum weight: 1
>  		Maximum weight: 255
>  
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +		Changing the weight to a valid value will automatically
> +		update the system to manual mode as well.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> +Date:		February 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration mode for weighted interleave. A 'Y' indicates
> +		that the system is in auto mode, and a 'N' indicates that
> +		the system is in manual mode. All other values are invalid.
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available during either boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		Note that nodes that are onlined with previously set weights
> +		will inherit those weights. If they were not previously set or
> +		are onlined with missing bandwidth data, the weights will use
> +		a default weight of 1.
> +
> +		Writing Y or 1 to the interface will enable auto mode, while
> +		writing N or 0 will enable manual mode. All other strings will
> +		be ignored, and -EINVAL will be returned.
> +
> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically update the system to manual mode.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..f3c01fb90db1 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/vmstat.h>
>  #include <linux/notifier.h>
>  #include <linux/node.h>
> @@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>  			break;
>  		}
>  	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord)) {
> +			pr_info("failed to set mempolicy attrs for node %d\n",
> +				nid);
> +		}
> +	}
>  }
>  EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>  
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ce9885e0178a..78f1299bdd42 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
>  #include <linux/slab.h>
>  #include <linux/rbtree.h>
>  #include <linux/spinlock.h>
> +#include <linux/node.h>
>  #include <linux/nodemask.h>
>  #include <linux/pagemap.h>
>  #include <uapi/linux/mempolicy.h>
> @@ -56,6 +57,11 @@ struct mempolicy {
>  	} w;
>  };
>  
> +struct weighted_interleave_state {
> +	bool mode_auto;
> +	u8 iw_table[]; /* A null iw_table is interpreted as an array of 1s. */
> +};
> +
>  /*
>   * Support for managing mempolicy data objects (clone, copy, destroy)
>   * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
> @@ -178,6 +184,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>  
>  extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>  
> +extern int mempolicy_set_node_perf(unsigned int node,
> +				   struct access_coordinate *coords);
> +
>  #else
>  
>  struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index bbaadbeeb291..4cc04ff8f12c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
>  #include <linux/swapops.h>
> +#include <linux/gcd.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
> @@ -139,31 +140,151 @@ static struct mempolicy default_policy = {
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>  
>  /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> - * iw_table is RCU protected
> + * weightiness balances the tradeoff between small weights (cycles through nodes
> + * faster, more fair/even distribution) and large weights (smaller errors
> + * between actual bandwidth ratios and weight ratios). 32 is a number that has
> + * been found to perform at a reasonable compromise between the two goals.
> + */
> +static const int weightiness = 32;
> +
> +/* wi_state is RCU protected */
> +static struct weighted_interleave_state __rcu *wi_state;
> +static unsigned int *node_bw_table;
> +
> +/*
> + * iw_table_lock protects both wi_state and node_bw_table.
> + * node_bw_table is only used by writers to update wi_state.
>   */
> -static u8 __rcu *iw_table;
>  static DEFINE_MUTEX(iw_table_lock);
>  
>  static u8 get_il_weight(int node)
>  {
> -	u8 *table;
> -	u8 weight;
> +	u8 weight = 1;
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	/* if no iw_table, use system default */
> -	weight = table ? table[node] : 1;
> -	/* if value in iw_table is 0, use system default */
> -	weight = weight ? weight : 1;
> +	if (rcu_access_pointer(wi_state))
> +		weight = rcu_dereference(wi_state)->iw_table[node];
>  	rcu_read_unlock();
> +
>  	return weight;
>  }
>  
> +/*
> + * Convert bandwidth values into weighted interleave weights.
> + * Call with iw_table_lock.
> + */
> +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> +{
> +	u64 sum_bw = 0;
> +	unsigned int cast_sum_bw, sum_iw = 0;
> +	unsigned int scaling_factor = 1, iw_gcd = 1;
> +	int nid;
> +
> +	/* Recalculate the bandwidth distribution given the new info */
> +	for_each_node_state(nid, N_MEMORY)
> +		sum_bw += bw[nid];
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		/* Set memoryless nodes' weights to 1 to prevent div/0 later */
> +		if (!node_state(nid, N_MEMORY)) {
> +			new_iw[nid] = 1;
> +			continue;
> +		}
> +
> +		scaling_factor = 100 * bw[nid];
> +
> +		/*
> +		 * Try not to perform 64-bit division.
> +		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> +		 * If sum_bw > scaling_factor, then bw[nid] is less than
> +		 * 1% of the total bandwidth. Round up to 1%.
> +		 */
> +		if (bw[nid] && sum_bw < scaling_factor) {
> +			cast_sum_bw = (unsigned int)sum_bw;
> +			new_iw[nid] = scaling_factor / cast_sum_bw;
> +		} else {
> +			new_iw[nid] = 1;
> +		}
> +		sum_iw += new_iw[nid];
> +	}
> +
> +	/*
> +	 * Scale each node's share of the total bandwidth from percentages
> +	 * to whole numbers in the range [1, weightiness]
> +	 */
> +	for_each_node_state(nid, N_MEMORY) {
> +		scaling_factor = weightiness * new_iw[nid];
> +		new_iw[nid] = max(scaling_factor / sum_iw, 1);
> +		if (nid == 0)
> +			iw_gcd = new_iw[0];
> +		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> +	}
> +
> +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> +	for_each_node_state(nid, N_MEMORY)
> +		new_iw[nid] /= iw_gcd;
> +}
> +
> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *old_bw, *new_bw;
> +	unsigned int bw_val;
> +
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +
> +	/*
> +	 * Update bandwidth info, even in manual mode. That way, when switching
> +	 * to auto mode in the future, iw_table can be overwritten using
> +	 * accurate bw data.
> +	 */
> +	mutex_lock(&iw_table_lock);
> +
> +	old_bw = node_bw_table;
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(unsigned int));
> +	new_bw[node] = bw_val;
> +	node_bw_table = new_bw;
> +
> +	/* wi_state not initialized yet; assume auto == true */
> +	if (!rcu_access_pointer(wi_state))
> +		goto reduce;
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&iw_table_lock));
> +	if (old_wi_state->mode_auto)
> +		goto reduce;
> +
> +	mutex_unlock(&iw_table_lock);
> +	kfree(new_wi_state);
> +	kfree(old_bw);
> +	return 0;
> +
> +reduce:
> +	new_wi_state->mode_auto = true;
> +	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
> +
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&iw_table_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +	kfree(old_bw);
> +
> +	return 0;
> +}
> +
>  /**
>   * numa_nearest_node - Find nearest node by state
>   * @node: Node id to start the search
> @@ -1988,34 +2109,33 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  	u8 *table;
>  	unsigned int weight_total = 0;
>  	u8 weight;
> -	int nid;
> +	int nid = 0;
>  
>  	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
>  	if (!nr_nodes)
>  		return numa_node_id();
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> +	if (!rcu_access_pointer(wi_state))
> +		goto out;
> +
> +	table = rcu_dereference(wi_state)->iw_table;
>  	/* calculate the total weight */
> -	for_each_node_mask(nid, nodemask) {
> -		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> -		weight_total += weight;
> -	}
> +	for_each_node_mask(nid, nodemask)
> +		weight_total += table ? table[nid] : 1;
>  
>  	/* Calculate the node offset based on totals */
>  	target = ilx % weight_total;
>  	nid = first_node(nodemask);
>  	while (target) {
>  		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> +		weight = table[nid];
>  		if (target < weight)
>  			break;
>  		target -= weight;
>  		nid = next_node_in(nid, nodemask);
>  	}
> +out:
>  	rcu_read_unlock();
>  	return nid;
>  }
> @@ -2411,13 +2531,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
>  {
> +	struct weighted_interleave_state *state;
>  	struct task_struct *me = current;
>  	unsigned int cpuset_mems_cookie;
>  	unsigned long total_allocated = 0;
>  	unsigned long nr_allocated = 0;
>  	unsigned long rounds;
>  	unsigned long node_pages, delta;
> -	u8 *table, *weights, weight;
> +	u8 *weights, weight;
>  	unsigned int weight_total = 0;
>  	unsigned long rem_pages = nr_pages;
>  	nodemask_t nodes;
> @@ -2467,17 +2588,19 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		return total_allocated;
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	if (table)
> -		memcpy(weights, table, nr_node_ids);
> -	rcu_read_unlock();
> +	if (rcu_access_pointer(wi_state)) {
> +		state = rcu_dereference(wi_state);
> +		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
> +		rcu_read_unlock();
> +	} else {
> +		rcu_read_unlock();
> +		for (i = 0; i < nr_node_ids; i++)
> +			weights[i] = 1;
> +	}
>  
>  	/* calculate total, detect system default usage */
> -	for_each_node_mask(node, nodes) {
> -		if (!weights[node])
> -			weights[node] = 1;
> +	for_each_node_mask(node, nodes)
>  		weight_total += weights[node];
> -	}
>  
>  	/*
>  	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> @@ -3402,36 +3525,113 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
>  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>  			  const char *buf, size_t count)
>  {
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
>  	struct iw_node_attr *node_attr;
> -	u8 *new;
> -	u8 *old;
>  	u8 weight = 0;
> +	int i;
>  
>  	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
>  	if (count == 0 || sysfs_streq(buf, ""))
>  		weight = 0;
> -	else if (kstrtou8(buf, 0, &weight))
> +	else if (kstrtou8(buf, 0, &weight) || weight == 0)
>  		return -EINVAL;
>  
> -	new = kzalloc(nr_node_ids, GFP_KERNEL);
> -	if (!new)
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
>  		return -ENOMEM;
>  
>  	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> +	if (rcu_access_pointer(wi_state)) {
> +		old_wi_state = rcu_dereference_protected(wi_state,
>  					lockdep_is_held(&iw_table_lock));
> -	if (old)
> -		memcpy(new, old, nr_node_ids);
> -	new[node_attr->nid] = weight;
> -	rcu_assign_pointer(iw_table, new);
> +		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					nr_node_ids * sizeof(u8));
> +	} else {
> +		for (i = 0; i < nr_node_ids; i++)
> +			new_wi_state->iw_table[i] = 1;
> +	}
> +	new_wi_state->iw_table[node_attr->nid] = weight;
> +	new_wi_state->mode_auto = false;
> +
> +	rcu_assign_pointer(wi_state, new_wi_state);
>  	mutex_unlock(&iw_table_lock);
> -	synchronize_rcu();
> -	kfree(old);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
>  	return count;
>  }
>  
>  static struct iw_node_attr **node_attrs;
>  
> +static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	bool wi_auto = true;
> +
> +	rcu_read_lock();
> +	if (rcu_access_pointer(wi_state))
> +		wi_auto = rcu_dereference(wi_state)->mode_auto;
> +	rcu_read_unlock();
> +
> +	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
> +}
> +
> +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *bw;
> +	bool input;
> +	int i;
> +
> +	if (kstrtobool(buf, &input))
> +		return -EINVAL;
> +
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
> +		return -ENOMEM;
> +	mutex_lock(&iw_table_lock);
> +
> +	if (!input) {
> +		if (rcu_access_pointer(wi_state)) {
> +			old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&iw_table_lock));
> +			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					nr_node_ids * sizeof(u8));
> +		} else {
> +			for (i = 0; i < nr_node_ids; i++)
> +				new_wi_state->iw_table[i] = 1;
> +		}
> +		goto update_wi_state;
> +	}
> +
> +	bw = node_bw_table;
> +	if (!bw) {
> +		mutex_unlock(&iw_table_lock);
> +		kfree(new_wi_state);
> +		return -ENODEV;
> +	}
> +
> +	new_wi_state->mode_auto = true;
> +	reduce_interleave_weights(bw, new_wi_state->iw_table);
> +
> +update_wi_state:
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&iw_table_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +	return count;
> +}
> +
> +static struct kobj_attribute wi_attr =
> +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> +			   weighted_interleave_auto_store);
> +
>  static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
>  				  struct kobject *parent)
>  {
> @@ -3489,6 +3689,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
>  	return 0;
>  }
>  
> +static struct attribute *wi_default_attrs[] = {
> +	&wi_attr.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group wi_attr_group = {
> +	.attrs = wi_default_attrs,
> +};
> +
>  static int add_weighted_interleave_group(struct kobject *root_kobj)
>  {
>  	struct kobject *wi_kobj;
> @@ -3505,6 +3714,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>  		return err;
>  	}
>  
> +	err = sysfs_create_group(wi_kobj, &wi_attr_group);
> +	if (err) {
> +		pr_err("failed to add sysfs [auto]\n");
> +		kobject_put(wi_kobj);
> +		return err;
> +	}
> +
>  	for_each_node_state(nid, N_POSSIBLE) {
>  		err = add_weight_node(nid, wi_kobj);
>  		if (err) {
> @@ -3519,15 +3735,22 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>  
>  static void mempolicy_kobj_release(struct kobject *kobj)
>  {
> -	u8 *old;
> +	struct weighted_interleave_state *old_wi_state;
>  
>  	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	rcu_assign_pointer(iw_table, NULL);
> +	if (!rcu_access_pointer(wi_state)) {
> +		mutex_unlock(&iw_table_lock);
> +		goto out;
> +	}
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +			lockdep_is_held(&iw_table_lock));
> +
> +	rcu_assign_pointer(wi_state, NULL);
>  	mutex_unlock(&iw_table_lock);
>  	synchronize_rcu();
> -	kfree(old);
> +	kfree(old_wi_state);
> +out:
>  	kfree(node_attrs);
>  	kfree(kobj);
>  }
> -- 
> 2.43.5
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-02-26 21:35 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
  2025-02-26 21:35   ` [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes Joshua Hahn
  2025-02-28  0:16   ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning yunjeong.mun
@ 2025-02-28  6:39   ` Yunjeong Mun
  2025-02-28 16:24     ` Joshua Hahn
  2025-03-04 21:56     ` Joshua Hahn
  2 siblings, 2 replies; 34+ messages in thread
From: Yunjeong Mun @ 2025-02-28  6:39 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	kernel_team

Hi, Joshua. 

First of all I accidentally sent the wrong email a few hours ago.
Please disregard it. Sorry for the confusion.

On Wed, 26 Feb 2025 13:35:17 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

[...snip...]
>  
> +/*
> + * Convert bandwidth values into weighted interleave weights.
> + * Call with iw_table_lock.
> + */
> +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> +{
> +	u64 sum_bw = 0;
> +	unsigned int cast_sum_bw, sum_iw = 0;
> +	unsigned int scaling_factor = 1, iw_gcd = 1;
> +	int nid;
> +
> +	/* Recalculate the bandwidth distribution given the new info */
> +	for_each_node_state(nid, N_MEMORY)
> +		sum_bw += bw[nid];
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		/* Set memoryless nodes' weights to 1 to prevent div/0 later */
> +		if (!node_state(nid, N_MEMORY)) {
> +			new_iw[nid] = 1;
> +			continue;
> +		}
> +
> +		scaling_factor = 100 * bw[nid];
> +
> +		/*
> +		 * Try not to perform 64-bit division.
> +		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> +		 * If sum_bw > scaling_factor, then bw[nid] is less than
> +		 * 1% of the total bandwidth. Round up to 1%.
> +		 */
> +		if (bw[nid] && sum_bw < scaling_factor) {
> +			cast_sum_bw = (unsigned int)sum_bw;
> +			new_iw[nid] = scaling_factor / cast_sum_bw;
> +		} else {
> +			new_iw[nid] = 1;
> +		}
> +		sum_iw += new_iw[nid];
> +	}
> +
> +	/*
> +	 * Scale each node's share of the total bandwidth from percentages
> +	 * to whole numbers in the range [1, weightiness]
> +	 */
> +	for_each_node_state(nid, N_MEMORY) {
> +		scaling_factor = weightiness * new_iw[nid];
> +		new_iw[nid] = max(scaling_factor / sum_iw, 1);
> +		if (nid == 0)
> +			iw_gcd = new_iw[0];
> +		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> +	}
> +
> +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> +	for_each_node_state(nid, N_MEMORY)
> +		new_iw[nid] /= iw_gcd;
> +}

In my understanding, new_iw[nid] values are scaled twice, first to 100 and then to a 
weightines value of 32. I think this scaling can be done just once, directly 
to weightness value as follows:

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 50cbb7c047fa..65a7e2baf161 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -176,47 +176,22 @@ static u8 get_il_weight(int node)
 static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
 {
	u64 sum_bw = 0;
-	unsigned int cast_sum_bw, sum_iw = 0;
-	unsigned int scaling_factor = 1, iw_gcd = 1;
+	unsigned int scaling_factor = 1, iw_gcd = 0;
	int nid;

	/* Recalculate the bandwidth distribution given the new info */
	for_each_node_state(nid, N_MEMORY)
		sum_bw += bw[nid];

-       for (nid = 0; nid < nr_node_ids; nid++) {
 			[...snip...]
-		/*
-		 * Try not to perform 64-bit division.
-		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
-		 * If sum_bw > scaling_factor, then bw[nid] is less than
-		 * 1% of the total bandwidth. Round up to 1%.
-		 */
 			[...snip...]
-		sum_iw += new_iw[nid];
-	}
-
     
	/*
	 * Scale each node's share of the total bandwidth from percentages
	 * to whole numbers in the range [1, weightiness]
	 */
	for_each_node_state(nid, N_MEMORY) {
-		scaling_factor = weightiness * new_iw[nid];
-		new_iw[nid] = max(scaling_factor / sum_iw, 1);
-		if (nid == 0)
-			iw_gcd = new_iw[0];
+		scaling_factor = weightiness * bw[nid];
+		new_iw[nid] = max(scaling_factor / sum_bw, 1);
+		if (!iw_gcd)
+			iw_gcd = new_iw[nid];
		iw_gcd = gcd(iw_gcd, new_iw[nid]);
	}

Please let me know how you think about this.

Best regards,
Yunjeong


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-02-28  6:39   ` Yunjeong Mun
@ 2025-02-28 16:24     ` Joshua Hahn
  2025-03-04 21:56     ` Joshua Hahn
  1 sibling, 0 replies; 34+ messages in thread
From: Joshua Hahn @ 2025-02-28 16:24 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	kernel_team

On Fri, 28 Feb 2025 15:39:55 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

Hi Yunjeong, thank you for taking time to review my work!

> Hi, Joshua. 
> 
> First of all I accidentally sent the wrong email a few hours ago.
> Please disregard it. Sorry for the confusion.

No worries at all!

> On Wed, 26 Feb 2025 13:35:17 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> [...snip...]
> >  
> > +/*
> > + * Convert bandwidth values into weighted interleave weights.
> > + * Call with iw_table_lock.
> > + */
> > +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> > +{
> > +	u64 sum_bw = 0;
> > +	unsigned int cast_sum_bw, sum_iw = 0;
> > +	unsigned int scaling_factor = 1, iw_gcd = 1;
> > +	int nid;
> > +
> > +	/* Recalculate the bandwidth distribution given the new info */
> > +	for_each_node_state(nid, N_MEMORY)
> > +		sum_bw += bw[nid];
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		/* Set memoryless nodes' weights to 1 to prevent div/0 later */
> > +		if (!node_state(nid, N_MEMORY)) {
> > +			new_iw[nid] = 1;
> > +			continue;
> > +		}
> > +
> > +		scaling_factor = 100 * bw[nid];
> > +
> > +		/*
> > +		 * Try not to perform 64-bit division.
> > +		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> > +		 * If sum_bw > scaling_factor, then bw[nid] is less than
> > +		 * 1% of the total bandwidth. Round up to 1%.
> > +		 */
> > +		if (bw[nid] && sum_bw < scaling_factor) {
> > +			cast_sum_bw = (unsigned int)sum_bw;
> > +			new_iw[nid] = scaling_factor / cast_sum_bw;
> > +		} else {
> > +			new_iw[nid] = 1;
> > +		}
> > +		sum_iw += new_iw[nid];
> > +	}
> > +
> > +	/*
> > +	 * Scale each node's share of the total bandwidth from percentages
> > +	 * to whole numbers in the range [1, weightiness]
> > +	 */
> > +	for_each_node_state(nid, N_MEMORY) {
> > +		scaling_factor = weightiness * new_iw[nid];
> > +		new_iw[nid] = max(scaling_factor / sum_iw, 1);
> > +		if (nid == 0)
> > +			iw_gcd = new_iw[0];
> > +		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> > +	}
> > +
> > +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> > +	for_each_node_state(nid, N_MEMORY)
> > +		new_iw[nid] /= iw_gcd;
> > +}
> 
> In my understanding, new_iw[nid] values are scaled twice, first to 100 and then to a 
> weightines value of 32. I think this scaling can be done just once, directly 
> to weightness value as follows:

Yes,  you are correct. I want to provide a bit of context on how this
patch has changed over time: In the first few iterations of this patch, 
"weightiness" was actually exposed as a sysfs interface that users could
change to change how much they scaled for high numbers (better weight
accuracy, but worse page allocation distributon fairness) and small numbers
(bigger errors, but better local fairness).

The reason why this matters is that we use a heuristic of "round all
weights whose weights are less than 1% of the total weight sum to 1%".
So if we have bandwidth ratios of 100 : 1000 : 3000 : 4000 : 6000,
we have a sum total of 14100. Then 100/14100 is only ~0.7%, and we would
want to round it up to 1% before moving on (since weights that are too
small don't end up helping). This problem only gets worse for machines
with more nodes, and it becomes possible for a node to have something like
0.1% of the total bandwidth.

When users could set weightiness to be up to 255, this was problematic,
becuase scenarios where weights become 1:255:255:255:255... become possible,
where we allocate a single page from one node, then allocate 255 pages from
the remaining nr_node_ids - 1 nodes (which is of course, not ideal).
However, with weightiness fixed to 32, maybe this heuristic makes less sense,
since the worst-case-scenario looks like 1:32:32:32:32...

I think this proposed change makes a lot of sense. It does seem silly to
have to round twice, and without giving the users the ability to set thier
own weightiness value, rounding just once seems to be enough to prevent
the worst case scenario. I will incorporate this into a v7.

I'm also going to wait a bit for more feedback to come in for this version,
so it may be a little bit before I send v7 out : -)

Thanks again for your review and the proposed change. Have a great day!
Joshua

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 50cbb7c047fa..65a7e2baf161 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -176,47 +176,22 @@ static u8 get_il_weight(int node)
>  static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
>  {
> 	u64 sum_bw = 0;
> -	unsigned int cast_sum_bw, sum_iw = 0;
> -	unsigned int scaling_factor = 1, iw_gcd = 1;
> +	unsigned int scaling_factor = 1, iw_gcd = 0;
> 	int nid;
> 
> 	/* Recalculate the bandwidth distribution given the new info */
> 	for_each_node_state(nid, N_MEMORY)
> 		sum_bw += bw[nid];
> 
> -       for (nid = 0; nid < nr_node_ids; nid++) {
>  			[...snip...]
> -		/*
> -		 * Try not to perform 64-bit division.
> -		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> -		 * If sum_bw > scaling_factor, then bw[nid] is less than
> -		 * 1% of the total bandwidth. Round up to 1%.
> -		 */
>  			[...snip...]
> -		sum_iw += new_iw[nid];
> -	}
> -
>      
> 	/*
> 	 * Scale each node's share of the total bandwidth from percentages
> 	 * to whole numbers in the range [1, weightiness]
> 	 */
> 	for_each_node_state(nid, N_MEMORY) {
> -		scaling_factor = weightiness * new_iw[nid];
> -		new_iw[nid] = max(scaling_factor / sum_iw, 1);
> -		if (nid == 0)
> -			iw_gcd = new_iw[0];
> +		scaling_factor = weightiness * bw[nid];
> +		new_iw[nid] = max(scaling_factor / sum_bw, 1);
> +		if (!iw_gcd)
> +			iw_gcd = new_iw[nid];
> 		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> 	}
> 
> Please let me know how you think about this.
> 
> Best regards,
> Yunjeong

Sent using hkml (https://github.com/sjp38/hackermail)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-02-27  2:32     ` Honggyu Kim
  2025-02-27  3:20       ` Honggyu Kim
@ 2025-03-03 16:19       ` Gregory Price
  2025-03-04 13:03         ` Honggyu Kim
  2025-03-04 16:29       ` Gregory Price
  2 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-03-03 16:19 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: Joshua Hahn, harry.yoo, ying.huang, kernel_team, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
> 
> But using N_MEMORY doesn't fix this problem and it hides the entire CXL
> memory nodes in our system because the CXL memory isn't detected at this
> point of creating node*.  Maybe there is some difference when multiple
> CXL memory is detected as a single node.
> 

Hm, well, the node is "created" during early boot when ACPI tables are
read and the CFMW are discovered - but they aren't necessarily "online"
at the time they're created.

There is no true concept of a "Hotplug NUMA Node" - as the node must be
created at boot time. (tl;dr: N_POSSIBLE will never change).

This patch may have been a bit overzealous of us, I forgot to ask
whether N_MEMORY is set for nodes created but not onlined at boot. So
this is a good observation.

It also doesn't help that this may introduce a subtle race condition.

If a node exists (N_POSSIBLE) but hasn't been onlined (!N_MEMORY) and
bandwidth information is reported - then we store the bandwidth info
but don't include the node in the reduction.  Then if the node comes
online later, we don't re-trigger reduction.

Joshua we should just drop this patch for now and work with Honggyu and
friends separately on this issue.  In the meantime we can stick with
N_POSSIBLE.

There are more problems in this space - namely how to handle a system
whereby 8 CXL nodes are "possible" but the user only configures 2 (as
described by Hyonggye here).  We will probably need to introduce
hotplug/node on/offline callbacks to re-configure weights.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-02-27  3:20       ` Honggyu Kim
@ 2025-03-03 21:56         ` Joshua Hahn
  2025-03-04 12:53           ` Honggyu Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Joshua Hahn @ 2025-03-03 21:56 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: gourry, harry.yoo, ying.huang, kernel_team, gregkh, rakie.kim,
	akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron, dave.jiang,
	horen.chuang, hannes, linux-kernel, linux-acpi, linux-mm,
	kernel-team, yunjeong.mun

On Thu, 27 Feb 2025 12:20:03 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:

Hi Honggyu, thank you for taking time to review my patch, as always!
I thought I had sent this, but it seems like it was left in my draft
without being sent. 

I will follow Gregory's advice and we will drop the patch from this series,
and send the first patch only (with Yunjeong's changes). Thanks again!

> 
> On 2/27/2025 11:32 AM, Honggyu Kim wrote:
> > Hi Joshua,
> > 
> > On 2/27/2025 6:35 AM, Joshua Hahn wrote:
> >> We should never try to allocate memory from a memoryless node. Creating a
> >> sysfs knob to control its weighted interleave weight does not make sense,
> >> and can be unsafe.
> >>
> >> Only create weighted interleave weight knobs for nodes with memory.
> >>
> >> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> >> ---
> >>   mm/mempolicy.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> >> index 4cc04ff8f12c..50cbb7c047fa 100644
> >> --- a/mm/mempolicy.c
> >> +++ b/mm/mempolicy.c
> >> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct 
> >> kobject *root_kobj)
> >>           return err;
> >>       }
> >> -    for_each_node_state(nid, N_POSSIBLE) {
> > 
> > Actually, we're aware of this issue and currently trying to fix this.
> > In our system, we've attached 4ch of CXL memory for each socket as
> > follows.
> > 
> >          node0             node1
> >        +-------+   UPI   +-------+
> >        | CPU 0 |-+-----+-| CPU 1 |
> >        +-------+         +-------+
> >        | DRAM0 |         | DRAM1 |
> >        +---+---+         +---+---+
> >            |                 |
> >        +---+---+         +---+---+
> >        | CXL 0 |         | CXL 4 |
> >        +---+---+         +---+---+
> >        | CXL 1 |         | CXL 5 |
> >        +---+---+         +---+---+
> >        | CXL 2 |         | CXL 6 |
> >        +---+---+         +---+---+
> >        | CXL 3 |         | CXL 7 |
> >        +---+---+         +---+---+
> >          node2             node3
> > 
> > The 4ch of CXL memory are detected as a single NUMA node in each socket,
> > but it shows as follows with the current N_POSSIBLE loop.
> > 
> > $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
> > node0 node1 node2 node3 node4 node5
> > node6 node7 node8 node9 node10 node11

I see. For my education, would you mind explaining how the numbering works
here? I am not very familiar with this setup, and not sure how you would
figure out what node is which, just by looking at the numbering.

> >> +    for_each_node_state(nid, N_MEMORY) {
> 
> Thinking it again, we can leave it as a separate patch but add our patch 
> on top of it.

That sounds good to me. 

> The only concern I have is having only N_MEMORY patch hides weight
> setting knobs for CXL memory and it makes there is no way to set weight 
> values to CXL memory in my system.

You can use weighted interleave auto-tuning : -)
In all seriousness, this makes sense. It seems pretty problematic that
the knobs aren't created for the CXL channels, and I'm not sure that hiding
it is the correct approach here (it was not my intent, either).

> IMHO, this and our patch is better to be submitted together.

That sounds good. We can hold off on this patch then, and just consider
the first patch of this series. Thank you for letting me know!

Thank you for always reviewing my patches. Have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-03 21:56         ` Joshua Hahn
@ 2025-03-04 12:53           ` Honggyu Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Honggyu Kim @ 2025-03-04 12:53 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: kernel_team, gourry, harry.yoo, ying.huang, gregkh, rakie.kim,
	akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron, dave.jiang,
	horen.chuang, hannes, linux-kernel, linux-acpi, linux-mm,
	kernel-team, yunjeong.mun

Hi Joshua,

On 3/4/2025 6:56 AM, Joshua Hahn wrote:
> On Thu, 27 Feb 2025 12:20:03 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:
> 
> Hi Honggyu, thank you for taking time to review my patch, as always!

My pleasure!

> I thought I had sent this, but it seems like it was left in my draft
> without being sent.
> 
> I will follow Gregory's advice and we will drop the patch from this series,
> and send the first patch only (with Yunjeong's changes). Thanks again!

It'd be great if you could add her with the following.
Co-developed-by: Yunjeong Mun <yunjeong.mun@sk.com>

> 
>>
>> On 2/27/2025 11:32 AM, Honggyu Kim wrote:
>>> Hi Joshua,
>>>
>>> On 2/27/2025 6:35 AM, Joshua Hahn wrote:
>>>> We should never try to allocate memory from a memoryless node. Creating a
>>>> sysfs knob to control its weighted interleave weight does not make sense,
>>>> and can be unsafe.
>>>>
>>>> Only create weighted interleave weight knobs for nodes with memory.
>>>>
>>>> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>>>> ---
>>>>    mm/mempolicy.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>> index 4cc04ff8f12c..50cbb7c047fa 100644
>>>> --- a/mm/mempolicy.c
>>>> +++ b/mm/mempolicy.c
>>>> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct
>>>> kobject *root_kobj)
>>>>            return err;
>>>>        }
>>>> -    for_each_node_state(nid, N_POSSIBLE) {
>>>
>>> Actually, we're aware of this issue and currently trying to fix this.
>>> In our system, we've attached 4ch of CXL memory for each socket as
>>> follows.
>>>
>>>           node0             node1
>>>         +-------+   UPI   +-------+
>>>         | CPU 0 |-+-----+-| CPU 1 |
>>>         +-------+         +-------+
>>>         | DRAM0 |         | DRAM1 |
>>>         +---+---+         +---+---+
>>>             |                 |
>>>         +---+---+         +---+---+
>>>         | CXL 0 |         | CXL 4 |
>>>         +---+---+         +---+---+
>>>         | CXL 1 |         | CXL 5 |
>>>         +---+---+         +---+---+
>>>         | CXL 2 |         | CXL 6 |
>>>         +---+---+         +---+---+
>>>         | CXL 3 |         | CXL 7 |
>>>         +---+---+         +---+---+
>>>           node2             node3
>>>
>>> The 4ch of CXL memory are detected as a single NUMA node in each socket,
>>> but it shows as follows with the current N_POSSIBLE loop.
>>>
>>> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>>> node0 node1 node2 node3 node4 node5
>>> node6 node7 node8 node9 node10 node11

FYI, we used to set node2 and node3 only for weights for CXL memory here
and ignored node{4-11}.  That sounds silly but it worked.

> 
> I see. For my education, would you mind explaining how the numbering works
> here? I am not very familiar with this setup, and not sure how you would
> figure out what node is which, just by looking at the numbering.

Regarding the numbering, I'm not 100% sure, but I guess there could be a
logical NUMA node that combines 4ch of CXL memory and 4 nodes for CXL 
memory so in total 5 nodes per socket.

I don't have much knowledge on this but maybe this is related to PXM
(Proximity Domain).

> 
>>>> +    for_each_node_state(nid, N_MEMORY) {
>>
>> Thinking it again, we can leave it as a separate patch but add our patch
>> on top of it.
> 
> That sounds good to me.
> 
>> The only concern I have is having only N_MEMORY patch hides weight
>> setting knobs for CXL memory and it makes there is no way to set weight
>> values to CXL memory in my system.
> 
> You can use weighted interleave auto-tuning : -)

Not possible because using N_MEMORY doesn't provide "node" knobs for CXL
memory at all as follows.

$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
node0 node1

We need node2 and node3 for CXL memory here.

> In all seriousness, this makes sense. It seems pretty problematic that
> the knobs aren't created for the CXL channels,

Yeah, it's even worse than the current status.

> and I'm not sure that hiding> it is the correct approach here (it was not my intent, either).

It isn't your problem but we shouldn't hide those nodes until it is
correctly fixed with hot plugging event handler.

> 
>> IMHO, this and our patch is better to be submitted together.
> 
> That sounds good. We can hold off on this patch then, and just consider
> the first patch of this series. Thank you for letting me know!

The N_POSSIBLE and N_MEMORY stuffs should had been fixed earlier than
this work. I will take a few days if we can submit it together.

> 
> Thank you for always reviewing my patches. Have a great day!
> Joshua

Thanks for your work and have a great day you too!

Kind regards,
Honggyu

> 
> Sent using hkml (https://github.com/sjp38/hackermail)
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-03 16:19       ` Gregory Price
@ 2025-03-04 13:03         ` Honggyu Kim
  2025-03-04 16:16           ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Honggyu Kim @ 2025-03-04 13:03 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

Hi Gregory,

On 3/4/2025 1:19 AM, Gregory Price wrote:
> On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
>>
>> But using N_MEMORY doesn't fix this problem and it hides the entire CXL
>> memory nodes in our system because the CXL memory isn't detected at this
>> point of creating node*.  Maybe there is some difference when multiple
>> CXL memory is detected as a single node.
>>
> 
> Hm, well, the node is "created" during early boot when ACPI tables are
> read and the CFMW are discovered - but they aren't necessarily "online"
> at the time they're created.
> 
> There is no true concept of a "Hotplug NUMA Node" - as the node must be
> created at boot time. (tl;dr: N_POSSIBLE will never change).
> 
> This patch may have been a bit overzealous of us, I forgot to ask
> whether N_MEMORY is set for nodes created but not onlined at boot. So
> this is a good observation.

I didn't want to make more noise but we found many issues again after
getting a new machine and started using it with multiple CXL memory.

> 
> It also doesn't help that this may introduce a subtle race condition.
> 
> If a node exists (N_POSSIBLE) but hasn't been onlined (!N_MEMORY) and
> bandwidth information is reported - then we store the bandwidth info
> but don't include the node in the reduction.  Then if the node comes
> online later, we don't re-trigger reduction.
> 
> Joshua we should just drop this patch for now and work with Honggyu and
> friends separately on this issue.  In the meantime we can stick with
> N_POSSIBLE.
> 
> There are more problems in this space - namely how to handle a system
> whereby 8 CXL nodes are "possible" but the user only configures 2 (as
> described by Hyonggye here).  We will probably need to introduce
> hotplug/node on/offline callbacks to re-configure weights.
> 
> ~Gregory

This work won't take a long time so I think we can submit a patch within 
a few days.

Thanks,
Honggyu


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-04 13:03         ` Honggyu Kim
@ 2025-03-04 16:16           ` Gregory Price
  0 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-03-04 16:16 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

On Tue, Mar 04, 2025 at 10:03:22PM +0900, Honggyu Kim wrote:
> Hi Gregory,
> 
> > This patch may have been a bit overzealous of us, I forgot to ask
> > whether N_MEMORY is set for nodes created but not onlined at boot. So
> > this is a good observation.
> 
> I didn't want to make more noise but we found many issues again after
> getting a new machine and started using it with multiple CXL memory.
> 

I spent yesterday looking into how nodes are created and marked N_MEMORY
and I think now that this patch is just not correct.

N_MEMORY for a given nid is toggled:
  1) during mm_init if any page is associated with that node (DRAM)
  2) memory_hotplug when a memory block is onlined/offlined  (CXL)

This means a CXL node which is deferred to the driver will come up as
memoryless at boot (mm_init) but has N_MEMORY toggled on when the first
hotplug memory block is onlined.  However, its access_coordinate data is
reported during cxl driver probe - well prior to memory hotplug.

This means we must expose a node entry for every possible node, always,
because we can't predict what nodes will have hotplug memory.

We COULD try to react to hotplug memory blocks, but this increase in
complexity just doesn't seem worth the hassle - the hotplug callback has
timing restrictions (callback must occur AFTER N_MEMORY is toggled).

It seems better to include all nodes with reported data in the reduction.

This has two downsides:
  1) stale data may be used if hotplug occurs and the new device does
     not have CDAT/HMAT/access_coordinate data.
  2) any device without CDAT/HMAT/access_coordinate data will not be
     included in the reduction by default.

I think we can work around #2 by detecting this (on reduction, if data
is missing but N_MEMORY is set, fire a warning).  We can't do much about
#1 unless we field physical device hot-un-plug callbacks - and that
seems like a bit much.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-02-27  2:32     ` Honggyu Kim
  2025-02-27  3:20       ` Honggyu Kim
  2025-03-03 16:19       ` Gregory Price
@ 2025-03-04 16:29       ` Gregory Price
  2025-03-06 12:39         ` Honggyu Kim
  2 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-03-04 16:29 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: Joshua Hahn, harry.yoo, ying.huang, kernel_team, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
> Actually, we're aware of this issue and currently trying to fix this.
> In our system, we've attached 4ch of CXL memory for each socket as
> follows.
> 
>         node0             node1
>       +-------+   UPI   +-------+
>       | CPU 0 |-+-----+-| CPU 1 |
>       +-------+         +-------+
>       | DRAM0 |         | DRAM1 |
>       +---+---+         +---+---+
>           |                 |
>       +---+---+         +---+---+
>       | CXL 0 |         | CXL 4 |
>       +---+---+         +---+---+
>       | CXL 1 |         | CXL 5 |
>       +---+---+         +---+---+
>       | CXL 2 |         | CXL 6 |
>       +---+---+         +---+---+
>       | CXL 3 |         | CXL 7 |
>       +---+---+         +---+---+
>         node2             node3
> 
> The 4ch of CXL memory are detected as a single NUMA node in each socket,
> but it shows as follows with the current N_POSSIBLE loop.
> 
> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
> node0 node1 node2 node3 node4 node5
> node6 node7 node8 node9 node10 node11

This is insufficient information for me to assess the correctness of the
configuration. Can you please show the contents of your CEDT/CFMWS and
SRAT/Memory Affinity structures?

mkdir acpi_data && cd acpi_data
acpidump -b
iasl -d *
cat cedt.dsl  <- find all CFMWS entries
cat srat.dsl  <- find all Memory Affinity entries

Basically I need to know:
1) Is each CXL device on a dedicated Host Bridge?
2) Is inter-host-bridge interleaving configured?
3) Is intra-host-bridge interleaving configured?
4) Do SRAT entries exist for all nodes?
5) Why are there 12 nodes but only 10 sources? Are there additional
   devices left out of your diagram? Are there 2 CFMWS but and 8 Memory
   Affinity records - resulting in 10 nodes? This is strange.

By default, Linux creates a node for each proximity domain ("PXM")
detected in the SRAT Memory Affinity tables. If SRAT entries for a
memory region described in a CFMWS is absent, it will also create an
node for that CFMWS.

Your reported configuration and results lead me to believe you have
a combination of CFMWS/SRAT configurations that are unexpected.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-02-28  6:39   ` Yunjeong Mun
  2025-02-28 16:24     ` Joshua Hahn
@ 2025-03-04 21:56     ` Joshua Hahn
  2025-03-04 22:22       ` Joshua Hahn
  1 sibling, 1 reply; 34+ messages in thread
From: Joshua Hahn @ 2025-03-04 21:56 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	kernel_team

On Fri, 28 Feb 2025 15:39:55 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

Hi Yunjeong,

While applying your patch, I realized that it re-introduces a build error
that was fixed in v6, which I am noting below. 

> Hi, Joshua. 

[...snip...]
 
> In my understanding, new_iw[nid] values are scaled twice, first to 100 and then to a 
> weightines value of 32. I think this scaling can be done just once, directly 
> to weightness value as follows:
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 50cbb7c047fa..65a7e2baf161 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -176,47 +176,22 @@ static u8 get_il_weight(int node)
>  static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
>  {
> 	u64 sum_bw = 0;
> -	unsigned int cast_sum_bw, sum_iw = 0;
> -	unsigned int scaling_factor = 1, iw_gcd = 1;
> +	unsigned int scaling_factor = 1, iw_gcd = 0;
> 	int nid;
> 
> 	/* Recalculate the bandwidth distribution given the new info */
> 	for_each_node_state(nid, N_MEMORY)
> 		sum_bw += bw[nid];
> 
> -       for (nid = 0; nid < nr_node_ids; nid++) {
>  			[...snip...]
> -		/*
> -		 * Try not to perform 64-bit division.
> -		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> -		 * If sum_bw > scaling_factor, then bw[nid] is less than
> -		 * 1% of the total bandwidth. Round up to 1%.
> -		 */
>  			[...snip...]

We cannot remove this part here, since this is what allows us to divide
in the next for loop below. sum_bw is a u64, so performing division
by this value will create a build error for 32-bit machines. I've gone and
re-added this comment and parts to the bottom part; the logic should not
change at all from the patch that you proposed (except for the build error).

It's not a big deal, but I just wanted to note that this patch was not applied
in its entirety in the v7 of my patch, in case you look at v7 and see
that the code looks different from your branch.

Honggyu also suggested that I add a Co-developed-by tag, which I am happy
to do. However, this requires a subsequent Signed-off-by tag as well, as
per the kerneldoc on patches [1]. I just wanted to have your explicit
signed-off-by so that I could add it to the patch.

> -		sum_iw += new_iw[nid];
> -	}
> -
>      
> 	/*
> 	 * Scale each node's share of the total bandwidth from percentages
> 	 * to whole numbers in the range [1, weightiness]
> 	 */
> 	for_each_node_state(nid, N_MEMORY) {
> -		scaling_factor = weightiness * new_iw[nid];
> -		new_iw[nid] = max(scaling_factor / sum_iw, 1);
> -		if (nid == 0)
> -			iw_gcd = new_iw[0];
> +		scaling_factor = weightiness * bw[nid];
> +		new_iw[nid] = max(scaling_factor / sum_bw, 1);
						 ^^^^^^^^
						This causes a build error for
						32 bit machines

[...snip...]

> Please let me know how you think about this.
> 
> Best regards,
> Yunjeong

Thanks again Yunjeong, I hope you have a great day!
Joshua

[1] https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by

Sent using hkml (https://github.com/sjp38/hackermail)



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-03-04 21:56     ` Joshua Hahn
@ 2025-03-04 22:22       ` Joshua Hahn
  2025-03-05  9:49         ` Yunjeong Mun
  0 siblings, 1 reply; 34+ messages in thread
From: Joshua Hahn @ 2025-03-04 22:22 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Yunjeong Mun, honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	kernel_team

Hi Yunjeong, sorry for the noise, but I have discovered another potential
concern that your patch introduces, which I have explained below. 

On Tue,  4 Mar 2025 13:56:11 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On Fri, 28 Feb 2025 15:39:55 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> 
> Hi Yunjeong,
> 
> While applying your patch, I realized that it re-introduces a build error
> that was fixed in v6, which I am noting below. 
> 
> > Hi, Joshua. 
> 
> [...snip...]
>  
> > In my understanding, new_iw[nid] values are scaled twice, first to 100 and then to a 
> > weightines value of 32. I think this scaling can be done just once, directly 
> > to weightness value as follows:
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 50cbb7c047fa..65a7e2baf161 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -176,47 +176,22 @@ static u8 get_il_weight(int node)
> >  static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> >  {
> > 	u64 sum_bw = 0;
> > -	unsigned int cast_sum_bw, sum_iw = 0;
> > -	unsigned int scaling_factor = 1, iw_gcd = 1;
> > +	unsigned int scaling_factor = 1, iw_gcd = 0;
> > 	int nid;
> > 
> > 	/* Recalculate the bandwidth distribution given the new info */
> > 	for_each_node_state(nid, N_MEMORY)
> > 		sum_bw += bw[nid];
> > 
> > -       for (nid = 0; nid < nr_node_ids; nid++) {
> >  			[...snip...]
			^^^^^^^^^^^^
When I was originally writing the response, I missed reviewing the contents
inside this snipped section, which looks like this:
		if (!node_state(nid, N_MEMORY)) {
			new_iw[nid] = 1;
			continue;
		}
I introduced this check in v6 because without this, we end up with the
possibility of memoryless nodes having a 0 in the table, which can lead to some
problems down the line (e.g. div by 0 in alloc_pages_bulk_weighted_interleave).

Respectfully, I would prefer to write my own version that takes your
suggestion, as opposed to applying this patch directly on top of mine so that
we do not introduce the build error or the potential div0. However, v7 will
include your suggestion, so it will go through only one loop as opposed to two.

Thank you for your feedback again. I hope you have a great day!
Joshua

> > -		/*
> > -		 * Try not to perform 64-bit division.
> > -		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> > -		 * If sum_bw > scaling_factor, then bw[nid] is less than
> > -		 * 1% of the total bandwidth. Round up to 1%.
> > -		 */
> >  			[...snip...]
> 
> We cannot remove this part here, since this is what allows us to divide
> in the next for loop below. sum_bw is a u64, so performing division
> by this value will create a build error for 32-bit machines. I've gone and
> re-added this comment and parts to the bottom part; the logic should not
> change at all from the patch that you proposed (except for the build error).

[...snip...]

Sent using hkml (https://github.com/sjp38/hackermail)



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-03-04 22:22       ` Joshua Hahn
@ 2025-03-05  9:49         ` Yunjeong Mun
  2025-03-05 16:28           ` Joshua Hahn
  0 siblings, 1 reply; 34+ messages in thread
From: Yunjeong Mun @ 2025-03-05  9:49 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	kernel_team

Hi Joshua, thanks for reviewing my patch and for your kind explanation.

On Tue,  4 Mar 2025 14:22:51 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> Hi Yunjeong, sorry for the noise, but I have discovered another potential
> concern that your patch introduces, which I have explained below. 
> 
> On Tue,  4 Mar 2025 13:56:11 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> > On Fri, 28 Feb 2025 15:39:55 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > 
> > Hi Yunjeong,
> > 
> > While applying your patch, I realized that it re-introduces a build error
> > that was fixed in v6, which I am noting below. 
> > 
> > > Hi, Joshua. 
> > 
> > [...snip...]
> >  
> > > In my understanding, new_iw[nid] values are scaled twice, first to 100 and then to a 
> > > weightines value of 32. I think this scaling can be done just once, directly 
> > > to weightness value as follows:
> > > 
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index 50cbb7c047fa..65a7e2baf161 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -176,47 +176,22 @@ static u8 get_il_weight(int node)
> > >  static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> > >  {
> > > 	u64 sum_bw = 0;
> > > -	unsigned int cast_sum_bw, sum_iw = 0;
> > > -	unsigned int scaling_factor = 1, iw_gcd = 1;
> > > +	unsigned int scaling_factor = 1, iw_gcd = 0;
> > > 	int nid;
> > > 
> > > 	/* Recalculate the bandwidth distribution given the new info */
> > > 	for_each_node_state(nid, N_MEMORY)
> > > 		sum_bw += bw[nid];
> > > 
> > > -       for (nid = 0; nid < nr_node_ids; nid++) {
> > >  			[...snip...]
> 			^^^^^^^^^^^^
> When I was originally writing the response, I missed reviewing the contents
> inside this snipped section, which looks like this:
> 		if (!node_state(nid, N_MEMORY)) {
> 			new_iw[nid] = 1;
> 			continue;
> 		}
> I introduced this check in v6 because without this, we end up with the
> possibility of memoryless nodes having a 0 in the table, which can lead to some
> problems down the line (e.g. div by 0 in alloc_pages_bulk_weighted_interleave).

To prevent division by 0 errors, how about setting new_iw to 1 when it is first 
created, instead of setting it in the reduce function?

> 
> Respectfully, I would prefer to write my own version that takes your
> suggestion, as opposed to applying this patch directly on top of mine so that
> we do not introduce the build error or the potential div0. However, v7 will
> include your suggestion, so it will go through only one loop as opposed to two.

Thanks for considering my suggestion. I look forward to the v7.

Best regards,
Yunjeong
> 
> Thank you for your feedback again. I hope you have a great day!
> Joshua
> 
> > > -		/*
> > > -		 * Try not to perform 64-bit division.
> > > -		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> > > -		 * If sum_bw > scaling_factor, then bw[nid] is less than
> > > -		 * 1% of the total bandwidth. Round up to 1%.
> > > -		 */
> > >  			[...snip...]
> > 
> > We cannot remove this part here, since this is what allows us to divide
> > in the next for loop below. sum_bw is a u64, so performing division
> > by this value will create a build error for 32-bit machines. I've gone and
> > re-added this comment and parts to the bottom part; the logic should not
> > change at all from the patch that you proposed (except for the build error).
> 
> [...snip...]
> 
> Sent using hkml (https://github.com/sjp38/hackermail)
> 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-03-05  9:49         ` Yunjeong Mun
@ 2025-03-05 16:28           ` Joshua Hahn
  0 siblings, 0 replies; 34+ messages in thread
From: Joshua Hahn @ 2025-03-05 16:28 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: honggyu.kim, gregkh, rakie.kim, akpm, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, linux-kernel, linux-acpi, linux-mm, kernel-team,
	kernel_team

Hi Yunjeong,

On Wed,  5 Mar 2025 18:49:11 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

> Hi Joshua, thanks for reviewing my patch and for your kind explanation.

[...snip...]

> > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > index 50cbb7c047fa..65a7e2baf161 100644
> > > > --- a/mm/mempolicy.c
> > > > +++ b/mm/mempolicy.c
> > > > @@ -176,47 +176,22 @@ static u8 get_il_weight(int node)
> > > >  static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> > > >  {
> > > > 	u64 sum_bw = 0;
> > > > -	unsigned int cast_sum_bw, sum_iw = 0;
> > > > -	unsigned int scaling_factor = 1, iw_gcd = 1;
> > > > +	unsigned int scaling_factor = 1, iw_gcd = 0;
> > > > 	int nid;
> > > > 
> > > > 	/* Recalculate the bandwidth distribution given the new info */
> > > > 	for_each_node_state(nid, N_MEMORY)
> > > > 		sum_bw += bw[nid];
> > > > 
> > > > -       for (nid = 0; nid < nr_node_ids; nid++) {
> > > >  			[...snip...]
> > 			^^^^^^^^^^^^
> > When I was originally writing the response, I missed reviewing the contents
> > inside this snipped section, which looks like this:
> > 		if (!node_state(nid, N_MEMORY)) {
> > 			new_iw[nid] = 1;
> > 			continue;
> > 		}
> > I introduced this check in v6 because without this, we end up with the
> > possibility of memoryless nodes having a 0 in the table, which can lead to some
> > problems down the line (e.g. div by 0 in alloc_pages_bulk_weighted_interleave).
> 
> To prevent division by 0 errors, how about setting new_iw to 1 when it is first 
> created, instead of setting it in the reduce function?

I think this makes sense. The original motivation for including it in
reduce_interleave_weights is because this function is usually called on newly
allocated tables, so I thought I would just combine the functionality of
initializing the table and reducing weights into one function. Howver, I now
see that there are actually a few spots when either a table is initialized
but this function isn't called, or when an already-initialized table is
given to this function.

The other rationale was that it seems a bit silly to go through and set all
weights to 1, and then immediately overwrite them with the reduced interleave
weights. With that said, none of this code is in any critical section, I'm
sure that going through one more iteration and setting the weights to 1 is
not unreasonable. 

[...snip...]

> > 
> > Respectfully, I would prefer to write my own version that takes your
> > suggestion, as opposed to applying this patch directly on top of mine so that
> > we do not introduce the build error or the potential div0. However, v7 will
> > include your suggestion, so it will go through only one loop as opposed to two.
> 
> Thanks for considering my suggestion. I look forward to the v7.
> 
> Best regards,
> Yunjeong

Thank you again for your suggestions, Yunjeong! I'll re-write the code
to incorporate them in v7. I hope you have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-04 16:29       ` Gregory Price
@ 2025-03-06 12:39         ` Honggyu Kim
  2025-03-06 17:32           ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Honggyu Kim @ 2025-03-06 12:39 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

Hi Gregory,

On 3/5/2025 1:29 AM, Gregory Price wrote:
> On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
>> Actually, we're aware of this issue and currently trying to fix this.
>> In our system, we've attached 4ch of CXL memory for each socket as
>> follows.
>>
>>          node0             node1
>>        +-------+   UPI   +-------+
>>        | CPU 0 |-+-----+-| CPU 1 |
>>        +-------+         +-------+
>>        | DRAM0 |         | DRAM1 |
>>        +---+---+         +---+---+
>>            |                 |
>>        +---+---+         +---+---+
>>        | CXL 0 |         | CXL 4 |
>>        +---+---+         +---+---+
>>        | CXL 1 |         | CXL 5 |
>>        +---+---+         +---+---+
>>        | CXL 2 |         | CXL 6 |
>>        +---+---+         +---+---+
>>        | CXL 3 |         | CXL 7 |
>>        +---+---+         +---+---+
>>          node2             node3
>>
>> The 4ch of CXL memory are detected as a single NUMA node in each socket,
>> but it shows as follows with the current N_POSSIBLE loop.
>>
>> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>> node0 node1 node2 node3 node4 node5
>> node6 node7 node8 node9 node10 node11
> 
> This is insufficient information for me to assess the correctness of the
> configuration. Can you please show the contents of your CEDT/CFMWS and
> SRAT/Memory Affinity structures?
> 
> mkdir acpi_data && cd acpi_data
> acpidump -b
> iasl -d *
> cat cedt.dsl  <- find all CFMWS entries
> cat srat.dsl  <- find all Memory Affinity entries

I'm not able to provide all the details as srat.dsl has too much info.

   $ wc -l srat.dsl
   25229 srat.dsl

Instead, I can show you that there are 4 diffferent proximity domains
with "Enabled : 1" with the following filtered output from srat.dsl.

   $ grep -E "Proximity Domain :|Enabled : " srat.dsl | cut -c 31- | sed 
'N;s/\n//' | sort | uniq
          Enabled : 0       Enabled : 0
   Proximity Domain : 00000000       Enabled : 0
   Proximity Domain : 00000000       Enabled : 1
   Proximity Domain : 00000001       Enabled : 1
   Proximity Domain : 00000006       Enabled : 1
   Proximity Domain : 00000007       Enabled : 1

We don't actually have to use those complicated commands to check this
as dmesg clearly prints the SRAT and node numbers as follows.

   [    0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
   [    0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
   [    0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 
0x60f80000000-0x64f7fffffff]
   [    0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 
0x2080000000-0x807fffffff] hotplug
   [    0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 
0x64f80000000-0x6cf7fffffff] hotplug

The memoryless nodes are printed as follows after those ACPI, SRAT,
Node N PXM M messages.

   [    0.010927] Initmem setup node 0 [mem 
0x0000000000001000-0x000000207effffff]
   [    0.010930] Initmem setup node 1 [mem 
0x0000060f80000000-0x0000064f7fffffff]
   [    0.010992] Initmem setup node 2 as memoryless
   [    0.011055] Initmem setup node 3 as memoryless
   [    0.011115] Initmem setup node 4 as memoryless
   [    0.011177] Initmem setup node 5 as memoryless
   [    0.011238] Initmem setup node 6 as memoryless
   [    0.011299] Initmem setup node 7 as memoryless
   [    0.011361] Initmem setup node 8 as memoryless
   [    0.011422] Initmem setup node 9 as memoryless
   [    0.011484] Initmem setup node 10 as memoryless
   [    0.011544] Initmem setup node 11 as memoryless

This is related why the 12 nodes at sysfs knobs are provided with the
current N_POSSIBLE loop.

> 
> Basically I need to know:
> 1) Is each CXL device on a dedicated Host Bridge?
> 2) Is inter-host-bridge interleaving configured?
> 3) Is intra-host-bridge interleaving configured?
> 4) Do SRAT entries exist for all nodes?

Are there some simple commands that I can get those info?

> 5) Why are there 12 nodes but only 10 sources? Are there additional
>     devices left out of your diagram? Are there 2 CFMWS but and 8 Memory
>     Affinity records - resulting in 10 nodes? This is strange.

My blind guess is that there could be a logic node that combines 4ch of
CXL memory so there are 5 nodes per each socket.  Adding 2 nodes for
local CPU/DRAM makes 12 nodes in total.

> 
> By default, Linux creates a node for each proximity domain ("PXM")
> detected in the SRAT Memory Affinity tables. If SRAT entries for a
> memory region described in a CFMWS is absent, it will also create an
> node for that CFMWS.
> 
> Your reported configuration and results lead me to believe you have
> a combination of CFMWS/SRAT configurations that are unexpected.
> 
> ~Gregory

Not sure about this part but our approach with hotplug_memory_notifier()
resolves this problem.  Rakie will submit an initial working patchset 
soonish.

Thanks,
Honggyu


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-06 12:39         ` Honggyu Kim
@ 2025-03-06 17:32           ` Gregory Price
  2025-03-07 11:46             ` Honggyu Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-03-06 17:32 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

On Thu, Mar 06, 2025 at 09:39:26PM +0900, Honggyu Kim wrote:
> 
> The memoryless nodes are printed as follows after those ACPI, SRAT,
> Node N PXM M messages.
> 
>   [    0.010927] Initmem setup node 0 [mem
> 0x0000000000001000-0x000000207effffff]
>   [    0.010930] Initmem setup node 1 [mem
> 0x0000060f80000000-0x0000064f7fffffff]
>   [    0.010992] Initmem setup node 2 as memoryless
>   [    0.011055] Initmem setup node 3 as memoryless
>   [    0.011115] Initmem setup node 4 as memoryless
>   [    0.011177] Initmem setup node 5 as memoryless
>   [    0.011238] Initmem setup node 6 as memoryless
>   [    0.011299] Initmem setup node 7 as memoryless
>   [    0.011361] Initmem setup node 8 as memoryless
>   [    0.011422] Initmem setup node 9 as memoryless
>   [    0.011484] Initmem setup node 10 as memoryless
>   [    0.011544] Initmem setup node 11 as memoryless
> 
> This is related why the 12 nodes at sysfs knobs are provided with the
> current N_POSSIBLE loop.
> 

This isn't actually why, this is another symptom.  This gets printed
because someone is marking nodes 4-11 as possible and setup_nr_node_ids
reports 12 total nodes

void __init setup_nr_node_ids(void)
{
        unsigned int highest;

        highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
        nr_node_ids = highest + 1;
}

Given your configuration data so far, we may have a bug somewhere (or
i'm missing a configuration piece).

> > Basically I need to know:
> > 1) Is each CXL device on a dedicated Host Bridge?
> > 2) Is inter-host-bridge interleaving configured?
> > 3) Is intra-host-bridge interleaving configured?
> > 4) Do SRAT entries exist for all nodes?
> 
> Are there some simple commands that I can get those info?
> 

The content of the CEDT would be sufficient - that will show us the
number of CXL host bridges.

> > 5) Why are there 12 nodes but only 10 sources? Are there additional
> >     devices left out of your diagram? Are there 2 CFMWS but and 8 Memory
> >     Affinity records - resulting in 10 nodes? This is strange.
> 
> My blind guess is that there could be a logic node that combines 4ch of
> CXL memory so there are 5 nodes per each socket.  Adding 2 nodes for
> local CPU/DRAM makes 12 nodes in total.
>

The issue is that nodes have associated memory regions.  If there are
multiple nodes with overlapping memory regions, that seems problematic.

If there are "possible nodes" without memory and no real use case
(because the memory is associated with the aggregate node) then those
nodes probably shouldn't be reported as possible.

the tl;dr here is we should figure out what is marking those nodes as
possible.

> Not sure about this part but our approach with hotplug_memory_notifier()
> resolves this problem.  Rakie will submit an initial working patchset
> soonish.

This may just be a bandaid on the issue.  We should get our node
configuration correct from the get-go.

~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-06 17:32           ` Gregory Price
@ 2025-03-07 11:46             ` Honggyu Kim
  2025-03-07 17:51               ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Honggyu Kim @ 2025-03-07 11:46 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun



On 3/7/2025 2:32 AM, Gregory Price wrote:
> On Thu, Mar 06, 2025 at 09:39:26PM +0900, Honggyu Kim wrote:
>>
>> The memoryless nodes are printed as follows after those ACPI, SRAT,
>> Node N PXM M messages.
>>
>>    [    0.010927] Initmem setup node 0 [mem
>> 0x0000000000001000-0x000000207effffff]
>>    [    0.010930] Initmem setup node 1 [mem
>> 0x0000060f80000000-0x0000064f7fffffff]
>>    [    0.010992] Initmem setup node 2 as memoryless
>>    [    0.011055] Initmem setup node 3 as memoryless
>>    [    0.011115] Initmem setup node 4 as memoryless
>>    [    0.011177] Initmem setup node 5 as memoryless
>>    [    0.011238] Initmem setup node 6 as memoryless
>>    [    0.011299] Initmem setup node 7 as memoryless
>>    [    0.011361] Initmem setup node 8 as memoryless
>>    [    0.011422] Initmem setup node 9 as memoryless
>>    [    0.011484] Initmem setup node 10 as memoryless
>>    [    0.011544] Initmem setup node 11 as memoryless
>>
>> This is related why the 12 nodes at sysfs knobs are provided with the
>> current N_POSSIBLE loop.
>>
> 
> This isn't actually why, this is another symptom.  This gets printed
> because someone is marking nodes 4-11 as possible and setup_nr_node_ids
> reports 12 total nodes
> 
> void __init setup_nr_node_ids(void)
> {
>          unsigned int highest;
> 
>          highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
>          nr_node_ids = highest + 1;
> }
> 
> Given your configuration data so far, we may have a bug somewhere (or
> i'm missing a configuration piece).

Maybe there could be some misunderstanding on this issue.
This isn't a problem of NUMA detection for CXL memory but just a problem
of number of "node" knobs only for weighted interleave.

The number of nodes in 'numactl -H' shows the correct nodes even without
our fix.

   $ numactl -H
   available: 4 nodes (0-3)
   node 0 cpus: 0 1 2 3 ...
   node 0 size: 128504 MB
   node 0 free: 118563 MB
   node 1 cpus: 144 145 146 147 ...
   node 1 size: 257961 MB
   node 1 free: 242628 MB
   node 2 cpus:
   node 2 size: 393216 MB
   node 2 free: 393216 MB
   node 3 cpus:
   node 3 size: 524288 MB
   node 3 free: 524288 MB
   node distances:
   node     0    1    2    3
      0:   10   21   14   24
      1:   21   10   24   14
      2:   14   24   10   26
      3:   24   14   26   10

You can see more info below.

   $ cd /sys/devices/system/node

   $ ls -d node*
   node0  node1  node2  node3

   $ cat possible
   0-11

   $ cat online
   0-3

   $ cat has_memory
   0-3

   $ cat has_normal_memory
   0-1

   $ cat has_cpu
   0-1

>>> Basically I need to know:
>>> 1) Is each CXL device on a dedicated Host Bridge?
>>> 2) Is inter-host-bridge interleaving configured?
>>> 3) Is intra-host-bridge interleaving configured?
>>> 4) Do SRAT entries exist for all nodes?
>>
>> Are there some simple commands that I can get those info?
>>
> 
> The content of the CEDT would be sufficient - that will show us the
> number of CXL host bridges.

Which command do we need for this info specifically?  My output doesn't
provide some useful info for that.

   $ acpidump -b
   $ iasl -d *
   $ cat cedt.dsl
       ...
   **** Unknown ACPI table signature [CEDT]

> 
>>> 5) Why are there 12 nodes but only 10 sources? Are there additional
>>>      devices left out of your diagram? Are there 2 CFMWS but and 8 Memory
>>>      Affinity records - resulting in 10 nodes? This is strange.
>>
>> My blind guess is that there could be a logic node that combines 4ch of
>> CXL memory so there are 5 nodes per each socket.  Adding 2 nodes for
>> local CPU/DRAM makes 12 nodes in total.
>>
> 
> The issue is that nodes have associated memory regions.  If there are
> multiple nodes with overlapping memory regions, that seems problematic.
> 
> If there are "possible nodes" without memory and no real use case
> (because the memory is associated with the aggregate node) then those
> nodes probably shouldn't be reported as possible.
> 
> the tl;dr here is we should figure out what is marking those nodes as
> possible.
> 
>> Not sure about this part but our approach with hotplug_memory_notifier()
>> resolves this problem.  Rakie will submit an initial working patchset
>> soonish.
> 
> This may just be a bandaid on the issue.  We should get our node
> configuration correct from the get-go.

Not sure about it.  This must be fixed ASAP because current kernel is
broken on this issue and the fix should go into hotfix tree first.

If you can think this is just a bandaid, but leaving it bleeding as is
not the right approach.

Our fix was posted a few hours ago.  Please have a look, then think
about the apprach again.
https://lore.kernel.org/linux-mm/20250307063534.540-1-rakie.kim@sk.com

Thanks,
Honggyu

> 
> ~Gregory



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-07 11:46             ` Honggyu Kim
@ 2025-03-07 17:51               ` Gregory Price
  2025-03-10 12:26                 ` Honggyu Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-03-07 17:51 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

On Fri, Mar 07, 2025 at 08:46:46PM +0900, Honggyu Kim wrote:
> You can see more info below.
> 
>   $ cd /sys/devices/system/node
> 
>   $ ls -d node*
>   node0  node1  node2  node3
> 
>   $ cat possible
>   0-11

We're split across two threads now, but i'll add this context

I'm basically asking whether there should be 12 nodes possible. It seems
like there should only be 4 nodes possible - 2 for sockets, 2 for host
bridges.

Unless I'm misunderstanding, it should be the case that a given physical
address can only be hosted by 1 numa node (proximity domain).

So it *should* be the case that you either have 4 nodes possible or 10
nodes possible, not 12.  But I could be missing a piece of context.

> Which command do we need for this info specifically?  My output doesn't
> provide some useful info for that.
> 
>   $ acpidump -b
>   $ iasl -d *
>   $ cat cedt.dsl
>       ...
>   **** Unknown ACPI table signature [CEDT]
>

You probably have an old version of acpidump here, you might need to get
a newer version that knows about the CEDT.

You'll also want to get all the Memory Affinity entries from srat.dsl

> Not sure about it.  This must be fixed ASAP because current kernel is
> broken on this issue and the fix should go into hotfix tree first.
>

I agree something is broken, I'm not convinced what is broken.

> If you can think this is just a bandaid, but leaving it bleeding as is
> not the right approach.
>

This affects userland, we shouldn't thrash on this. Lets get it right.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-07 17:51               ` Gregory Price
@ 2025-03-10 12:26                 ` Honggyu Kim
  2025-03-10 14:22                   ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Honggyu Kim @ 2025-03-10 12:26 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

Hi Gregory,

On 3/8/2025 2:51 AM, Gregory Price wrote:
> On Fri, Mar 07, 2025 at 08:46:46PM +0900, Honggyu Kim wrote:
>> You can see more info below.
>>
>>    $ cd /sys/devices/system/node
>>
>>    $ ls -d node*
>>    node0  node1  node2  node3
>>
>>    $ cat possible
>>    0-11
> 
> We're split across two threads now, but i'll add this context
> 
> I'm basically asking whether there should be 12 nodes possible. It seems
> like there should only be 4 nodes possible - 2 for sockets, 2 for host
> bridges.

Ack. If the N_POSSIBLE itself becomes 4, then I agree this problem can simply be 
resolved.

> 
> Unless I'm misunderstanding, it should be the case that a given physical
> address can only be hosted by 1 numa node (proximity domain).

Yeah, the proximity domain detects the node correctly as follows in dmesg.

  [  0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
  [  0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
  [  0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug

It is printed even before CXL detection.

> 
> So it *should* be the case that you either have 4 nodes possible or 10
> nodes possible, not 12.  But I could be missing a piece of context.
> 
>> Which command do we need for this info specifically?  My output doesn't
>> provide some useful info for that.
>>
>>    $ acpidump -b
>>    $ iasl -d *
>>    $ cat cedt.dsl
>>        ...
>>    **** Unknown ACPI table signature [CEDT]
>>
> 
> You probably have an old version of acpidump here, you might need to get
> a newer version that knows about the CEDT.

I just used the newest acpica and was able to dump CEDT properly.  But its
output is also very long so it'd be helpful if you could tell us which specific
info you need.

> 
> You'll also want to get all the Memory Affinity entries from srat.dsl
> 
>> Not sure about it.  This must be fixed ASAP because current kernel is
>> broken on this issue and the fix should go into hotfix tree first.
>>
> 
> I agree something is broken, I'm not convinced what is broken.

Yeah, we should fix the broken status hopefully before v6.14 release.

Thanks,
Honggyu

> 
>> If you can think this is just a bandaid, but leaving it bleeding as is
>> not the right approach.
>>
> 
> This affects userland, we shouldn't thrash on this. Lets get it right.
> 
> ~Gregory



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-10 12:26                 ` Honggyu Kim
@ 2025-03-10 14:22                   ` Gregory Price
  2025-03-11  2:07                     ` Yunjeong Mun
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-03-10 14:22 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun

On Mon, Mar 10, 2025 at 09:26:48PM +0900, Honggyu Kim wrote:
> Yeah, the proximity domain detects the node correctly as follows in dmesg.
> 
>  [  0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
>  [  0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
>  [  0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
>  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
>  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
> 
> It is printed even before CXL detection.
> 

I wrote a some documentation on some example configurations last friday

https://lore.kernel.org/linux-mm/Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F/T/#m2780e47df7f0962a79182502afc99843bb046205

This isn't exhaustive, but as I said with Rakie, I think your
configuration is probably ok - if slightly confusing.

What I'm going to guess is happening is you have 1 CFMWS per device that
do not have matching SRAT entries, and then the CFMWS covering these:

>  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
>  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug

have interleave set up



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-10 14:22                   ` Gregory Price
@ 2025-03-11  2:07                     ` Yunjeong Mun
  2025-03-11  2:42                       ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Yunjeong Mun @ 2025-03-11  2:07 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, yunjeong.mun, Honggyu Kim

Hi Gregory,

On Mon, 10 Mar 2025 10:22:11 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Mon, Mar 10, 2025 at 09:26:48PM +0900, Honggyu Kim wrote:
> > Yeah, the proximity domain detects the node correctly as follows in dmesg.
> > 
> >  [  0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
> >  [  0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
> >  [  0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
> >  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
> >  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
> > 
> > It is printed even before CXL detection.
> > 
> 
> I wrote a some documentation on some example configurations last friday
> 
> https://lore.kernel.org/linux-mm/Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F/T/#m2780e47df7f0962a79182502afc99843bb046205
> 

I've checked our CFMWS configurations using the information from this link and 
wrote them below.

> This isn't exhaustive, but as I said with Rakie, I think your
> configuration is probably ok - if slightly confusing.
> 

I wonder if the below data is sufficient for your review in Rakies's
email [1]. Please let me know.

> What I'm going to guess is happening is you have 1 CFMWS per device that
> do not have matching SRAT entries, and then the CFMWS covering these:
> 
> >  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
> >  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
> 
> have interleave set up
> 

In my understanding, both SRAT and CFMWS have the above device and interleave setup.

Below are the SRAT entries (with some unnecessary lines removed) :

[A128h 41256 001h]               Subtable Type : 01 [Memory Affinity]  
[A12Ah 41258 004h]            Proximity Domain : 00000006  
[A130h 41264 008h]                Base Address : 0000002080000000
[A138h 41272 008h]              Address Length : 0000006000000000  
                                     Enabled : 1  
                               Hot Pluggable : 1  
...
  
[A150h 41296 001h]               Subtable Type : 01 [Memory Affinity]  
[A152h 41298 004h]            Proximity Domain : 00000007  
[A158h 41304 008h]                Base Address : 0000064F80000000
[A160h 41312 008h]              Address Length : 0000008000000000  
                                     Enabled : 1  
                               Hot Pluggable : 1  

and below are the CFMWS configurations (with some unnecessary lines removed):

[0A4h 0164 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0ACh 0172 008h]         Window base address : 0000002080000000 <- Memory region
[0B4h 0180 008h]                 Window size : 0000032780000000
[0BCh 0188 001h]          Interleave Members : 01           <-- 2-way interleave
[0BDh 0189 001h]       Interleave Arithmetic : 01
[0C8h 0200 004h]                First Target : 00000043     <-- host bridge id  
[0CCh 0204 004h]                 Next Target : 00000053     <-- host bridge id 

...

[170h 0368 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[178h 0376 008h]         Window base address : 0000064F80000000
[180h 0384 008h]                 Window size : 0000033000000000
[188h 0392 001h]          Interleave Members : 01          <-- 2-way interleave
[189h 0393 001h]       Interleave Arithmetic : 01
[194h 0404 004h]                First Target : 00000143    <-- host bridge id
[198h 0408 004h]                 Next Target : 00000153    <-- host bridge id

[1] https://lore.kernel.org/linux-mm/Z87zpg3TLRReikgu@gourry-fedora-PF4VCD3F/

Best regards,
Yunjeong


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-11  2:07                     ` Yunjeong Mun
@ 2025-03-11  2:42                       ` Gregory Price
  2025-03-11  4:02                         ` Yunjeong Mun
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-03-11  2:42 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, Honggyu Kim

On Tue, Mar 11, 2025 at 11:07:27AM +0900, Yunjeong Mun wrote:
> Hi Gregory,
> 
> In my understanding, both SRAT and CFMWS have the above device and interleave setup.
> 
> and below are the CFMWS configurations (with some unnecessary lines removed):
> 
> [0A4h 0164 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
> [0ACh 0172 008h]         Window base address : 0000002080000000 <- Memory region
> [0B4h 0180 008h]                 Window size : 0000032780000000
> [0BCh 0188 001h]          Interleave Members : 01           <-- 2-way interleave
> [0BDh 0189 001h]       Interleave Arithmetic : 01
> [0C8h 0200 004h]                First Target : 00000043     <-- host bridge id  
> [0CCh 0204 004h]                 Next Target : 00000053     <-- host bridge id 
> 
> ...
> 
> [170h 0368 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
> [178h 0376 008h]         Window base address : 0000064F80000000
> [180h 0384 008h]                 Window size : 0000033000000000
> [188h 0392 001h]          Interleave Members : 01          <-- 2-way interleave
> [189h 0393 001h]       Interleave Arithmetic : 01
> [194h 0404 004h]                First Target : 00000143    <-- host bridge id
> [198h 0408 004h]                 Next Target : 00000153    <-- host bridge id
>

Are you able to share all CXL Fixed Memory Window Structures in the
CEDT?  Just want to confirm some suspicions here about why we're seeing
12 NUMA nodes.  This explains 2 and suggests there's at least 4 host
bridges - but not the other 8.

~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-11  2:42                       ` Gregory Price
@ 2025-03-11  4:02                         ` Yunjeong Mun
  2025-03-11  4:42                           ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Yunjeong Mun @ 2025-03-11  4:02 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, Honggyu Kim

Hi Gregory,

On Mon, 10 Mar 2025 22:42:52 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Tue, Mar 11, 2025 at 11:07:27AM +0900, Yunjeong Mun wrote:
> > Hi Gregory,
> > 
> > In my understanding, both SRAT and CFMWS have the above device and interleave setup.
> > 
> > and below are the CFMWS configurations (with some unnecessary lines removed):
> > 
> > [0A4h 0164 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
> > [0ACh 0172 008h]         Window base address : 0000002080000000 <- Memory region
> > [0B4h 0180 008h]                 Window size : 0000032780000000
> > [0BCh 0188 001h]          Interleave Members : 01           <-- 2-way interleave
> > [0BDh 0189 001h]       Interleave Arithmetic : 01
> > [0C8h 0200 004h]                First Target : 00000043     <-- host bridge id  
> > [0CCh 0204 004h]                 Next Target : 00000053     <-- host bridge id 
> > 
> > ...
> > 
> > [170h 0368 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
> > [178h 0376 008h]         Window base address : 0000064F80000000
> > [180h 0384 008h]                 Window size : 0000033000000000
> > [188h 0392 001h]          Interleave Members : 01          <-- 2-way interleave
> > [189h 0393 001h]       Interleave Arithmetic : 01
> > [194h 0404 004h]                First Target : 00000143    <-- host bridge id
> > [198h 0408 004h]                 Next Target : 00000153    <-- host bridge id
> >
> 
> Are you able to share all CXL Fixed Memory Window Structures in the
> CEDT?  Just want to confirm some suspicions here about why we're seeing
> 12 NUMA nodes.  This explains 2 and suggests there's at least 4 host
> bridges - but not the other 8.
>

Ok, there are 10 CFMWS entries and 4 host bridges in cedt.dsl.
Below is the information for all CFMWS entries and host bridge in the CEDT:

[024h 0036 001h]               Subtable Type : 00 [CXL Host Bridge Structure]
[028h 0040 004h]      Associated host bridge : 00000043

[044h 0068 001h]               Subtable Type : 00 [CXL Host Bridge Structure]
[048h 0072 004h]      Associated host bridge : 00000053

[064h 0100 001h]               Subtable Type : 00 [CXL Host Bridge Structure]
[068h 0104 004h]      Associated host bridge : 00000143

[084h 0132 001h]               Subtable Type : 00 [CXL Host Bridge Structure]
[088h 0136 004h]      Associated host bridge : 00000153
 
[0A4h 0164 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0ACh 0172 008h]         Window base address : 0000002080000000
[0B4h 0180 008h]                 Window size : 0000032780000000
[0BCh 0188 001h]          Interleave Members : 01
[0BDh 0189 001h]       Interleave Arithmetic : 01
[0C8h 0200 004h]                First Target : 00000043
[0CCh 0204 004h]                 Next Target : 00000053

[0D0h 0208 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0D8h 0216 008h]         Window base address : 0000034800000000
[0E0h 0224 008h]                 Window size : 00000163C0000000
[0E8h 0232 001h]          Interleave Members : 00
[0F4h 0244 004h]                First Target : 00000043

[0F8h 0248 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[100h 0256 008h]         Window base address : 000004ABC0000000
[108h 0264 008h]                 Window size : 00000163C0000000
[110h 0272 001h]          Interleave Members : 00
[111h 0273 001h]       Interleave Arithmetic : 00
[11Ch 0284 004h]                First Target : 00000053

[120h 0288 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[128h 0296 008h]         Window base address : 00000C2F80000000
[130h 0304 008h]                 Window size : 000002F840000000
[138h 0312 001h]          Interleave Members : 00
[139h 0313 001h]       Interleave Arithmetic : 00
[144h 0324 004h]                First Target : 00000043


[148h 0328 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[150h 0336 008h]         Window base address : 00000F27C0000000
[158h 0344 008h]                 Window size : 000002F840000000
[160h 0352 001h]          Interleave Members : 00
[161h 0353 001h]       Interleave Arithmetic : 00
[16Ch 0364 004h]                First Target : 00000053

[170h 0368 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[178h 0376 008h]         Window base address : 0000064F80000000
[180h 0384 008h]                 Window size : 0000033000000000
[188h 0392 001h]          Interleave Members : 01
[189h 0393 001h]       Interleave Arithmetic : 01
[194h 0404 004h]                First Target : 00000143
[198h 0408 004h]                 Next Target : 00000153

[19Ch 0412 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[1A4h 0420 008h]         Window base address : 0000097F80000000
[1ACh 0428 008h]                 Window size : 0000015800000000
[1B4h 0436 001h]          Interleave Members : 00
[1B5h 0437 001h]       Interleave Arithmetic : 00
[1C0h 0448 004h]                First Target : 00000143

[1C4h 0452 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[1CCh 0460 008h]         Window base address : 00000AD780000000
[1D4h 0468 008h]                 Window size : 0000015800000000
[1DCh 0476 001h]         Interleave Members : 00 
[1DDh 0477 001h]       Interleave Arithmetic : 00 
[1E8h 0488 004h]                First Target : 00000153

[1ECh 0492 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[1F4h 0500 008h]         Window base address : 0000122000000000
[1FCh 0508 008h]                 Window size : 000002F000000000
[204h 0516 001h]          Interleave Members : 00
[205h 0517 001h]       Interleave Arithmetic : 00
[210h 0528 004h]                First Target : 00000143

[214h 0532 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[21Ch 0540 008h]         Window base address : 0000151000000000
[224h 0548 008h]                 Window size : 000002F000000000
[22Ch 0556 001h]          Interleave Members : 00
[22Dh 0557 001h]       Interleave Arithmetic : 00
[238h 0568 004h]                First Target : 00000153

In my understanding, the reason we are seeing 12 NUMA node is because
it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
in the code [1]  below:

        for_each_node_state(nid, N_POSSIBLE) {
                err = add_weight_node(nid, wi_kobj);
                if (err) {
                        pr_err("failed to add sysfs [node%d]\n", nid);
                        break;
                }
        }

Since I'm not familiar with the CEDT, I'm wondering if you mean that N_POSSIBLE 
is associated with the number of CFMWS.

[1] https://github.com/torvalds/linux/blob/v6.14-rc6/mm/mempolicy.c#L3508-L3514

> ~Gregory
> 

Best regrads,
Yunjeong


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-11  4:02                         ` Yunjeong Mun
@ 2025-03-11  4:42                           ` Gregory Price
  2025-03-11  9:51                             ` Yunjeong Mun
  2025-03-18  8:02                             ` Yunjeong Mun
  0 siblings, 2 replies; 34+ messages in thread
From: Gregory Price @ 2025-03-11  4:42 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, Honggyu Kim

On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:

forenote - Hi Andrew, please hold off on the auto-configuration patch
for now, the sk group has identified a hotplug issue we need to work out
and we'll likely need to merge these two patch set together.  I really
appreciate your patience with this feature.

> Hi Gregory,
>
> In my understanding, the reason we are seeing 12 NUMA node is because
> it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
> in the code [1]  below:
> 
... snip ...

Appreciated, so yes this confirms what i thought was going on.  There's
4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
socket that is intended to interleave across the host bridges.

As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
CFMWS that doesn't have a matching SRAT entry (with a known corner case
for a missing SRAT which doesn't apply here).

So essentialy what the system is doing is marking that it's absolutely
possible to create 1 region per device and also 1 region that
interleaves across host each pair of host bridges (I presume this is a
dual socket system?).

So, tl;dr: All these nodes are valid and this configuration is correct.

Weighted interleave presently works fine as intended, but with the
inclusion of the auto-configuration, there will be issues for your
system configuration. This means we probably need to consider
merging these as a group.

During boot, the following will occur

1) drivers/acpi/numa/srat.c marks 12 nodes as possible
   0-1) Socket nodes
   2-3) Cross-host-bridge interleave nodes
   4-11) single region nodes

2) drivers/cxl/* will probe the various devices and create
   a root decoder for each CXL Fixed Memory Window
   decoder0.0 - decoder11.0  (or maybe decoder0.0 - decoder0.11)

3) during probe auto-configuration of wieghted interleave occurs as a
   result of this code being called with hmat or cdat data:

void node_set_perf_attrs() {
...
	/* When setting CPU access coordinates, update mempolicy */
	if (access == ACCESS_COORDINATE_CPU) {
		if (mempolicy_set_node_perf(nid, coord)) {
			pr_info("failed to set mempolicy attrs for node %d\n",
				nid);
		}
	}
...
}

under the current system, since we calculate with N_POSSIBLE, all nodes
will be assigned weights (assuming HMAT or CDAT data is available for
all of them).

We actually have a few issues here

1) If all nodes are included in the weighting reduction, we're actually
   over-representing a particular set of hardware.  The interleave node
   and the individual device nodes would actually over-represent the
   bandwidth available (comparative to the CPU nodes).

2) As stated on this patch line, just switching to N_MEMORY causes
   issues with hotplug - where the bandwidth can be reported, but if
   memory hasn't been added yet then we'll end up with wrong weights
   because it wasn't included in the calculation.

3) However, not exposing the nodes because N_MEMORY isn't set yet
   a) prevents pre-configuration before memory is onlined, and
   b) hides the implications of hotplugging memory into a node from the
      user (adding memory causes a re-weight and may affect an
      interleave-all configuration).

but - i think it's reasonable that anyone using weighted-interleave is
*probably* not going to have nodes come and go.  It just seems like a
corner case that isn't reasonable to spend time supporting.

So coming back around to the hotplug patch line, I do think it's
reasonable hide nodes marked !N_MEMORY, but consider two issues:

1) In auto mode, we need to re-weight on hotplug to only include
   onlined nodes.  This is because the reduction may be sensitive
   to the available bandwidth changes.

   This behavior needs to be clearly documented.

2) We need to clearly define what the weight of a node will be when
   in manual mode and a node goes (memory -> no memory -> memory)
   a) does it retain it's old, manually set weight?
   b) does it revert to 1?

Sorry for the long email, just working through all the implications.

I think the proposed hotplug patch is a requirement for the
auto-configuration patch set.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-11  4:42                           ` Gregory Price
@ 2025-03-11  9:51                             ` Yunjeong Mun
  2025-03-11 15:52                               ` Gregory Price
  2025-03-18  8:02                             ` Yunjeong Mun
  1 sibling, 1 reply; 34+ messages in thread
From: Yunjeong Mun @ 2025-03-11  9:51 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, Honggyu Kim

Hi Gregory, thanks for kind explanation.

On Tue, 11 Mar 2025 00:42:49 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:
> 
> forenote - Hi Andrew, please hold off on the auto-configuration patch
> for now, the sk group has identified a hotplug issue we need to work out
> and we'll likely need to merge these two patch set together.  I really
> appreciate your patience with this feature.
> 
> > Hi Gregory,
> >
> > In my understanding, the reason we are seeing 12 NUMA node is because
> > it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
> > in the code [1]  below:
> > 
> ... snip ...
> 
> Appreciated, so yes this confirms what i thought was going on.  There's
> 4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
> socket that is intended to interleave across the host bridges.
> 
> As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
> node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
> CFMWS that doesn't have a matching SRAT entry (with a known corner case
> for a missing SRAT which doesn't apply here).
> 
> So essentialy what the system is doing is marking that it's absolutely
> possible to create 1 region per device and also 1 region that
> interleaves across host each pair of host bridges (I presume this is a
> dual socket system?).

Correct, it is a dual socket system. Thank you for the detailed explanation.
It has been helpful for analyzing the code.
> 
> So, tl;dr: All these nodes are valid and this configuration is correct.
> 
> Weighted interleave presently works fine as intended, but with the
> inclusion of the auto-configuration, there will be issues for your
> system configuration. This means we probably need to consider
> merging these as a group.
> 

We believe our propsed hot plug patch should be added as a hot-fix to v6.14, 
because it can be addressed independently of auto-configuring feature.
Rakie will send v2 patch soon.

> During boot, the following will occur
> 
> 1) drivers/acpi/numa/srat.c marks 12 nodes as possible
>    0-1) Socket nodes
>    2-3) Cross-host-bridge interleave nodes
>    4-11) single region nodes
> 
> 2) drivers/cxl/* will probe the various devices and create
>    a root decoder for each CXL Fixed Memory Window
>    decoder0.0 - decoder11.0  (or maybe decoder0.0 - decoder0.11)
> 
> 3) during probe auto-configuration of wieghted interleave occurs as a
>    result of this code being called with hmat or cdat data:
> 
> void node_set_perf_attrs() {
> ...
> 	/* When setting CPU access coordinates, update mempolicy */
> 	if (access == ACCESS_COORDINATE_CPU) {
> 		if (mempolicy_set_node_perf(nid, coord)) {
> 			pr_info("failed to set mempolicy attrs for node %d\n",
> 				nid);
> 		}
> 	}
> ...
> }
> 
> under the current system, since we calculate with N_POSSIBLE, all nodes
> will be assigned weights (assuming HMAT or CDAT data is available for
> all of them).
> 
> We actually have a few issues here
> 
> 1) If all nodes are included in the weighting reduction, we're actually
>    over-representing a particular set of hardware.  The interleave node
>    and the individual device nodes would actually over-represent the
>    bandwidth available (comparative to the CPU nodes).
> 
> 2) As stated on this patch line, just switching to N_MEMORY causes
>    issues with hotplug - where the bandwidth can be reported, but if
>    memory hasn't been added yet then we'll end up with wrong weights
>    because it wasn't included in the calculation.
> 
> 3) However, not exposing the nodes because N_MEMORY isn't set yet
>    a) prevents pre-configuration before memory is onlined, and
>    b) hides the implications of hotplugging memory into a node from the
>       user (adding memory causes a re-weight and may affect an
>       interleave-all configuration).
> 
> but - i think it's reasonable that anyone using weighted-interleave is
> *probably* not going to have nodes come and go.  It just seems like a
> corner case that isn't reasonable to spend time supporting.
> 
> So coming back around to the hotplug patch line, I do think it's
> reasonable hide nodes marked !N_MEMORY, but consider two issues:
> 
> 1) In auto mode, we need to re-weight on hotplug to only include
>    onlined nodes.  This is because the reduction may be sensitive
>    to the available bandwidth changes.
> 
>    This behavior needs to be clearly documented.
> 
> 2) We need to clearly define what the weight of a node will be when
>    in manual mode and a node goes (memory -> no memory -> memory)
>    a) does it retain it's old, manually set weight?
>    b) does it revert to 1?
> 
> Sorry for the long email, just working through all the implications.
> 
> I think the proposed hotplug patch is a requirement for the
> auto-configuration patch set.
> 
> ~Gregory
> 
Best regards,
Yunjeong


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-11  9:51                             ` Yunjeong Mun
@ 2025-03-11 15:52                               ` Gregory Price
  0 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-03-11 15:52 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, Honggyu Kim

On Tue, Mar 11, 2025 at 06:51:45PM +0900, Yunjeong Mun wrote:
> Hi Gregory, thanks for kind explanation.
> 
> We believe our propsed hot plug patch should be added as a hot-fix to v6.14, 
> because it can be addressed independently of auto-configuring feature.
> Rakie will send v2 patch soon.
> 

I think having the answer to at least the following question documented
is needed.  I'll wait for v2 to do another review.

> > 2) We need to clearly define what the weight of a node will be when
> >    in manual mode and a node goes (memory -> no memory -> memory)
> >    a) does it retain it's old, manually set weight?
> >    b) does it revert to 1?
> > 

~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-11  4:42                           ` Gregory Price
  2025-03-11  9:51                             ` Yunjeong Mun
@ 2025-03-18  8:02                             ` Yunjeong Mun
  2025-03-18 11:02                               ` Honggyu Kim
  1 sibling, 1 reply; 34+ messages in thread
From: Yunjeong Mun @ 2025-03-18  8:02 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, Honggyu Kim

Hi Gregory, I have one more question below.

On Tue, 11 Mar 2025 00:42:49 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:
> 
> forenote - Hi Andrew, please hold off on the auto-configuration patch
> for now, the sk group has identified a hotplug issue we need to work out
> and we'll likely need to merge these two patch set together.  I really
> appreciate your patience with this feature.
> 
> > Hi Gregory,
> >
> > In my understanding, the reason we are seeing 12 NUMA node is because
> > it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
> > in the code [1]  below:
> > 
> ... snip ...
> 
> Appreciated, so yes this confirms what i thought was going on.  There's
> 4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
> socket that is intended to interleave across the host bridges.
> 

Thanks for confirm. Honggyu represented it as a tree sturcture:
rootport/ 
├── socket0
│   ├── cross-host-bridge0 -> SRAT && CEDT (interleave on) --> NODE 2
│   │   ├── host-bridge0 -> CEDT
│   │   │   ├── cxl0 -> CEDT
│   │   │   └── cxl1-> CEDT
│   │   └── host-bridge1 -> CEDT
│   │       ├── cxl2 -> CEDT
│   │       └── cxl3 -> CEDT
│   └── dram0 -> SRAT ---------------------------------------> NODE 0
└── socket1
    ├── cross-host-bridge1 -> SRAT && CEDT (interleave on)---> NODE 3
    │   ├── host-bridge2 -> CEDT
    │   │   ├── cxl4 -> CEDT
    │   │   └── cxl5 -> CEDT
    │   └── host-bridge3 -> CEDT
    │       ├── cxl6 -> CEDT
    │       └── cxl7 -> CEDT
    └── dram1 -> SRAT ---------------------------------------> NODE 1

> As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
> node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
> CFMWS that doesn't have a matching SRAT entry (with a known corner case
> for a missing SRAT which doesn't apply here).
> 
> So essentialy what the system is doing is marking that it's absolutely
> possible to create 1 region per device and also 1 region that
> interleaves across host each pair of host bridges (I presume this is a
> dual socket system?).
> 
> So, tl;dr: All these nodes are valid and this configuration is correct.

I am wondering if all 12 nodes specifed as 'possible' is indeed correct.
The definiton of 'possible' is:
 - 'Nodes that could be possibly become online at some point'. 
IMHO, it seems like there should only be 4 nodes specified as 'possible'.

> 
> Weighted interleave presently works fine as intended, but with the
> inclusion of the auto-configuration, there will be issues for your
> system configuration. This means we probably need to consider
> merging these as a group.
> 
> During boot, the following will occur
> 
> 1) drivers/acpi/numa/srat.c marks 12 nodes as possible
>    0-1) Socket nodes
>    2-3) Cross-host-bridge interleave nodes
>    4-11) single region nodes
> 
> 2) drivers/cxl/* will probe the various devices and create
>    a root decoder for each CXL Fixed Memory Window
>    decoder0.0 - decoder11.0  (or maybe decoder0.0 - decoder0.11)
> 
> 3) during probe auto-configuration of wieghted interleave occurs as a
>    result of this code being called with hmat or cdat data:
> 
> void node_set_perf_attrs() {
> ...
> 	/* When setting CPU access coordinates, update mempolicy */
> 	if (access == ACCESS_COORDINATE_CPU) {
> 		if (mempolicy_set_node_perf(nid, coord)) {
> 			pr_info("failed to set mempolicy attrs for node %d\n",
> 				nid);
> 		}
> 	}
> ...
> }
> 
> under the current system, since we calculate with N_POSSIBLE, all nodes
> will be assigned weights (assuming HMAT or CDAT data is available for
> all of them).
> 
> We actually have a few issues here
> 
> 1) If all nodes are included in the weighting reduction, we're actually
>    over-representing a particular set of hardware.  The interleave node
>    and the individual device nodes would actually over-represent the
>    bandwidth available (comparative to the CPU nodes).
> 
> 2) As stated on this patch line, just switching to N_MEMORY causes
>    issues with hotplug - where the bandwidth can be reported, but if
>    memory hasn't been added yet then we'll end up with wrong weights
>    because it wasn't included in the calculation.
> 
> 3) However, not exposing the nodes because N_MEMORY isn't set yet
>    a) prevents pre-configuration before memory is onlined, and
>    b) hides the implications of hotplugging memory into a node from the
>       user (adding memory causes a re-weight and may affect an
>       interleave-all configuration).
> 
> but - i think it's reasonable that anyone using weighted-interleave is
> *probably* not going to have nodes come and go.  It just seems like a
> corner case that isn't reasonable to spend time supporting.
> 
> So coming back around to the hotplug patch line, I do think it's
> reasonable hide nodes marked !N_MEMORY, but consider two issues:
> 
> 1) In auto mode, we need to re-weight on hotplug to only include
>    onlined nodes.  This is because the reduction may be sensitive
>    to the available bandwidth changes.
> 
>    This behavior needs to be clearly documented.
> 
> 2) We need to clearly define what the weight of a node will be when
>    in manual mode and a node goes (memory -> no memory -> memory)
>    a) does it retain it's old, manually set weight?
>    b) does it revert to 1?
> 
> Sorry for the long email, just working through all the implications.
> 
> I think the proposed hotplug patch is a requirement for the
> auto-configuration patch set.
> 
> ~Gregory
> 

Best regards,
Yunjeong


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-18  8:02                             ` Yunjeong Mun
@ 2025-03-18 11:02                               ` Honggyu Kim
  2025-03-18 15:13                                 ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Honggyu Kim @ 2025-03-18 11:02 UTC (permalink / raw)
  To: Yunjeong Mun, Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team



On 3/18/2025 5:02 PM, Yunjeong Mun wrote:
> Hi Gregory, I have one more question below.
> 
> On Tue, 11 Mar 2025 00:42:49 -0400 Gregory Price <gourry@gourry.net> wrote:
>> On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:
>>
>> forenote - Hi Andrew, please hold off on the auto-configuration patch
>> for now, the sk group has identified a hotplug issue we need to work out
>> and we'll likely need to merge these two patch set together.  I really
>> appreciate your patience with this feature.
>>
>>> Hi Gregory,
>>>
>>> In my understanding, the reason we are seeing 12 NUMA node is because
>>> it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
>>> in the code [1]  below:
>>>
>> ... snip ...
>>
>> Appreciated, so yes this confirms what i thought was going on.  There's
>> 4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
>> socket that is intended to interleave across the host bridges.
>>
> 
> Thanks for confirm. Honggyu represented it as a tree sturcture:
> rootport/
> ├── socket0
> │   ├── cross-host-bridge0 -> SRAT && CEDT (interleave on) --> NODE 2
> │   │   ├── host-bridge0 -> CEDT
> │   │   │   ├── cxl0 -> CEDT
> │   │   │   └── cxl1-> CEDT
> │   │   └── host-bridge1 -> CEDT
> │   │       ├── cxl2 -> CEDT
> │   │       └── cxl3 -> CEDT
> │   └── dram0 -> SRAT ---------------------------------------> NODE 0
> └── socket1
>      ├── cross-host-bridge1 -> SRAT && CEDT (interleave on)---> NODE 3
>      │   ├── host-bridge2 -> CEDT
>      │   │   ├── cxl4 -> CEDT
>      │   │   └── cxl5 -> CEDT
>      │   └── host-bridge3 -> CEDT
>      │       ├── cxl6 -> CEDT
>      │       └── cxl7 -> CEDT
>      └── dram1 -> SRAT ---------------------------------------> NODE 1

Some simple corrections here.  host-bridge{0-3} above aren't detected from CEDT.
The corrected structure is as follows.

rootport/
├── socket0
│   ├── cross-host-bridge0 -> SRAT && CEDT (interleave on) --> NODE 2
│   │   ├── host-bridge0
│   │   │   ├── cxl0 -> CEDT
│   │   │   └── cxl1-> CEDT
│   │   └── host-bridge1
│   │       ├── cxl2 -> CEDT
│   │       └── cxl3 -> CEDT
│   └── dram0 -> SRAT ---------------------------------------> NODE 0
└── socket1
      ├── cross-host-bridge1 -> SRAT && CEDT (interleave on)---> NODE 3
      │   ├── host-bridge2
      │   │   ├── cxl4 -> CEDT
      │   │   └── cxl5 -> CEDT
      │   └── host-bridge3
      │       ├── cxl6 -> CEDT
      │       └── cxl7 -> CEDT
      └── dram1 -> SRAT ---------------------------------------> NODE 1

Thanks,
Honggyu


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-18 11:02                               ` Honggyu Kim
@ 2025-03-18 15:13                                 ` Gregory Price
  2025-03-19  9:56                                   ` Yunjeong Mun
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-03-18 15:13 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: Yunjeong Mun, kernel_team, Joshua Hahn, harry.yoo, ying.huang,
	gregkh, rakie.kim, akpm, rafael, lenb, dan.j.williams,
	Jonathan.Cameron, dave.jiang, horen.chuang, hannes, linux-kernel,
	linux-acpi, linux-mm, kernel-team

On Tue, Mar 18, 2025 at 08:02:46PM +0900, Honggyu Kim wrote:
> 
> 
> On 3/18/2025 5:02 PM, Yunjeong Mun wrote:
> 
> Some simple corrections here.  host-bridge{0-3} above aren't detected from CEDT.
> The corrected structure is as follows.
> 
> rootport/
> ├── socket0
> │   ├── cross-host-bridge0 -> SRAT && CEDT (interleave on) --> NODE 2
> │   │   ├── host-bridge0
> │   │   │   ├── cxl0 -> CEDT
                          node 4
> │   │   │   └── cxl1-> CEDT
                          node 5
> │   │   └── host-bridge1
> │   │       ├── cxl2 -> CEDT
                          node 6
> │   │       └── cxl3 -> CEDT
                          node 7
> │   └── dram0 -> SRAT ---------------------------------------> NODE 0
> └── socket1
>      ├── cross-host-bridge1 -> SRAT && CEDT (interleave on)---> NODE 3
>      │   ├── host-bridge2
>      │   │   ├── cxl4 -> CEDT
                           node 8
>      │   │   └── cxl5 -> CEDT
                           node 9
>      │   └── host-bridge3
>      │       ├── cxl6 -> CEDT
                           node 10
>      │       └── cxl7 -> CEDT
                           node 11
>      └── dram1 -> SRAT ---------------------------------------> NODE 1
> 

This is correct and expected.

All of these nodes are "possible" depending on how the user decides to
program the CXL decoders and expose memory to the page allocator.

In your /sys/bus/cxl/devices/ you should have something like

  decoder0.0  decoder0.1  decoder0.2   decoder0.3
  decoder0.4  decoder0.5  decoder0.6   decoder0.7
  decoder0.8  decoder0.9

These are the root decoders that should map up directly with each CEDT
CFMWS entry.

2 of them should have interleave settings.

If you were to then program the endpoint and hostbridge decoders with
the matching non-interleave address values from the other CEDT entries,
you could bring each individual device online in its own NUMA node.

Or, you can do what you're doing now, and program the endpoints to map
to the 2 cross-host bridge interleave root decoders.

So your platform is giving you the option of how to online your devices,
and as such it needs to mark nodes as "possible" even if they're unused.

~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-18 15:13                                 ` Gregory Price
@ 2025-03-19  9:56                                   ` Yunjeong Mun
  2025-03-19 14:54                                     ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Yunjeong Mun @ 2025-03-19  9:56 UTC (permalink / raw)
  To: Gregory Price
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, Honggyu Kim

Hi Gregory, thanks for your kind explanation.

On Tue, 18 Mar 2025 11:13:13 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Tue, Mar 18, 2025 at 08:02:46PM +0900, Honggyu Kim wrote:
> > 
> > 
> > On 3/18/2025 5:02 PM, Yunjeong Mun wrote:
> > 
> > Some simple corrections here.  host-bridge{0-3} above aren't detected from CEDT.
> > The corrected structure is as follows.
> > 
> > rootport/
> > ├── socket0
> > │   ├── cross-host-bridge0 -> SRAT && CEDT (interleave on) --> NODE 2
> > │   │   ├── host-bridge0
> > │   │   │   ├── cxl0 -> CEDT
>                           node 4
> > │   │   │   └── cxl1-> CEDT
>                           node 5
> > │   │   └── host-bridge1
> > │   │       ├── cxl2 -> CEDT
>                           node 6
> > │   │       └── cxl3 -> CEDT
>                           node 7
> > │   └── dram0 -> SRAT ---------------------------------------> NODE 0
> > └── socket1
> >      ├── cross-host-bridge1 -> SRAT && CEDT (interleave on)---> NODE 3
> >      │   ├── host-bridge2
> >      │   │   ├── cxl4 -> CEDT
>                            node 8
> >      │   │   └── cxl5 -> CEDT
>                            node 9
> >      │   └── host-bridge3
> >      │       ├── cxl6 -> CEDT
>                            node 10
> >      │       └── cxl7 -> CEDT
>                            node 11
> >      └── dram1 -> SRAT ---------------------------------------> NODE 1
> > 
> 
> This is correct and expected.
> 
> All of these nodes are "possible" depending on how the user decides to
> program the CXL decoders and expose memory to the page allocator.
> 
> In your /sys/bus/cxl/devices/ you should have something like
> 
>   decoder0.0  decoder0.1  decoder0.2   decoder0.3
>   decoder0.4  decoder0.5  decoder0.6   decoder0.7
>   decoder0.8  decoder0.9
> 

Yes, I can see many decoder#.# files in there, and their devtype values are
shown below:
    $ cat /sys/bus/cxl/devices/decoder*/devtype
    cxl_decoder_root
    ...
    cxl_decoder_switch
    ...
    cxl_decoder_endpoint

> These are the root decoders that should map up directly with each CEDT
> CFMWS entry.
> 
> 2 of them should have interleave settings.
> 
> If you were to then program the endpoint and hostbridge decoders with
> the matching non-interleave address values from the other CEDT entries,
> you could bring each individual device online in its own NUMA node.
> 

I think this means that I can program the endpoint(=cxl_decoder_endpoint)
to map to the 8 CFMWS, and the hostbridge decoder (=cxl_decoder switch) to map
to another 2 CFMWS(cross-host bridge).

> Or, you can do what you're doing now, and program the endpoints to map
> to the 2 cross-host bridge interleave root decoders.

In my understanding, that kind of programming is done at the firmware or BIOS 
layer, right?

> 
> So your platform is giving you the option of how to online your devices,
> and as such it needs to mark nodes as "possible" even if they're unused.
> 

Thank you for the clear explanation. I now understand why 'possible' has such
value.

> ~Gregory
> 

Best regards,
Yunjeong


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
  2025-03-19  9:56                                   ` Yunjeong Mun
@ 2025-03-19 14:54                                     ` Gregory Price
  0 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-03-19 14:54 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: kernel_team, Joshua Hahn, harry.yoo, ying.huang, gregkh,
	rakie.kim, akpm, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, linux-kernel, linux-acpi,
	linux-mm, kernel-team, Honggyu Kim

On Wed, Mar 19, 2025 at 06:56:21PM +0900, Yunjeong Mun wrote:
> > These are the root decoders that should map up directly with each CEDT
> > CFMWS entry.
> > 
> > 2 of them should have interleave settings.
> > 
> > If you were to then program the endpoint and hostbridge decoders with
> > the matching non-interleave address values from the other CEDT entries,
> > you could bring each individual device online in its own NUMA node.
> > 
> 
> I think this means that I can program the endpoint(=cxl_decoder_endpoint)
> to map to the 8 CFMWS, and the hostbridge decoder (=cxl_decoder switch) to map
> to another 2 CFMWS(cross-host bridge).
> 

you can program any given endpoint decoders to map to either per-device
CFMWS, or to the interleave CFMWS - decoders are flexible in this way.
The existince of both sets gives you the option of how to program the
topology - it's up to you.

~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-03-19 14:54 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20250228001631.1102-1-yunjeong.mun@sk.com>
2025-02-26 21:35 ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
2025-02-26 21:35   ` [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes Joshua Hahn
2025-02-27  2:32     ` Honggyu Kim
2025-02-27  3:20       ` Honggyu Kim
2025-03-03 21:56         ` Joshua Hahn
2025-03-04 12:53           ` Honggyu Kim
2025-03-03 16:19       ` Gregory Price
2025-03-04 13:03         ` Honggyu Kim
2025-03-04 16:16           ` Gregory Price
2025-03-04 16:29       ` Gregory Price
2025-03-06 12:39         ` Honggyu Kim
2025-03-06 17:32           ` Gregory Price
2025-03-07 11:46             ` Honggyu Kim
2025-03-07 17:51               ` Gregory Price
2025-03-10 12:26                 ` Honggyu Kim
2025-03-10 14:22                   ` Gregory Price
2025-03-11  2:07                     ` Yunjeong Mun
2025-03-11  2:42                       ` Gregory Price
2025-03-11  4:02                         ` Yunjeong Mun
2025-03-11  4:42                           ` Gregory Price
2025-03-11  9:51                             ` Yunjeong Mun
2025-03-11 15:52                               ` Gregory Price
2025-03-18  8:02                             ` Yunjeong Mun
2025-03-18 11:02                               ` Honggyu Kim
2025-03-18 15:13                                 ` Gregory Price
2025-03-19  9:56                                   ` Yunjeong Mun
2025-03-19 14:54                                     ` Gregory Price
2025-02-28  0:16   ` [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning yunjeong.mun
2025-02-28  6:39   ` Yunjeong Mun
2025-02-28 16:24     ` Joshua Hahn
2025-03-04 21:56     ` Joshua Hahn
2025-03-04 22:22       ` Joshua Hahn
2025-03-05  9:49         ` Yunjeong Mun
2025-03-05 16:28           ` Joshua Hahn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox