* [PATCH v2 0/3] Implement numa node notifier
@ 2025-04-08 8:41 Oscar Salvador
2025-04-08 8:41 ` [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Oscar Salvador @ 2025-04-08 8:41 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, linux-mm, linux-kernel, Vlastimil Babka,
Harry Yoo, Jonathan Cameron, linux-cxl, Oscar Salvador
v1 -> v2:
- Remove status_change_nid_normal and the code that
deals with it (David & Vlastimil)
- Remove slab_mem_offline_callback (David & Vlastimil)
- Change the order of canceling the notifiers
in {online,offline}_pages (Vlastimil)
- Fix up a couple of whitespaces (Jonathan Cameron)
- Add RBs-by
The memory notifier is a tool that allows consumers to get notified whenever
memory gets onlined or offlined in the system.
Currently, there are 10 consumers of it, but 5 out of those 10 are only
interested in getting notified when a numa node changes its memory state,
that is, when it goes from memoryless to memory-aware or vice versa.
As things stand, they get notified on every {online,offline}_pages operation
even though the numa node might not have changed its state.
While we are doing this, remove status_change_nid_normal, as the only
current user (slub) does not really need it.
This allows us to further simplify and clean up the code.
The first patch gets rid of status_change_nid_normal in slub.
The second patch implements a numa node notifier that does just that, and has
those consumers register with it, so they only get notified when they are
actually interested.
The third patch replaces the 'status_change_nid{_normal}' fields within
memory_notify with a plain 'nid', as that is all the memory notifier needs,
and updates its only user (page_ext).
Consumers that are only interested in numa node state changes are:
- memory-tier
- slub
- cpuset
- hmat
- cxl
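For illustration, a rough sketch of what such a consumer could look like with
the interface introduced in patch 2 (the callback name and priority below are
hypothetical; the types and constants are the ones added by this series):

	static int foo_node_callback(struct notifier_block *self,
				     unsigned long action, void *arg)
	{
		struct node_notify *nn = arg;

		/* Only act when a node gains its first memory */
		if (action != NODE_BECAME_MEM_AWARE)
			return NOTIFY_OK;

		pr_info("node %d is now memory-aware\n", nn->status_change_nid);
		return NOTIFY_OK;
	}

	/* registration, e.g. from an __init path */
	hotplug_node_notifier(foo_node_callback, 0);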
Oscar Salvador (3):
mm,slub: Do not special case N_NORMAL nodes for slab_nodes
mm,memory_hotplug: Implement numa node notifier
mm,memory_hotplug: Rename status_change_nid parameter in memory_notify
drivers/acpi/numa/hmat.c | 6 +-
drivers/base/node.c | 19 +++++++
drivers/cxl/core/region.c | 14 ++---
drivers/cxl/cxl.h | 4 +-
include/linux/memory.h | 38 ++++++++++++-
kernel/cgroup/cpuset.c | 2 +-
mm/memory-tiers.c | 8 +--
mm/memory_hotplug.c | 117 +++++++++++++++++++++-----------------
mm/page_ext.c | 12 +---
mm/slub.c | 41 ++-----------
10 files changed, 145 insertions(+), 116 deletions(-)
--
2.49.0
* [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 8:41 [PATCH v2 0/3] Implement numa node notifier Oscar Salvador
@ 2025-04-08 8:41 ` Oscar Salvador
2025-04-08 10:17 ` David Hildenbrand
2025-04-08 8:41 ` [PATCH v2 2/3] mm,memory_hotplug: Implement numa node notifier Oscar Salvador
2025-04-08 8:41 ` [PATCH v2 3/3] mm,memory_hotplug: Rename status_change_nid parameter in memory_notify Oscar Salvador
2 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2025-04-08 8:41 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, linux-mm, linux-kernel, Vlastimil Babka,
Harry Yoo, Jonathan Cameron, linux-cxl, Oscar Salvador
Currently, slab_mem_going_online_callback() checks whether the node has
N_NORMAL memory in order for it to be set in slab_nodes.
While it is true that getting rid of that enforcement would mean
ending up with movable nodes in slab_nodes, the memory waste that comes
with that is negligible.
So stop checking for status_change_nid_normal and just use status_change_nid
instead, which works for both types of memory.
Also, once we allocate the kmem_cache_node cache for the node in
slab_mem_going_online_callback(), we never deallocate it in
slab_mem_offline_callback() when the node goes memoryless, so we can just
get rid of that callback.
The only side effect is that we will stop clearing the node from slab_nodes.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
mm/slub.c | 30 +-----------------------------
1 file changed, 1 insertion(+), 29 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index b46f87662e71..e716b4cb2d0e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6164,36 +6164,12 @@ static int slab_mem_going_offline_callback(void *arg)
return 0;
}
-static void slab_mem_offline_callback(void *arg)
-{
- struct memory_notify *marg = arg;
- int offline_node;
-
- offline_node = marg->status_change_nid_normal;
-
- /*
- * If the node still has available memory. we need kmem_cache_node
- * for it yet.
- */
- if (offline_node < 0)
- return;
-
- mutex_lock(&slab_mutex);
- node_clear(offline_node, slab_nodes);
- /*
- * We no longer free kmem_cache_node structures here, as it would be
- * racy with all get_node() users, and infeasible to protect them with
- * slab_mutex.
- */
- mutex_unlock(&slab_mutex);
-}
-
static int slab_mem_going_online_callback(void *arg)
{
struct kmem_cache_node *n;
struct kmem_cache *s;
struct memory_notify *marg = arg;
- int nid = marg->status_change_nid_normal;
+ int nid = marg->status_change_nid;
int ret = 0;
/*
@@ -6251,10 +6227,6 @@ static int slab_memory_callback(struct notifier_block *self,
case MEM_GOING_OFFLINE:
ret = slab_mem_going_offline_callback(arg);
break;
- case MEM_OFFLINE:
- case MEM_CANCEL_ONLINE:
- slab_mem_offline_callback(arg);
- break;
case MEM_ONLINE:
case MEM_CANCEL_OFFLINE:
break;
--
2.49.0
* [PATCH v2 2/3] mm,memory_hotplug: Implement numa node notifier
2025-04-08 8:41 [PATCH v2 0/3] Implement numa node notifier Oscar Salvador
2025-04-08 8:41 ` [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
@ 2025-04-08 8:41 ` Oscar Salvador
2025-04-09 13:44 ` kernel test robot
2025-04-08 8:41 ` [PATCH v2 3/3] mm,memory_hotplug: Rename status_change_nid parameter in memory_notify Oscar Salvador
2 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2025-04-08 8:41 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, linux-mm, linux-kernel, Vlastimil Babka,
Harry Yoo, Jonathan Cameron, linux-cxl, Oscar Salvador
There are at least five consumers of hotplug_memory_notifier whose only real
interest is whether a numa node changed its state, e.g. going from being
memory-aware to becoming memoryless or vice versa.
Implement a specific notifier for numa nodes that fires when their state
changes, and have those consumers that only care about numa node state
changes use it.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
drivers/acpi/numa/hmat.c | 6 +-
drivers/base/node.c | 19 +++++++
drivers/cxl/core/region.c | 14 ++---
drivers/cxl/cxl.h | 4 +-
include/linux/memory.h | 38 ++++++++++++-
kernel/cgroup/cpuset.c | 2 +-
mm/memory-tiers.c | 8 +--
mm/memory_hotplug.c | 117 +++++++++++++++++++++-----------------
mm/slub.c | 13 ++---
9 files changed, 144 insertions(+), 77 deletions(-)
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index bfbb08b1e6af..d18f3efa2149 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -918,10 +918,10 @@ static int hmat_callback(struct notifier_block *self,
unsigned long action, void *arg)
{
struct memory_target *target;
- struct memory_notify *mnb = arg;
+ struct node_notify *mnb = arg;
int pxm, nid = mnb->status_change_nid;
- if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+ if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
return NOTIFY_OK;
pxm = node_to_pxm(nid);
@@ -1074,7 +1074,7 @@ static __init int hmat_init(void)
hmat_register_targets();
/* Keep the table and structures if the notifier may use them */
- if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
+ if (hotplug_node_notifier(hmat_callback, HMAT_CALLBACK_PRI))
goto out_put;
if (!hmat_set_default_dram_perf())
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..182c71dfb5b8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -110,6 +110,25 @@ static const struct attribute_group *node_access_node_groups[] = {
NULL,
};
+static BLOCKING_NOTIFIER_HEAD(node_chain);
+
+int register_node_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&node_chain, nb);
+}
+EXPORT_SYMBOL(register_node_notifier);
+
+void unregister_node_notifier(struct notifier_block *nb)
+{
+ blocking_notifier_chain_unregister(&node_chain, nb);
+}
+EXPORT_SYMBOL(unregister_node_notifier);
+
+int node_notify(unsigned long val, void *v)
+{
+ return blocking_notifier_call_chain(&node_chain, val, v);
+}
+
static void node_remove_accesses(struct node *node)
{
struct node_access_nodes *c, *cnext;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e8d11a988fd9..7d187088f557 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2409,12 +2409,12 @@ static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
unsigned long action, void *arg)
{
struct cxl_region *cxlr = container_of(nb, struct cxl_region,
- memory_notifier);
- struct memory_notify *mnb = arg;
+ node_notifier);
+ struct node_notify *mnb = arg;
int nid = mnb->status_change_nid;
int region_nid;
- if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+ if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
return NOTIFY_DONE;
/*
@@ -3388,7 +3388,7 @@ static void shutdown_notifiers(void *_cxlr)
{
struct cxl_region *cxlr = _cxlr;
- unregister_memory_notifier(&cxlr->memory_notifier);
+ unregister_node_notifier(&cxlr->node_notifier);
unregister_mt_adistance_algorithm(&cxlr->adist_notifier);
}
@@ -3427,9 +3427,9 @@ static int cxl_region_probe(struct device *dev)
if (rc)
return rc;
- cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
- cxlr->memory_notifier.priority = CXL_CALLBACK_PRI;
- register_memory_notifier(&cxlr->memory_notifier);
+ cxlr->node_notifier.notifier_call = cxl_region_perf_attrs_callback;
+ cxlr->node_notifier.priority = CXL_CALLBACK_PRI;
+ register_node_notifier(&cxlr->node_notifier);
cxlr->adist_notifier.notifier_call = cxl_region_calculate_adistance;
cxlr->adist_notifier.priority = 100;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index bbbaa0d0a670..d4c9a499de7a 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -532,7 +532,7 @@ struct cxl_region_params {
* @flags: Region state flags
* @params: active + config params for the region
* @coord: QoS access coordinates for the region
- * @memory_notifier: notifier for setting the access coordinates to node
+ * @node_notifier: notifier for setting the access coordinates to node
* @adist_notifier: notifier for calculating the abstract distance of node
*/
struct cxl_region {
@@ -545,7 +545,7 @@ struct cxl_region {
unsigned long flags;
struct cxl_region_params params;
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
- struct notifier_block memory_notifier;
+ struct notifier_block node_notifier;
struct notifier_block adist_notifier;
};
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 12daa6ec7d09..a5b8068cf182 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -99,6 +99,14 @@ int set_memory_block_size_order(unsigned int order);
#define MEM_PREPARE_ONLINE (1<<6)
#define MEM_FINISH_OFFLINE (1<<7)
+/* These states are used for numa node notifiers */
+#define NODE_BECOMING_MEM_AWARE (1<<0)
+#define NODE_BECAME_MEM_AWARE (1<<1)
+#define NODE_BECOMING_MEMORYLESS (1<<2)
+#define NODE_BECAME_MEMORYLESS (1<<3)
+#define NODE_CANCEL_MEM_AWARE (1<<4)
+#define NODE_CANCEL_MEMORYLESS (1<<5)
+
struct memory_notify {
/*
* The altmap_start_pfn and altmap_nr_pages fields are designated for
@@ -109,7 +117,10 @@ struct memory_notify {
unsigned long altmap_nr_pages;
unsigned long start_pfn;
unsigned long nr_pages;
- int status_change_nid_normal;
+ int status_change_nid;
+};
+
+struct node_notify {
int status_change_nid;
};
@@ -149,15 +160,34 @@ static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri)
{
return 0;
}
+
+static inline int register_node_notifier(struct notifier_block *nb)
+{
+ return 0;
+}
+static inline void unregister_node_notifier(struct notifier_block *nb)
+{
+}
+static inline int node_notify(unsigned long val, void *v)
+{
+ return 0;
+}
+static inline int hotplug_node_notifier(notifier_fn_t fn, int pri)
+{
+ return 0;
+}
#else /* CONFIG_MEMORY_HOTPLUG */
extern int register_memory_notifier(struct notifier_block *nb);
+extern int register_node_notifier(struct notifier_block *nb);
extern void unregister_memory_notifier(struct notifier_block *nb);
+extern void unregister_node_notifier(struct notifier_block *nb);
int create_memory_block_devices(unsigned long start, unsigned long size,
struct vmem_altmap *altmap,
struct memory_group *group);
void remove_memory_block_devices(unsigned long start, unsigned long size);
extern void memory_dev_init(void);
extern int memory_notify(unsigned long val, void *v);
+extern int node_notify(unsigned long val, void *v);
extern struct memory_block *find_memory_block(unsigned long section_nr);
typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
extern int walk_memory_blocks(unsigned long start, unsigned long size,
@@ -177,6 +207,12 @@ int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
register_memory_notifier(&fn##_mem_nb); \
})
+#define hotplug_node_notifier(fn, pri) ({ \
+ static __meminitdata struct notifier_block fn##_node_nb =\
+ { .notifier_call = fn, .priority = pri };\
+ register_node_notifier(&fn##_node_nb); \
+})
+
#ifdef CONFIG_NUMA
void memory_block_add_nid(struct memory_block *mem, int nid,
enum meminit_context context);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 39c1fc643d77..3323cf2124bf 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3938,7 +3938,7 @@ void __init cpuset_init_smp(void)
cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
top_cpuset.effective_mems = node_states[N_MEMORY];
- hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
+ hotplug_node_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
BUG_ON(!cpuset_migrate_mm_wq);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc14fe53e9b7..dfe6c28c8352 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -872,7 +872,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
unsigned long action, void *_arg)
{
struct memory_tier *memtier;
- struct memory_notify *arg = _arg;
+ struct node_notify *arg = _arg;
/*
* Only update the node migration order when a node is
@@ -882,13 +882,13 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
return notifier_from_errno(0);
switch (action) {
- case MEM_OFFLINE:
+ case NODE_BECAME_MEMORYLESS:
mutex_lock(&memory_tier_lock);
if (clear_node_memory_tier(arg->status_change_nid))
establish_demotion_targets();
mutex_unlock(&memory_tier_lock);
break;
- case MEM_ONLINE:
+ case NODE_BECAME_MEM_AWARE:
mutex_lock(&memory_tier_lock);
memtier = set_node_memory_tier(arg->status_change_nid);
if (!IS_ERR(memtier))
@@ -929,7 +929,7 @@ static int __init memory_tier_init(void)
nodes_and(default_dram_nodes, node_states[N_MEMORY],
node_states[N_CPU]);
- hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
+ hotplug_node_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
return 0;
}
subsys_initcall(memory_tier_init);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8305483de38b..84248f2e36f8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -701,24 +701,18 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
/* check which state of node_states will be changed when online memory */
static void node_states_check_changes_online(unsigned long nr_pages,
- struct zone *zone, struct memory_notify *arg)
+ struct zone *zone, struct node_notify *arg)
{
int nid = zone_to_nid(zone);
arg->status_change_nid = NUMA_NO_NODE;
- arg->status_change_nid_normal = NUMA_NO_NODE;
if (!node_state(nid, N_MEMORY))
arg->status_change_nid = nid;
- if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY))
- arg->status_change_nid_normal = nid;
}
-static void node_states_set_node(int node, struct memory_notify *arg)
+static void node_states_set_node(int node, struct node_notify *arg)
{
- if (arg->status_change_nid_normal >= 0)
- node_set_state(node, N_NORMAL_MEMORY);
-
if (arg->status_change_nid >= 0)
node_set_state(node, N_MEMORY);
}
@@ -1177,7 +1171,9 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
int need_zonelists_rebuild = 0;
const int nid = zone_to_nid(zone);
int ret;
- struct memory_notify arg;
+ struct memory_notify mem_arg;
+ struct node_notify node_arg;
+ bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;
/*
* {on,off}lining is constrained to full memory sections (or more
@@ -1194,11 +1190,22 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
/* associate pfn range with the zone */
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
- arg.start_pfn = pfn;
- arg.nr_pages = nr_pages;
- node_states_check_changes_online(nr_pages, zone, &arg);
+ mem_arg.start_pfn = pfn;
+ mem_arg.nr_pages = nr_pages;
+ node_states_check_changes_online(nr_pages, zone, &node_arg);
+
+ if (node_arg.status_change_nid >= 0) {
+ /* Node is becoming memory-aware. Notify consumers */
+ cancel_node_notifier_on_err = true;
+ ret = node_notify(NODE_BECOMING_MEM_AWARE, &node_arg);
+ ret = notifier_to_errno(ret);
+ if (ret)
+ goto failed_addition;
+ }
- ret = memory_notify(MEM_GOING_ONLINE, &arg);
+ cancel_mem_notifier_on_err = true;
+ mem_arg.status_change_nid = node_arg.status_change_nid;
+ ret = memory_notify(MEM_GOING_ONLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret)
goto failed_addition;
@@ -1224,7 +1231,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
online_pages_range(pfn, nr_pages);
adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
- node_states_set_node(nid, &arg);
+ node_states_set_node(nid, &node_arg);
if (need_zonelists_rebuild)
build_all_zonelists(NULL);
@@ -1245,16 +1252,26 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
kswapd_run(nid);
kcompactd_run(nid);
+ if (node_arg.status_change_nid >= 0)
+ /*
+ * Node went from memoryless to memory-aware. Notify interested
+ * consumers.
+ */
+ node_notify(NODE_BECAME_MEM_AWARE, &node_arg);
+
writeback_set_ratelimit();
- memory_notify(MEM_ONLINE, &arg);
+ memory_notify(MEM_ONLINE, &mem_arg);
return 0;
failed_addition:
pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
(unsigned long long) pfn << PAGE_SHIFT,
(((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
- memory_notify(MEM_CANCEL_ONLINE, &arg);
+ if (cancel_mem_notifier_on_err)
+ memory_notify(MEM_CANCEL_ONLINE, &mem_arg);
+ if (cancel_node_notifier_on_err)
+ node_notify(NODE_CANCEL_MEM_AWARE, &node_arg);
remove_pfn_range_from_zone(zone, pfn, nr_pages);
return ret;
}
@@ -1892,48 +1909,29 @@ early_param("movable_node", cmdline_parse_movable_node);
/* check which state of node_states will be changed when offline memory */
static void node_states_check_changes_offline(unsigned long nr_pages,
- struct zone *zone, struct memory_notify *arg)
+ struct zone *zone, struct node_notify *arg)
{
struct pglist_data *pgdat = zone->zone_pgdat;
unsigned long present_pages = 0;
enum zone_type zt;
arg->status_change_nid = NUMA_NO_NODE;
- arg->status_change_nid_normal = NUMA_NO_NODE;
/*
- * Check whether node_states[N_NORMAL_MEMORY] will be changed.
- * If the memory to be offline is within the range
- * [0..ZONE_NORMAL], and it is the last present memory there,
- * the zones in that range will become empty after the offlining,
- * thus we can determine that we need to clear the node from
- * node_states[N_NORMAL_MEMORY].
+ * Here we count the possible pages within the range [0..ZONE_MOVABLE].
+ * If after having accounted all the pages, we see that the nr_pages to
+ * be offlined is over or equal to the accounted pages, we know that the
+ * node will become empty, and so we can clear it for N_MEMORY.
*/
- for (zt = 0; zt <= ZONE_NORMAL; zt++)
+ for (zt = 0; zt <= ZONE_MOVABLE; zt++)
present_pages += pgdat->node_zones[zt].present_pages;
- if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages)
- arg->status_change_nid_normal = zone_to_nid(zone);
-
- /*
- * We have accounted the pages from [0..ZONE_NORMAL); ZONE_HIGHMEM
- * does not apply as we don't support 32bit.
- * Here we count the possible pages from ZONE_MOVABLE.
- * If after having accounted all the pages, we see that the nr_pages
- * to be offlined is over or equal to the accounted pages,
- * we know that the node will become empty, and so, we can clear
- * it for N_MEMORY as well.
- */
- present_pages += pgdat->node_zones[ZONE_MOVABLE].present_pages;
if (nr_pages >= present_pages)
arg->status_change_nid = zone_to_nid(zone);
}
-static void node_states_clear_node(int node, struct memory_notify *arg)
+static void node_states_clear_node(int node, struct node_notify *arg)
{
- if (arg->status_change_nid_normal >= 0)
- node_clear_state(node, N_NORMAL_MEMORY);
-
if (arg->status_change_nid >= 0)
node_clear_state(node, N_MEMORY);
}
@@ -1957,7 +1955,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
unsigned long pfn, managed_pages, system_ram_pages = 0;
const int node = zone_to_nid(zone);
unsigned long flags;
- struct memory_notify arg;
+ struct memory_notify mem_arg;
+ struct node_notify node_arg;
+ bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;
char *reason;
int ret;
@@ -2016,11 +2016,21 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
goto failed_removal_pcplists_disabled;
}
- arg.start_pfn = start_pfn;
- arg.nr_pages = nr_pages;
- node_states_check_changes_offline(nr_pages, zone, &arg);
+ mem_arg.start_pfn = start_pfn;
+ mem_arg.nr_pages = nr_pages;
+ node_states_check_changes_offline(nr_pages, zone, &node_arg);
+
+ if (node_arg.status_change_nid >= 0) {
+ cancel_node_notifier_on_err = true;
+ ret = node_notify(NODE_BECOMING_MEMORYLESS, &node_arg);
+ ret = notifier_to_errno(ret);
+ if (ret)
+ goto failed_removal_isolated;
+ }
- ret = memory_notify(MEM_GOING_OFFLINE, &arg);
+ cancel_mem_notifier_on_err = true;
+ mem_arg.status_change_nid = node_arg.status_change_nid;
+ ret = memory_notify(MEM_GOING_OFFLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret) {
reason = "notifier failure";
@@ -2100,27 +2110,32 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
* Make sure to mark the node as memory-less before rebuilding the zone
* list. Otherwise this node would still appear in the fallback lists.
*/
- node_states_clear_node(node, &arg);
+ node_states_clear_node(node, &node_arg);
if (!populated_zone(zone)) {
zone_pcp_reset(zone);
build_all_zonelists(NULL);
}
- if (arg.status_change_nid >= 0) {
+ if (node_arg.status_change_nid >= 0) {
kcompactd_stop(node);
kswapd_stop(node);
+ /* Node went memoryless. Notify interested consumers */
+ node_notify(NODE_BECAME_MEMORYLESS, &node_arg);
}
writeback_set_ratelimit();
- memory_notify(MEM_OFFLINE, &arg);
+ memory_notify(MEM_OFFLINE, &mem_arg);
remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
return 0;
failed_removal_isolated:
/* pushback to free area */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
- memory_notify(MEM_CANCEL_OFFLINE, &arg);
+ if (cancel_mem_notifier_on_err)
+ memory_notify(MEM_CANCEL_OFFLINE, &mem_arg);
+ if (cancel_node_notifier_on_err)
+ node_notify(NODE_CANCEL_MEMORYLESS, &node_arg);
failed_removal_pcplists_disabled:
lru_cache_enable();
zone_pcp_enable(zone);
diff --git a/mm/slub.c b/mm/slub.c
index e716b4cb2d0e..5c0f5d33b551 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6168,8 +6168,8 @@ static int slab_mem_going_online_callback(void *arg)
{
struct kmem_cache_node *n;
struct kmem_cache *s;
- struct memory_notify *marg = arg;
- int nid = marg->status_change_nid;
+ struct node_notify *narg = arg;
+ int nid = narg->status_change_nid;
int ret = 0;
/*
@@ -6221,15 +6221,12 @@ static int slab_memory_callback(struct notifier_block *self,
int ret = 0;
switch (action) {
- case MEM_GOING_ONLINE:
+ case NODE_BECOMING_MEM_AWARE:
ret = slab_mem_going_online_callback(arg);
break;
- case MEM_GOING_OFFLINE:
+ case NODE_BECOMING_MEMORYLESS:
ret = slab_mem_going_offline_callback(arg);
break;
- case MEM_ONLINE:
- case MEM_CANCEL_OFFLINE:
- break;
}
if (ret)
ret = notifier_from_errno(ret);
@@ -6304,7 +6301,7 @@ void __init kmem_cache_init(void)
sizeof(struct kmem_cache_node),
SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
- hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
+ hotplug_node_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
/* Able to allocate the per node structures */
slab_state = PARTIAL;
--
2.49.0
* [PATCH v2 3/3] mm,memory_hotplug: Rename status_change_nid parameter in memory_notify
2025-04-08 8:41 [PATCH v2 0/3] Implement numa node notifier Oscar Salvador
2025-04-08 8:41 ` [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
2025-04-08 8:41 ` [PATCH v2 2/3] mm,memory_hotplug: Implement numa node notifier Oscar Salvador
@ 2025-04-08 8:41 ` Oscar Salvador
2 siblings, 0 replies; 17+ messages in thread
From: Oscar Salvador @ 2025-04-08 8:41 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, linux-mm, linux-kernel, Vlastimil Babka,
Harry Yoo, Jonathan Cameron, linux-cxl, Oscar Salvador
The 'status_change_nid' field was used to track changes in the memory
state of a numa node, but that functionality has been decoupled from
memory_notify and moved to node_notify.
Current consumers of memory_notify are only interested in which node the
memory we are adding belongs to, so rename the current 'status_change_nid'
field to 'nid'.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/memory.h | 2 +-
mm/memory_hotplug.c | 4 ++--
mm/page_ext.c | 12 +-----------
3 files changed, 4 insertions(+), 14 deletions(-)
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a5b8068cf182..241a98e31277 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -117,7 +117,7 @@ struct memory_notify {
unsigned long altmap_nr_pages;
unsigned long start_pfn;
unsigned long nr_pages;
- int status_change_nid;
+ int nid;
};
struct node_notify {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 84248f2e36f8..19f2f8a08645 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1192,6 +1192,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
mem_arg.start_pfn = pfn;
mem_arg.nr_pages = nr_pages;
+ mem_arg.nid = nid;
node_states_check_changes_online(nr_pages, zone, &node_arg);
if (node_arg.status_change_nid >= 0) {
@@ -1204,7 +1205,6 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
}
cancel_mem_notifier_on_err = true;
- mem_arg.status_change_nid = node_arg.status_change_nid;
ret = memory_notify(MEM_GOING_ONLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret)
@@ -2018,6 +2018,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
mem_arg.start_pfn = start_pfn;
mem_arg.nr_pages = nr_pages;
+ mem_arg.nid = node;
node_states_check_changes_offline(nr_pages, zone, &node_arg);
if (node_arg.status_change_nid >= 0) {
@@ -2029,7 +2030,6 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
}
cancel_mem_notifier_on_err = true;
- mem_arg.status_change_nid = node_arg.status_change_nid;
ret = memory_notify(MEM_GOING_OFFLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret) {
diff --git a/mm/page_ext.c b/mm/page_ext.c
index c351fdfe9e9a..477e6f24b7ab 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -378,16 +378,6 @@ static int __meminit online_page_ext(unsigned long start_pfn,
start = SECTION_ALIGN_DOWN(start_pfn);
end = SECTION_ALIGN_UP(start_pfn + nr_pages);
- if (nid == NUMA_NO_NODE) {
- /*
- * In this case, "nid" already exists and contains valid memory.
- * "start_pfn" passed to us is a pfn which is an arg for
- * online__pages(), and start_pfn should exist.
- */
- nid = pfn_to_nid(start_pfn);
- VM_BUG_ON(!node_online(nid));
- }
-
for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION)
fail = init_section_page_ext(pfn, nid);
if (!fail)
@@ -436,7 +426,7 @@ static int __meminit page_ext_callback(struct notifier_block *self,
switch (action) {
case MEM_GOING_ONLINE:
ret = online_page_ext(mn->start_pfn,
- mn->nr_pages, mn->status_change_nid);
+ mn->nr_pages, mn->nid);
break;
case MEM_OFFLINE:
offline_page_ext(mn->start_pfn,
--
2.49.0
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 8:41 ` [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
@ 2025-04-08 10:17 ` David Hildenbrand
2025-04-08 12:49 ` Oscar Salvador
2025-04-08 14:18 ` Harry Yoo
0 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-04-08 10:17 UTC (permalink / raw)
To: Oscar Salvador, Andrew Morton
Cc: linux-mm, linux-kernel, Vlastimil Babka, Harry Yoo,
Jonathan Cameron, linux-cxl
On 08.04.25 10:41, Oscar Salvador wrote:
> Currently, slab_mem_going_going_callback() checks whether the node has
> N_NORMAL memory in order to be set in slab_nodes.
> While it is true that gettind rid of that enforcing would mean
> ending up with movables nodes in slab_nodes, the memory waste that comes
> with that is negligible.
>
> So stop checking for status_change_nid_normal and just use status_change_nid
> instead which works for both types of memory.
>
> Also, once we allocate the kmem_cache_node cache for the node in
> slab_mem_online_callback(), we never deallocate it in
> slab_mem_off_callback() when the node goes memoryless, so we can just
> get rid of it.
>
> The only side effect is that we will stop clearing the node from slab_nodes.
>
Feel free to add a Suggested-by: if you think it applies.
Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init()? Likely it
would have to be an N_MEMORY check.
But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
From 518a2b83a9c5bd85d74ddabbc36ce5d181a88ed6 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 8 Apr 2025 12:16:13 +0200
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/slub.c | 56 ++++---------------------------------------------------
1 file changed, 4 insertions(+), 52 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index b46f87662e71d..afe31149e7f4e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -445,14 +445,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
for (__node = 0; __node < nr_node_ids; __node++) \
if ((__n = get_node(__s, __node)))
-/*
- * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
- * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
- * differ during memory hotplug/hotremove operations.
- * Protected by slab_mutex.
- */
-static nodemask_t slab_nodes;
-
#ifndef CONFIG_SLUB_TINY
/*
* Workqueue used for flush_cpu_slab().
@@ -3706,10 +3698,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (!slab) {
/*
* if the node is not online or has no normal memory, just
- * ignore the node constraint
+ * ignore the node constraint.
*/
- if (unlikely(node != NUMA_NO_NODE &&
- !node_isset(node, slab_nodes)))
+ if (unlikely(node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY)))
node = NUMA_NO_NODE;
goto new_slab;
}
@@ -3719,7 +3710,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* same as above but node_match() being false already
* implies node != NUMA_NO_NODE
*/
- if (!node_isset(node, slab_nodes)) {
+ if (!node_state(node, N_NORMAL_MEMORY)) {
node = NUMA_NO_NODE;
} else {
stat(s, ALLOC_NODE_MISMATCH);
@@ -5623,7 +5614,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
{
int node;
- for_each_node_mask(node, slab_nodes) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n;
if (slab_state == DOWN) {
@@ -6164,30 +6155,6 @@ static int slab_mem_going_offline_callback(void *arg)
return 0;
}
-static void slab_mem_offline_callback(void *arg)
-{
- struct memory_notify *marg = arg;
- int offline_node;
-
- offline_node = marg->status_change_nid_normal;
-
- /*
- * If the node still has available memory. we need kmem_cache_node
- * for it yet.
- */
- if (offline_node < 0)
- return;
-
- mutex_lock(&slab_mutex);
- node_clear(offline_node, slab_nodes);
- /*
- * We no longer free kmem_cache_node structures here, as it would be
- * racy with all get_node() users, and infeasible to protect them with
- * slab_mutex.
- */
- mutex_unlock(&slab_mutex);
-}
-
static int slab_mem_going_online_callback(void *arg)
{
struct kmem_cache_node *n;
@@ -6229,11 +6196,6 @@ static int slab_mem_going_online_callback(void *arg)
init_kmem_cache_node(n);
s->node[nid] = n;
}
- /*
- * Any cache created after this point will also have kmem_cache_node
- * initialized for the new node.
- */
- node_set(nid, slab_nodes);
out:
mutex_unlock(&slab_mutex);
return ret;
@@ -6253,8 +6215,6 @@ static int slab_memory_callback(struct notifier_block *self,
break;
case MEM_OFFLINE:
case MEM_CANCEL_ONLINE:
- slab_mem_offline_callback(arg);
- break;
case MEM_ONLINE:
case MEM_CANCEL_OFFLINE:
break;
@@ -6309,7 +6269,6 @@ void __init kmem_cache_init(void)
{
static __initdata struct kmem_cache boot_kmem_cache,
boot_kmem_cache_node;
- int node;
if (debug_guardpage_minorder())
slub_max_order = 0;
@@ -6321,13 +6280,6 @@ void __init kmem_cache_init(void)
kmem_cache_node = &boot_kmem_cache_node;
kmem_cache = &boot_kmem_cache;
- /*
- * Initialize the nodemask for which we will allocate per node
- * structures. Here we don't need taking slab_mutex yet.
- */
- for_each_node_state(node, N_NORMAL_MEMORY)
- node_set(node, slab_nodes);
-
create_boot_cache(kmem_cache_node, "kmem_cache_node",
sizeof(struct kmem_cache_node),
SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
--
2.48.1
Not sure if there are any races to consider ... just an idea.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 10:17 ` David Hildenbrand
@ 2025-04-08 12:49 ` Oscar Salvador
2025-04-08 15:15 ` Harry Yoo
2025-04-08 14:18 ` Harry Yoo
1 sibling, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2025-04-08 12:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, linux-kernel, Vlastimil Babka,
Harry Yoo, Jonathan Cameron, linux-cxl
On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
> Feel free to add a Suggested-by: if you think it applies.
Sorry David, my bad, totally missed it.
I shall add it.
> Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init() ? Likely it
> would have to be a N_MEMORY check.
Yes, should be N_MEMORY.
> But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
I glanced over it and I did not see anything wrong with it, just a
question below.
So, if Vlastimil and Harry think this is fine, we can indeed do this.
If so, I would combine this with patch #1 of this series and add
your Signed-off-by as co-author. Is that fine by you?
> @@ -3706,10 +3698,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> if (!slab) {
> /*
> * if the node is not online or has no normal memory, just
> - * ignore the node constraint
> + * ignore the node constraint.
> */
> - if (unlikely(node != NUMA_NO_NODE &&
> - !node_isset(node, slab_nodes)))
> + if (unlikely(node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY)))
> node = NUMA_NO_NODE;
After my first patch, slab_nodes will also contain N_MEMORY nodes, which
makes me wonder whether that check should be N_MEMORY instead?
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 10:17 ` David Hildenbrand
2025-04-08 12:49 ` Oscar Salvador
@ 2025-04-08 14:18 ` Harry Yoo
2025-04-08 14:25 ` David Hildenbrand
1 sibling, 1 reply; 17+ messages in thread
From: Harry Yoo @ 2025-04-08 14:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Oscar Salvador, Andrew Morton, linux-mm, linux-kernel,
Vlastimil Babka, Jonathan Cameron, linux-cxl
On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
> On 08.04.25 10:41, Oscar Salvador wrote:
> > Currently, slab_mem_going_going_callback() checks whether the node has
> > N_NORMAL memory in order to be set in slab_nodes.
> > While it is true that gettind rid of that enforcing would mean
> > ending up with movables nodes in slab_nodes, the memory waste that comes
> > with that is negligible.
> >
> > So stop checking for status_change_nid_normal and just use status_change_nid
> > instead which works for both types of memory.
> >
> > Also, once we allocate the kmem_cache_node cache for the node in
> > slab_mem_online_callback(), we never deallocate it in
> > slab_mem_off_callback() when the node goes memoryless, so we can just
> > get rid of it.
> >
> > The only side effect is that we will stop clearing the node from slab_nodes.
> >
>
> Feel free to add a Suggested-by: if you think it applies.
>
>
> Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init() ? Likely it
> would have to be a N_MEMORY check.
>
>
> But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
The following commit says that SLUB has the slab_nodes thingy for a reason...
kmem_cache_node might not be ready yet even when the N_NORMAL_MEMORY check
says the node now has normal memory.
@Vlastimil maybe a dumb question but why not check s->node[nid]
instead of having slab_nodes bitmask?
commit 7e1fa93deff44677a94dfc323ff629bbf5cf9360
Author: Vlastimil Babka <vbabka@suse.cz>
Date: Wed Feb 24 12:01:12 2021 -0800
mm, slab, slub: stop taking memory hotplug lock
Since commit 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
SLAB and SLUB when creating, destroying or shrinking a cache. It is quite
a heavy lock and it's best to avoid it if possible, as we had several
issues with lockdep complaining about ordering in the past, see e.g.
e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").
The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
can be summarized as follows: while there's slab_mutex synchronizing new
kmem cache creation and SLUB's MEM_GOING_ONLINE callback
slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
for the hotplugged node in the new kmem cache, because the hotplug
callback doesn't yet see the new cache, and cache creation in
init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
as that happens only later after the MEM_GOING_ONLINE callback.
Instead of using get/put_online_mems(), the problem can be solved by SLUB
maintaining its own nodemask of nodes for which it has allocated the
per-node kmem_cache_node structures. This nodemask would generally mirror
the N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's
control in its memory hotplug callbacks under the slab_mutex. This patch
adds such nodemask and its handling.
Commit 03afc0e25f7f mentiones "issues like [the one above]", but there
don't appear to be further issues. All the paths (shared for SLAB and
SLUB) taking the memory hotplug locks are also taking the slab_mutex,
except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
get/put_online_mems().
We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
SLUB can enters the function from a write to sysfs 'shrink' file, thus
holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
within slab_mutex. But on closer inspection we don't actually need to
protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
__kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
added in parallel hotplug is not fatal, and parallel hotremove does not
free kmem_cache_node's anymore after the previous patch, so use-after free
cannot happen. The per-node shrinking itself is protected by
n->list_lock. Same is true for SLAB, and SLOB is no-op.
SLAB also doesn't need the memory hotplug locking, which it only gained by
03afc0e25f7f through the shared paths in slab_common.c. Its memory
hotplug callbacks are also protected by slab_mutex against races with
these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
node is already set there during the MEM_GOING_ONLINE callback, so no
special care is needed for SLAB.
As such, this patch removes all get/put_online_mems() usage by the slab
subsystem.
Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <cai@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> From 518a2b83a9c5bd85d74ddabbc36ce5d181a88ed6 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 8 Apr 2025 12:16:13 +0200
> Subject: [PATCH] tmp
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/slub.c | 56 ++++---------------------------------------------------
> 1 file changed, 4 insertions(+), 52 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index b46f87662e71d..afe31149e7f4e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -445,14 +445,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> for (__node = 0; __node < nr_node_ids; __node++) \
> if ((__n = get_node(__s, __node)))
> -/*
> - * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
> - * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
> - * differ during memory hotplug/hotremove operations.
> - * Protected by slab_mutex.
> - */
> -static nodemask_t slab_nodes;
> -
> #ifndef CONFIG_SLUB_TINY
> /*
> * Workqueue used for flush_cpu_slab().
> @@ -3706,10 +3698,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> if (!slab) {
> /*
> * if the node is not online or has no normal memory, just
> - * ignore the node constraint
> + * ignore the node constraint.
> */
> - if (unlikely(node != NUMA_NO_NODE &&
> - !node_isset(node, slab_nodes)))
> + if (unlikely(node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY)))
> node = NUMA_NO_NODE;
> goto new_slab;
> }
> @@ -3719,7 +3710,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> * same as above but node_match() being false already
> * implies node != NUMA_NO_NODE
> */
> - if (!node_isset(node, slab_nodes)) {
> + if (!node_state(node, N_NORMAL_MEMORY)) {
> node = NUMA_NO_NODE;
> } else {
> stat(s, ALLOC_NODE_MISMATCH);
> @@ -5623,7 +5614,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> {
> int node;
> - for_each_node_mask(node, slab_nodes) {
> + for_each_node_state(node, N_NORMAL_MEMORY) {
> struct kmem_cache_node *n;
> if (slab_state == DOWN) {
> @@ -6164,30 +6155,6 @@ static int slab_mem_going_offline_callback(void *arg)
> return 0;
> }
> -static void slab_mem_offline_callback(void *arg)
> -{
> - struct memory_notify *marg = arg;
> - int offline_node;
> -
> - offline_node = marg->status_change_nid_normal;
> -
> - /*
> - * If the node still has available memory. we need kmem_cache_node
> - * for it yet.
> - */
> - if (offline_node < 0)
> - return;
> -
> - mutex_lock(&slab_mutex);
> - node_clear(offline_node, slab_nodes);
> - /*
> - * We no longer free kmem_cache_node structures here, as it would be
> - * racy with all get_node() users, and infeasible to protect them with
> - * slab_mutex.
> - */
> - mutex_unlock(&slab_mutex);
> -}
> -
> static int slab_mem_going_online_callback(void *arg)
> {
> struct kmem_cache_node *n;
> @@ -6229,11 +6196,6 @@ static int slab_mem_going_online_callback(void *arg)
> init_kmem_cache_node(n);
> s->node[nid] = n;
> }
> - /*
> - * Any cache created after this point will also have kmem_cache_node
> - * initialized for the new node.
> - */
> - node_set(nid, slab_nodes);
> out:
> mutex_unlock(&slab_mutex);
> return ret;
> @@ -6253,8 +6215,6 @@ static int slab_memory_callback(struct notifier_block *self,
> break;
> case MEM_OFFLINE:
> case MEM_CANCEL_ONLINE:
> - slab_mem_offline_callback(arg);
> - break;
> case MEM_ONLINE:
> case MEM_CANCEL_OFFLINE:
> break;
> @@ -6309,7 +6269,6 @@ void __init kmem_cache_init(void)
> {
> static __initdata struct kmem_cache boot_kmem_cache,
> boot_kmem_cache_node;
> - int node;
> if (debug_guardpage_minorder())
> slub_max_order = 0;
> @@ -6321,13 +6280,6 @@ void __init kmem_cache_init(void)
> kmem_cache_node = &boot_kmem_cache_node;
> kmem_cache = &boot_kmem_cache;
> - /*
> - * Initialize the nodemask for which we will allocate per node
> - * structures. Here we don't need taking slab_mutex yet.
> - */
> - for_each_node_state(node, N_NORMAL_MEMORY)
> - node_set(node, slab_nodes);
> -
> create_boot_cache(kmem_cache_node, "kmem_cache_node",
> sizeof(struct kmem_cache_node),
> SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
> --
> 2.48.1
>
>
> Not sure if there are any races to consider ... just an idea.
>
> --
> Cheers,
>
> David / dhildenb
>
--
Cheers,
Harry (formerly known as Hyeonggon)
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 14:18 ` Harry Yoo
@ 2025-04-08 14:25 ` David Hildenbrand
2025-04-08 14:54 ` Harry Yoo
2025-04-08 17:55 ` Vlastimil Babka
0 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-04-08 14:25 UTC (permalink / raw)
To: Harry Yoo
Cc: Oscar Salvador, Andrew Morton, linux-mm, linux-kernel,
Vlastimil Babka, Jonathan Cameron, linux-cxl
On 08.04.25 16:18, Harry Yoo wrote:
> On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
>> On 08.04.25 10:41, Oscar Salvador wrote:
>>> Currently, slab_mem_going_going_callback() checks whether the node has
>>> N_NORMAL memory in order to be set in slab_nodes.
>>> While it is true that gettind rid of that enforcing would mean
>>> ending up with movables nodes in slab_nodes, the memory waste that comes
>>> with that is negligible.
>>>
>>> So stop checking for status_change_nid_normal and just use status_change_nid
>>> instead which works for both types of memory.
>>>
>>> Also, once we allocate the kmem_cache_node cache for the node in
>>> slab_mem_online_callback(), we never deallocate it in
>>> slab_mem_off_callback() when the node goes memoryless, so we can just
>>> get rid of it.
>>>
>>> The only side effect is that we will stop clearing the node from slab_nodes.
>>>
>>
>> Feel free to add a Suggested-by: if you think it applies.
>>
>>
>> Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init() ? Likely it
>> would have to be a N_MEMORY check.
>>
>>
>> But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
>
> The following commit says that SLUB has slab_nodes thingy for a reason...
> kmem_cache_node might not be ready yet even when N_NORMAL_MEMORY check
> says it now has normal memory.
node_states_set_node() is called from memory hotplug code after
MEM_GOING_ONLINE and after online_pages_range().
Pages might be isolated at that point, but the node state is only set, via
node_states_set_node(), after the memory notifier (MEM_GOING_ONLINE) has been triggered.
So I don't immediately see the problem assuming that we never free the
structures.
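A rough sketch of that ordering, as online_pages() does it today:

	memory_notify(MEM_GOING_ONLINE, &arg);	/* slab allocates kmem_cache_node here */
	online_pages_range(pfn, nr_pages);
	node_states_set_node(nid, &arg);	/* N_NORMAL_MEMORY/N_MEMORY set only here */
	...
	memory_notify(MEM_ONLINE, &arg);

so by the time the node shows up in N_NORMAL_MEMORY, the GOING_ONLINE callback
has already had its chance to allocate the structures.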
But yeah, this is what I raised below: "Not sure if there are any races
to consider" :)
>
> @Vlastimil maybe a dumb question but why not check s->node[nid]
> instead of having slab_nodes bitmask?
>
> commit 7e1fa93deff44677a94dfc323ff629bbf5cf9360
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed Feb 24 12:01:12 2021 -0800
>
> mm, slab, slub: stop taking memory hotplug lock
>
> Since commit 03afc0e25f7f ("slab: get_online_mems for
> kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
> SLAB and SLUB when creating, destroying or shrinking a cache. It is quite
> a heavy lock and it's best to avoid it if possible, as we had several
> issues with lockdep complaining about ordering in the past, see e.g.
> e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").
>
> The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
> can be summarized as follows: while there's slab_mutex synchronizing new
> kmem cache creation and SLUB's MEM_GOING_ONLINE callback
> slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
> for the hotplugged node in the new kmem cache, because the hotplug
> callback doesn't yet see the new cache, and cache creation in
> init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
> N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
> as that happens only later after the MEM_GOING_ONLINE callback.
>
> Instead of using get/put_online_mems(), the problem can be solved by SLUB
> maintaining its own nodemask of nodes for which it has allocated the
> per-node kmem_cache_node structures. This nodemask would generally mirror
> the N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's
> control in its memory hotplug callbacks under the slab_mutex. This patch
> adds such nodemask and its handling.
>
> Commit 03afc0e25f7f mentiones "issues like [the one above]", but there
> don't appear to be further issues. All the paths (shared for SLAB and
> SLUB) taking the memory hotplug locks are also taking the slab_mutex,
> except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
> get/put_online_mems().
>
> We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
> SLUB can enters the function from a write to sysfs 'shrink' file, thus
> holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
> within slab_mutex. But on closer inspection we don't actually need to
> protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
> __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
> added in parallel hotplug is not fatal, and parallel hotremove does not
> free kmem_cache_node's anymore after the previous patch, so use-after free
> cannot happen. The per-node shrinking itself is protected by
> n->list_lock. Same is true for SLAB, and SLOB is no-op.
>
> SLAB also doesn't need the memory hotplug locking, which it only gained by
> 03afc0e25f7f through the shared paths in slab_common.c. Its memory
> hotplug callbacks are also protected by slab_mutex against races with
> these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
> to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
> node is already set there during the MEM_GOING_ONLINE callback, so no
> special care is needed for SLAB.
>
> As such, this patch removes all get/put_online_mems() usage by the slab
> subsystem.
>
> Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Qian Cai <cai@redhat.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
>>
>> From 518a2b83a9c5bd85d74ddabbc36ce5d181a88ed6 Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@redhat.com>
>> Date: Tue, 8 Apr 2025 12:16:13 +0200
>> Subject: [PATCH] tmp
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> mm/slub.c | 56 ++++---------------------------------------------------
>> 1 file changed, 4 insertions(+), 52 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index b46f87662e71d..afe31149e7f4e 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -445,14 +445,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>> for (__node = 0; __node < nr_node_ids; __node++) \
>> if ((__n = get_node(__s, __node)))
>> -/*
>> - * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
>> - * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
>> - * differ during memory hotplug/hotremove operations.
>> - * Protected by slab_mutex.
>> - */
>> -static nodemask_t slab_nodes;
>> -
>> #ifndef CONFIG_SLUB_TINY
>> /*
>> * Workqueue used for flush_cpu_slab().
>> @@ -3706,10 +3698,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> if (!slab) {
>> /*
>> * if the node is not online or has no normal memory, just
>> - * ignore the node constraint
>> + * ignore the node constraint.
>> */
>> - if (unlikely(node != NUMA_NO_NODE &&
>> - !node_isset(node, slab_nodes)))
>> + if (unlikely(node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY)))
>> node = NUMA_NO_NODE;
>> goto new_slab;
>> }
>> @@ -3719,7 +3710,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> * same as above but node_match() being false already
>> * implies node != NUMA_NO_NODE
>> */
>> - if (!node_isset(node, slab_nodes)) {
>> + if (!node_state(node, N_NORMAL_MEMORY)) {
>> node = NUMA_NO_NODE;
>> } else {
>> stat(s, ALLOC_NODE_MISMATCH);
>> @@ -5623,7 +5614,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>> {
>> int node;
>> - for_each_node_mask(node, slab_nodes) {
>> + for_each_node_state(node, N_NORMAL_MEMORY) {
>> struct kmem_cache_node *n;
>> if (slab_state == DOWN) {
>> @@ -6164,30 +6155,6 @@ static int slab_mem_going_offline_callback(void *arg)
>> return 0;
>> }
>> -static void slab_mem_offline_callback(void *arg)
>> -{
>> - struct memory_notify *marg = arg;
>> - int offline_node;
>> -
>> - offline_node = marg->status_change_nid_normal;
>> -
>> - /*
>> - * If the node still has available memory. we need kmem_cache_node
>> - * for it yet.
>> - */
>> - if (offline_node < 0)
>> - return;
>> -
>> - mutex_lock(&slab_mutex);
>> - node_clear(offline_node, slab_nodes);
>> - /*
>> - * We no longer free kmem_cache_node structures here, as it would be
>> - * racy with all get_node() users, and infeasible to protect them with
>> - * slab_mutex.
>> - */
>> - mutex_unlock(&slab_mutex);
>> -}
>> -
>> static int slab_mem_going_online_callback(void *arg)
>> {
>> struct kmem_cache_node *n;
>> @@ -6229,11 +6196,6 @@ static int slab_mem_going_online_callback(void *arg)
>> init_kmem_cache_node(n);
>> s->node[nid] = n;
>> }
>> - /*
>> - * Any cache created after this point will also have kmem_cache_node
>> - * initialized for the new node.
>> - */
>> - node_set(nid, slab_nodes);
>> out:
>> mutex_unlock(&slab_mutex);
>> return ret;
>> @@ -6253,8 +6215,6 @@ static int slab_memory_callback(struct notifier_block *self,
>> break;
>> case MEM_OFFLINE:
>> case MEM_CANCEL_ONLINE:
>> - slab_mem_offline_callback(arg);
>> - break;
>> case MEM_ONLINE:
>> case MEM_CANCEL_OFFLINE:
>> break;
>> @@ -6309,7 +6269,6 @@ void __init kmem_cache_init(void)
>> {
>> static __initdata struct kmem_cache boot_kmem_cache,
>> boot_kmem_cache_node;
>> - int node;
>> if (debug_guardpage_minorder())
>> slub_max_order = 0;
>> @@ -6321,13 +6280,6 @@ void __init kmem_cache_init(void)
>> kmem_cache_node = &boot_kmem_cache_node;
>> kmem_cache = &boot_kmem_cache;
>> - /*
>> - * Initialize the nodemask for which we will allocate per node
>> - * structures. Here we don't need taking slab_mutex yet.
>> - */
>> - for_each_node_state(node, N_NORMAL_MEMORY)
>> - node_set(node, slab_nodes);
>> -
>> create_boot_cache(kmem_cache_node, "kmem_cache_node",
>> sizeof(struct kmem_cache_node),
>> SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
>> --
>> 2.48.1
>>
>>
>> Not sure if there are any races to consider ... just an idea.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 14:25 ` David Hildenbrand
@ 2025-04-08 14:54 ` Harry Yoo
2025-04-08 17:55 ` Vlastimil Babka
1 sibling, 0 replies; 17+ messages in thread
From: Harry Yoo @ 2025-04-08 14:54 UTC (permalink / raw)
To: David Hildenbrand
Cc: Oscar Salvador, Andrew Morton, linux-mm, linux-kernel,
Vlastimil Babka, Jonathan Cameron, linux-cxl
On Tue, Apr 08, 2025 at 04:25:33PM +0200, David Hildenbrand wrote:
> On 08.04.25 16:18, Harry Yoo wrote:
> > On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
> > > On 08.04.25 10:41, Oscar Salvador wrote:
> > > > Currently, slab_mem_going_online_callback() checks whether the node has
> > > > N_NORMAL memory in order to be set in slab_nodes.
> > > > While it is true that getting rid of that enforcement would mean
> > > > ending up with movable nodes in slab_nodes, the memory waste that comes
> > > > with that is negligible.
> > > >
> > > > So stop checking for status_change_nid_normal and just use status_change_nid
> > > > instead which works for both types of memory.
> > > >
> > > > Also, once we allocate the kmem_cache_node cache for the node in
> > > > slab_mem_going_online_callback(), we never deallocate it in
> > > > slab_mem_offline_callback() when the node goes memoryless, so we can just
> > > > get rid of it.
> > > >
> > > > The only side effect is that we will stop clearing the node from slab_nodes.
> > > >
> > >
> > > Feel free to add a Suggested-by: if you think it applies.
> > >
> > >
> > > Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init() ? Likely it
> > > would have to be a N_MEMORY check.
> > >
> > >
> > > But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
> >
> > The following commit says that SLUB has slab_nodes thingy for a reason...
> > kmem_cache_node might not be ready yet even when N_NORMAL_MEMORY check
> > says it now has normal memory.
>
> node_states_set_node() is called from memory hotplug code after
> MEM_GOING_ONLINE and after online_pages_range().
>
> Pages might be isolated at that point, but node_states_set_node() is set
> only after the memory notifier (MEM_GOING_ONLINE) was triggered.
Oh right, didn't realize that. Thanks.
> So I don't immediately see the problem assuming that we never free the
> structures.
>
> But yeah, this is what I raised below: "Not sure if there are any races to
> consider" :)
At least on the slab side I don't see any races that need to be
addressed. Hopefully I didn't overlook anything :)
> > @Vlastimil maybe a dumb question but why not check s->node[nid]
> > instead of having slab_nodes bitmask?
> >
> > commit 7e1fa93deff44677a94dfc323ff629bbf5cf9360
> > Author: Vlastimil Babka <vbabka@suse.cz>
> > Date: Wed Feb 24 12:01:12 2021 -0800
> >
> > mm, slab, slub: stop taking memory hotplug lock
> > Since commit 03afc0e25f7f ("slab: get_online_mems for
> > kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
> > SLAB and SLUB when creating, destroying or shrinking a cache. It is quite
> > a heavy lock and it's best to avoid it if possible, as we had several
> > issues with lockdep complaining about ordering in the past, see e.g.
> > e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").
> > The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
> > can be summarized as follows: while there's slab_mutex synchronizing new
> > kmem cache creation and SLUB's MEM_GOING_ONLINE callback
> > slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
> > for the hotplugged node in the new kmem cache, because the hotplug
> > callback doesn't yet see the new cache, and cache creation in
> > init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
> > N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
> > as that happens only later after the MEM_GOING_ONLINE callback.
> > Instead of using get/put_online_mems(), the problem can be solved by SLUB
> > maintaining its own nodemask of nodes for which it has allocated the
> > per-node kmem_cache_node structures. This nodemask would generally mirror
> > the N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's
> > control in its memory hotplug callbacks under the slab_mutex. This patch
> > adds such nodemask and its handling.
> > Commit 03afc0e25f7f mentions "issues like [the one above]", but there
> > don't appear to be further issues. All the paths (shared for SLAB and
> > SLUB) taking the memory hotplug locks are also taking the slab_mutex,
> > except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
> > get/put_online_mems().
> > We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
> > SLUB can enter the function from a write to sysfs 'shrink' file, thus
> > holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
> > within slab_mutex. But on closer inspection we don't actually need to
> > protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
> > __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
> > added in parallel hotplug is not fatal, and parallel hotremove does not
> > free kmem_cache_node's anymore after the previous patch, so use-after free
> > cannot happen. The per-node shrinking itself is protected by
> > n->list_lock. Same is true for SLAB, and SLOB is no-op.
> > SLAB also doesn't need the memory hotplug locking, which it only gained by
> > 03afc0e25f7f through the shared paths in slab_common.c. Its memory
> > hotplug callbacks are also protected by slab_mutex against races with
> > these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
> > to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
> > node is already set there during the MEM_GOING_ONLINE callback, so no
> > special care is needed for SLAB.
> > As such, this patch removes all get/put_online_mems() usage by the slab
> > subsystem.
> > Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: Qian Cai <cai@redhat.com>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> >
> > >
> > > From 518a2b83a9c5bd85d74ddabbc36ce5d181a88ed6 Mon Sep 17 00:00:00 2001
> > > From: David Hildenbrand <david@redhat.com>
> > > Date: Tue, 8 Apr 2025 12:16:13 +0200
> > > Subject: [PATCH] tmp
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > > mm/slub.c | 56 ++++---------------------------------------------------
> > > 1 file changed, 4 insertions(+), 52 deletions(-)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index b46f87662e71d..afe31149e7f4e 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -445,14 +445,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> > > for (__node = 0; __node < nr_node_ids; __node++) \
> > > if ((__n = get_node(__s, __node)))
> > > -/*
> > > - * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
> > > - * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
> > > - * differ during memory hotplug/hotremove operations.
> > > - * Protected by slab_mutex.
> > > - */
> > > -static nodemask_t slab_nodes;
> > > -
> > > #ifndef CONFIG_SLUB_TINY
> > > /*
> > > * Workqueue used for flush_cpu_slab().
> > > @@ -3706,10 +3698,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > if (!slab) {
> > > /*
> > > * if the node is not online or has no normal memory, just
> > > - * ignore the node constraint
> > > + * ignore the node constraint.
> > > */
> > > - if (unlikely(node != NUMA_NO_NODE &&
> > > - !node_isset(node, slab_nodes)))
> > > + if (unlikely(node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY)))
> > > node = NUMA_NO_NODE;
> > > goto new_slab;
> > > }
> > > @@ -3719,7 +3710,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > * same as above but node_match() being false already
> > > * implies node != NUMA_NO_NODE
> > > */
> > > - if (!node_isset(node, slab_nodes)) {
> > > + if (!node_state(node, N_NORMAL_MEMORY)) {
> > > node = NUMA_NO_NODE;
> > > } else {
> > > stat(s, ALLOC_NODE_MISMATCH);
> > > @@ -5623,7 +5614,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> > > {
> > > int node;
> > > - for_each_node_mask(node, slab_nodes) {
> > > + for_each_node_state(node, N_NORMAL_MEMORY) {
> > > struct kmem_cache_node *n;
> > > if (slab_state == DOWN) {
> > > @@ -6164,30 +6155,6 @@ static int slab_mem_going_offline_callback(void *arg)
> > > return 0;
> > > }
> > > -static void slab_mem_offline_callback(void *arg)
> > > -{
> > > - struct memory_notify *marg = arg;
> > > - int offline_node;
> > > -
> > > - offline_node = marg->status_change_nid_normal;
> > > -
> > > - /*
> > > - * If the node still has available memory. we need kmem_cache_node
> > > - * for it yet.
> > > - */
> > > - if (offline_node < 0)
> > > - return;
> > > -
> > > - mutex_lock(&slab_mutex);
> > > - node_clear(offline_node, slab_nodes);
> > > - /*
> > > - * We no longer free kmem_cache_node structures here, as it would be
> > > - * racy with all get_node() users, and infeasible to protect them with
> > > - * slab_mutex.
> > > - */
> > > - mutex_unlock(&slab_mutex);
> > > -}
> > > -
> > > static int slab_mem_going_online_callback(void *arg)
> > > {
> > > struct kmem_cache_node *n;
> > > @@ -6229,11 +6196,6 @@ static int slab_mem_going_online_callback(void *arg)
> > > init_kmem_cache_node(n);
> > > s->node[nid] = n;
> > > }
> > > - /*
> > > - * Any cache created after this point will also have kmem_cache_node
> > > - * initialized for the new node.
> > > - */
> > > - node_set(nid, slab_nodes);
> > > out:
> > > mutex_unlock(&slab_mutex);
> > > return ret;
> > > @@ -6253,8 +6215,6 @@ static int slab_memory_callback(struct notifier_block *self,
> > > break;
> > > case MEM_OFFLINE:
> > > case MEM_CANCEL_ONLINE:
> > > - slab_mem_offline_callback(arg);
> > > - break;
> > > case MEM_ONLINE:
> > > case MEM_CANCEL_OFFLINE:
> > > break;
> > > @@ -6309,7 +6269,6 @@ void __init kmem_cache_init(void)
> > > {
> > > static __initdata struct kmem_cache boot_kmem_cache,
> > > boot_kmem_cache_node;
> > > - int node;
> > > if (debug_guardpage_minorder())
> > > slub_max_order = 0;
> > > @@ -6321,13 +6280,6 @@ void __init kmem_cache_init(void)
> > > kmem_cache_node = &boot_kmem_cache_node;
> > > kmem_cache = &boot_kmem_cache;
> > > - /*
> > > - * Initialize the nodemask for which we will allocate per node
> > > - * structures. Here we don't need taking slab_mutex yet.
> > > - */
> > > - for_each_node_state(node, N_NORMAL_MEMORY)
> > > - node_set(node, slab_nodes);
> > > -
> > > create_boot_cache(kmem_cache_node, "kmem_cache_node",
> > > sizeof(struct kmem_cache_node),
> > > SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
> > > --
> > > 2.48.1
> > >
> > >
> > > Not sure if there are any races to consider ... just an idea.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> --
> Cheers,
>
> David / dhildenb
>
>
--
Cheers,
Harry (Hyeonggon is my Korean name)
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 12:49 ` Oscar Salvador
@ 2025-04-08 15:15 ` Harry Yoo
0 siblings, 0 replies; 17+ messages in thread
From: Harry Yoo @ 2025-04-08 15:15 UTC (permalink / raw)
To: Oscar Salvador
Cc: David Hildenbrand, Andrew Morton, linux-mm, linux-kernel,
Vlastimil Babka, Jonathan Cameron, linux-cxl
On Tue, Apr 08, 2025 at 02:49:43PM +0200, Oscar Salvador wrote:
> On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
> > Feel free to add a Suggested-by: if you think it applies.
>
> Sorry David, my bad, totally missed it.
> I shall add it.
>
> > Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init() ? Likely it
> > would have to be a N_MEMORY check.
>
> Yes, should be N_MEMORY.
>
> > But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
>
> I glanced over it and I did not see anything wrong with it, just a
> question below.
>
> So, if Vlastimil and Harry think this is fine, we can indeed do this.
Looks fine to me.
> If so, I would combine this with patch #1 of this series and add
> your Signed-off-by as co-author. Is that fine by you?
> > @@ -3706,10 +3698,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > if (!slab) {
> > /*
> > * if the node is not online or has no normal memory, just
> > - * ignore the node constraint
> > + * ignore the node constraint.
> > */
> > - if (unlikely(node != NUMA_NO_NODE &&
> > - !node_isset(node, slab_nodes)))
> > + if (unlikely(node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY)))
> > node = NUMA_NO_NODE;
>
> After my first patch, slab_nodes will also contain N_MEMORY nodes, which
> makes me think whether that check should be N_MEMORY?
Assuming that it's not frequent to call kmem_cache_alloc_node()/kmalloc_node()
with a nid whose node has no normal memory, either way looks fine to me.
If the N_MEMORY check says the node has some memory but no normal memory,
the allocation will fail anyway because the kernel can't allocate pages
from the buddy allocator.
But why not fall back to other nodes early when the kernel knows
it has no normal memory?
> --
> Oscar Salvador
> SUSE Labs
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 14:25 ` David Hildenbrand
2025-04-08 14:54 ` Harry Yoo
@ 2025-04-08 17:55 ` Vlastimil Babka
2025-04-08 18:18 ` David Hildenbrand
1 sibling, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2025-04-08 17:55 UTC (permalink / raw)
To: David Hildenbrand, Harry Yoo
Cc: Oscar Salvador, Andrew Morton, linux-mm, linux-kernel,
Jonathan Cameron, linux-cxl
On 4/8/25 16:25, David Hildenbrand wrote:
> On 08.04.25 16:18, Harry Yoo wrote:
>> On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
>>> On 08.04.25 10:41, Oscar Salvador wrote:
>>>> Currently, slab_mem_going_online_callback() checks whether the node has
>>>> N_NORMAL memory in order to be set in slab_nodes.
>>>> While it is true that getting rid of that enforcement would mean
>>>> ending up with movable nodes in slab_nodes, the memory waste that comes
>>>> with that is negligible.
>>>>
>>>> So stop checking for status_change_nid_normal and just use status_change_nid
>>>> instead which works for both types of memory.
>>>>
>>>> Also, once we allocate the kmem_cache_node cache for the node in
>>>> slab_mem_going_online_callback(), we never deallocate it in
>>>> slab_mem_offline_callback() when the node goes memoryless, so we can just
>>>> get rid of it.
>>>>
>>>> The only side effect is that we will stop clearing the node from slab_nodes.
>>>>
>>>
>>> Feel free to add a Suggested-by: if you think it applies.
>>>
>>>
>>> Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init() ? Likely it
>>> would have to be a N_MEMORY check.
>>>
>>>
>>> But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
>>
>> The following commit says that SLUB has slab_nodes thingy for a reason...
>> kmem_cache_node might not be ready yet even when N_NORMAL_MEMORY check
>> says it now has normal memory.
>
> node_states_set_node() is called from memory hotplug code after
> MEM_GOING_ONLINE and after online_pages_range().
>
> Pages might be isolated at that point, but node_states_set_node() is set
> only after the memory notifier (MEM_GOING_ONLINE) was triggered.
>
> So I don't immediately see the problem assuming that we never free the
> structures.
Hm, I think it still exists. So consider:
- slab_mem_going_online_callback() creates kmem_cache_node for the new node
for existing caches under the mutex
Then a cache creation races with "node_states_set_node() is set only after
the memory notifier (MEM_GOING_ONLINE) was triggered" in a way that it
doesn't see the node set yet - kmem_cache_node for the new node on the new
cache will be missing.
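Spelled out as an interleaving (same thing, just as a timeline, assuming
slab_nodes is gone and init_kmem_cache_nodes() iterates N_NORMAL_MEMORY as
in the diff above):

  CPU0: online_pages() for node N         CPU1: kmem_cache_create()
  -------------------------------         -------------------------
  memory_notify(MEM_GOING_ONLINE)
    slab_mem_going_online_callback()
      allocates kmem_cache_node for N
      for all *existing* caches
      (under slab_mutex)
                                          init_kmem_cache_nodes()
                                            iterates N_NORMAL_MEMORY,
                                            N is not set there yet, so
                                            the new cache gets no
                                            kmem_cache_node for N
  node_states_set_node(N, ...)
    N is now in N_NORMAL_MEMORY, but
    the new cache has s->node[N] == NULL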
The root issue is node_states_set_node() doesn't happen under slab_mutex.
slab_nodes update happens under the slab_mutex to avoid this race.
And once we have slab_nodes for this cache-creation vs. hotplug
synchronization, it's natural to also use it for the ___slab_alloc() checks,
although they could indeed now use node_state(node, N_NORMAL_MEMORY) once we
create the node structures and set bits in slab_nodes for N_MEMORY.
Maybe it could be solved at the memory hotplug level if we had a new
state, e.g. N_GOING_ONLINE, that was set before calling the MEM_GOING_ONLINE
callbacks and was never reset. Then init_kmem_cache_nodes() could iterate on
that. But I'm not sure if slab_mutex would provide the necessary synchronization
guarantees here so that the addition of a node via MEM_GOING_ONLINE can't be
missed, hm.
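As a very rough sketch of that idea (N_GOING_ONLINE is purely hypothetical,
it does not exist today, and this hand-waves over the slab_state == DOWN
boot case and over whether slab_mutex gives the needed ordering):

	/* memory hotplug side, early in online_pages(), before the notifiers: */
	node_set_state(nid, N_GOING_ONLINE);	/* never cleared again */
	ret = memory_notify(MEM_GOING_ONLINE, &arg);

	/* slub side, init_kmem_cache_nodes() iterating the new state: */
	for_each_node_state(node, N_GOING_ONLINE) {
		struct kmem_cache_node *n;

		n = kmem_cache_alloc_node(kmem_cache_node, GFP_KERNEL, node);
		if (!n) {
			free_kmem_cache_nodes(s);
			return 0;
		}
		init_kmem_cache_node(n);
		s->node[node] = n;
	}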
> But yeah, this is what I raised below: "Not sure if there are any races
> to consider" :)
>
>>
>> @Vlastimil maybe a dumb question but why not check s->node[nid]
>> instead of having slab_nodes bitmask?
You mean checking it, e.g. in ___slab_alloc()? We'd still be prone to miss a
kmem_cache_node creation and degrade the particular cache that way; we'd
just avoid a NULL pointer dereference.
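Just for reference, such a check could look like this (a sketch, not
something that was posted; get_node() is simply s->node[node]):

	/*
	 * Only guards against dereferencing a missing kmem_cache_node;
	 * a cache created while the node was coming online may still
	 * lack s->node[node] and would silently lose the node preference.
	 */
	if (unlikely(node != NUMA_NO_NODE && !get_node(s, node)))
		node = NUMA_NO_NODE;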
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 17:55 ` Vlastimil Babka
@ 2025-04-08 18:18 ` David Hildenbrand
2025-04-30 8:47 ` Oscar Salvador
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-04-08 18:18 UTC (permalink / raw)
To: Vlastimil Babka, Harry Yoo
Cc: Oscar Salvador, Andrew Morton, linux-mm, linux-kernel,
Jonathan Cameron, linux-cxl
On 08.04.25 19:55, Vlastimil Babka wrote:
> On 4/8/25 16:25, David Hildenbrand wrote:
>> On 08.04.25 16:18, Harry Yoo wrote:
>>> On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
>>>> On 08.04.25 10:41, Oscar Salvador wrote:
>>>>> Currently, slab_mem_going_online_callback() checks whether the node has
>>>>> N_NORMAL memory in order to be set in slab_nodes.
>>>>> While it is true that getting rid of that enforcement would mean
>>>>> ending up with movable nodes in slab_nodes, the memory waste that comes
>>>>> with that is negligible.
>>>>>
>>>>> So stop checking for status_change_nid_normal and just use status_change_nid
>>>>> instead which works for both types of memory.
>>>>>
>>>>> Also, once we allocate the kmem_cache_node cache for the node in
>>>>> slab_mem_going_online_callback(), we never deallocate it in
>>>>> slab_mem_offline_callback() when the node goes memoryless, so we can just
>>>>> get rid of it.
>>>>>
>>>>> The only side effect is that we will stop clearing the node from slab_nodes.
>>>>>
>>>>
>>>> Feel free to add a Suggested-by: if you think it applies.
>>>>
>>>>
>>>> Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init() ? Likely it
>>>> would have to be a N_MEMORY check.
>>>>
>>>>
>>>> But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
>>>
>>> The following commit says that SLUB has slab_nodes thingy for a reason...
>>> kmem_cache_node might not be ready yet even when N_NORMAL_MEMORY check
>>> says it now has normal memory.
>>
>> node_states_set_node() is called from memory hotplug code after
>> MEM_GOING_ONLINE and after online_pages_range().
>>
>> Pages might be isolated at that point, but node_states_set_node() is set
>> only after the memory notifier (MEM_GOING_ONLINE) was triggered.
>>
>> So I don't immediately see the problem assuming that we never free the
>> structures.
>
> Hm, I think it still exists. So consider:
Ah, now I understand the problem, thanks for pointing that out.
>
> - slab_mem_going_online_callback() creates kmem_cache_node for the new node
> for existing caches under the mutex
>
> Then a cache creation races with "node_states_set_node() is set only after
> the memory notifier (MEM_GOING_ONLINE) was triggered" in a way that it
> doesn't see the node set yet - kmem_cache_node for the new node on the new
> cache will be missing.
>
> The root issue is node_states_set_node() doesn't happen under slab_mutex.
> slab_nodes update happens under the slab_mutex to avoid this race.
We could do that by grabbing the mutex in MEM_GOING_ONLINE and putting it in
MEM_CANCEL_ONLINE / MEM_ONLINE when a node is going online.
We use something similar in virtio-mem code to sync over the whole
onlining period.
Of course, if someone tried to create a cache from inside a notifier,
it would be problematic (unlikely, but worth pointing out).
Onlining a node is not a particularly common operation, so maybe ...
good enough.
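Roughly something like this (only a sketch, not tested; the extra flag is
there so that a MEM_CANCEL_ONLINE after a failed GOING_ONLINE does not
unlock a mutex we never took; the notifier chain is serialized by the
hotplug lock, so a plain static flag is good enough for the sketch):

	static bool slab_mutex_held_for_online;

	static int slab_memory_callback(struct notifier_block *self,
					unsigned long action, void *arg)
	{
		struct memory_notify *marg = arg;
		int ret = 0;

		switch (action) {
		case MEM_GOING_ONLINE:
			ret = slab_mem_going_online_callback(arg);
			if (!ret && marg->status_change_nid >= 0) {
				/* Block cache creation until onlining finished. */
				mutex_lock(&slab_mutex);
				slab_mutex_held_for_online = true;
			}
			break;
		case MEM_ONLINE:
		case MEM_CANCEL_ONLINE:
			if (slab_mutex_held_for_online) {
				mutex_unlock(&slab_mutex);
				slab_mutex_held_for_online = false;
			}
			break;
		case MEM_GOING_OFFLINE:
			ret = slab_mem_going_offline_callback(arg);
			break;
		case MEM_OFFLINE:
		case MEM_CANCEL_OFFLINE:
			break;
		}
		return notifier_from_errno(ret);
	}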
Let me think about other alternatives that require minimal adjustments.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/3] mm,memory_hotplug: Implement numa node notifier
2025-04-08 8:41 ` [PATCH v2 2/3] mm,memory_hotplug: Implement numa node notifier Oscar Salvador
@ 2025-04-09 13:44 ` kernel test robot
2025-04-09 16:58 ` Oscar Salvador
0 siblings, 1 reply; 17+ messages in thread
From: kernel test robot @ 2025-04-09 13:44 UTC (permalink / raw)
To: Oscar Salvador, Andrew Morton
Cc: llvm, oe-kbuild-all, Linux Memory Management List,
David Hildenbrand, linux-kernel, Vlastimil Babka, Harry Yoo,
Jonathan Cameron, linux-cxl, Oscar Salvador
Hi Oscar,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20250409]
[cannot apply to rafael-pm/linux-next rafael-pm/bleeding-edge driver-core/driver-core-testing driver-core/driver-core-next driver-core/driver-core-linus tj-cgroup/for-next linus/master vbabka-slab/for-next v6.15-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Oscar-Salvador/mm-slub-Do-not-special-case-N_NORMAL-nodes-for-slab_nodes/20250408-164417
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250408084153.255762-3-osalvador%40suse.de
patch subject: [PATCH v2 2/3] mm,memory_hotplug: Implement numa node notifier
config: arm64-randconfig-001-20250409 (https://download.01.org/0day-ci/archive/20250409/202504092104.MyHeSV43-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 92c93f5286b9ff33f27ff694d2dc33da1c07afdd)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250409/202504092104.MyHeSV43-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504092104.MyHeSV43-lkp@intel.com/
All errors (new ones prefixed by >>):
>> drivers/base/node.c:115:5: error: redefinition of 'register_node_notifier'
115 | int register_node_notifier(struct notifier_block *nb)
| ^
include/linux/memory.h:172:19: note: previous definition is here
172 | static inline int register_node_notifier(struct notifier_block *nb)
| ^
>> drivers/base/node.c:121:6: error: redefinition of 'unregister_node_notifier'
121 | void unregister_node_notifier(struct notifier_block *nb)
| ^
include/linux/memory.h:176:20: note: previous definition is here
176 | static inline void unregister_node_notifier(struct notifier_block *nb)
| ^
>> drivers/base/node.c:127:5: error: redefinition of 'node_notify'
127 | int node_notify(unsigned long val, void *v)
| ^
include/linux/memory.h:179:19: note: previous definition is here
179 | static inline int node_notify(unsigned long val, void *v)
| ^
3 errors generated.
vim +/register_node_notifier +115 drivers/base/node.c
114
> 115 int register_node_notifier(struct notifier_block *nb)
116 {
117 return blocking_notifier_chain_register(&node_chain, nb);
118 }
119 EXPORT_SYMBOL(register_node_notifier);
120
> 121 void unregister_node_notifier(struct notifier_block *nb)
122 {
123 blocking_notifier_chain_unregister(&node_chain, nb);
124 }
125 EXPORT_SYMBOL(unregister_node_notifier);
126
> 127 int node_notify(unsigned long val, void *v)
128 {
129 return blocking_notifier_call_chain(&node_chain, val, v);
130 }
131
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/3] mm,memory_hotplug: Implement numa node notifier
2025-04-09 13:44 ` kernel test robot
@ 2025-04-09 16:58 ` Oscar Salvador
0 siblings, 0 replies; 17+ messages in thread
From: Oscar Salvador @ 2025-04-09 16:58 UTC (permalink / raw)
To: kernel test robot
Cc: Andrew Morton, llvm, oe-kbuild-all, Linux Memory Management List,
David Hildenbrand, linux-kernel, Vlastimil Babka, Harry Yoo,
Jonathan Cameron, linux-cxl
On Wed, Apr 09, 2025 at 09:44:52PM +0800, kernel test robot wrote:
> Hi Oscar,
>
> kernel test robot noticed the following build errors:
...
> >> drivers/base/node.c:115:5: error: redefinition of 'register_node_notifier'
> 115 | int register_node_notifier(struct notifier_block *nb)
> | ^
> include/linux/memory.h:172:19: note: previous definition is here
> 172 | static inline int register_node_notifier(struct notifier_block *nb)
> | ^
> >> drivers/base/node.c:121:6: error: redefinition of 'unregister_node_notifier'
> 121 | void unregister_node_notifier(struct notifier_block *nb)
> | ^
> include/linux/memory.h:176:20: note: previous definition is here
> 176 | static inline void unregister_node_notifier(struct notifier_block *nb)
> | ^
> >> drivers/base/node.c:127:5: error: redefinition of 'node_notify'
> 127 | int node_notify(unsigned long val, void *v)
> | ^
> include/linux/memory.h:179:19: note: previous definition is here
> 179 | static inline int node_notify(unsigned long val, void *v)
> | ^
> 3 errors generated.
Ah, I see. When CONFIG_MEMORY_HOTPLUG=n, those static inline stubs come into play.
That is not a problem for the memory-notify thing because drivers/base/memory.c
gets compiled IFF CONFIG_MEMORY_HOTPLUG=y.
I am thinking of two ways to fix this:
1) Move the code for node-notify to drivers/base/memory.c
2) Surround those functions within a
#ifdef CONFIG_MEMORY_HOTPLUG
...
#endif
Thoughts? I lean towards option #2 as it looks cleaner to me:
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 182c71dfb5b8..3b084d71888a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -110,6 +110,7 @@ static const struct attribute_group *node_access_node_groups[] = {
NULL,
};
+#ifdef CONFIG_MEMORY_HOTPLUG
static BLOCKING_NOTIFIER_HEAD(node_chain);
int register_node_notifier(struct notifier_block *nb)
@@ -128,6 +129,7 @@ int node_notify(unsigned long val, void *v)
{
return blocking_notifier_call_chain(&node_chain, val, v);
}
+#endif
static void node_remove_accesses(struct node *node)
{
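For completeness, the include/linux/memory.h side would then presumably look
along these lines (reconstructed from the build log above, simplified; the
actual declarations are whatever the patch carries):

	#ifdef CONFIG_MEMORY_HOTPLUG
	int register_node_notifier(struct notifier_block *nb);
	void unregister_node_notifier(struct notifier_block *nb);
	int node_notify(unsigned long val, void *v);
	#else
	static inline int register_node_notifier(struct notifier_block *nb)
	{
		return 0;
	}
	static inline void unregister_node_notifier(struct notifier_block *nb)
	{
	}
	static inline int node_notify(unsigned long val, void *v)
	{
		return 0;
	}
	#endif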
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-08 18:18 ` David Hildenbrand
@ 2025-04-30 8:47 ` Oscar Salvador
2025-04-30 8:57 ` Vlastimil Babka
0 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2025-04-30 8:47 UTC (permalink / raw)
To: David Hildenbrand
Cc: Vlastimil Babka, Harry Yoo, Andrew Morton, linux-mm,
linux-kernel, Jonathan Cameron, linux-cxl
On Tue, Apr 08, 2025 at 08:18:32PM +0200, David Hildenbrand wrote:
> We could do that by grabbing the mutex in MEM_GOING_ONLINE and putting it in
> MEM_CANCEL_ONLINE / MEM_ONLINE when a node is going online.
Hi guys,
After a few busy days I have been revisiting this.
I checked the proposal of slab_mutex spanning GOING_ONLINE <-> ONLINE,
and I did not see any issue.
The only concern I had is that we might be calling some slab function from
{online,offline}_pages() that also takes the mutex.
I am not aware of any though, and quickly checking did not reveal
anything either.
If there is any, it will be quickly revealed though :-).
So, unless there is an opposition, I can move forward and see how it
looks.
Thoughts?
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-30 8:47 ` Oscar Salvador
@ 2025-04-30 8:57 ` Vlastimil Babka
2025-04-30 9:02 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2025-04-30 8:57 UTC (permalink / raw)
To: Oscar Salvador, David Hildenbrand
Cc: Harry Yoo, Andrew Morton, linux-mm, linux-kernel,
Jonathan Cameron, linux-cxl
On 4/30/25 10:47, Oscar Salvador wrote:
> On Tue, Apr 08, 2025 at 08:18:32PM +0200, David Hildenbrand wrote:
>> We could do that by grabbing the mutex in MEM_GOING_ONLINE and putting it in
>> MEM_CANCEL_ONLINE / MEM_ONLINE when a node is going online.
>
> Hi guys,
>
> After a few busy days I have been revisiting this.
>
> I checked the proposal of slab_mutex spanning GOING_ONLINE <-> ONLINE,
> and I did not see any issue.
>
> The only concern I had is that we might be calling some slab function from
> {online,offline}_pages() that also takes the mutex.
> I am not aware of any though, and quickly checking did not reveal
> anything either.
>
> If there is any, it will be quickly revealed though :-).
>
> So, unless there is an opposition, I can move forward and see how it
> looks.
>
> Thoughts?
I feel a bit uneasy about it, while maintaining slab_nodes doesn't seem like
a huge issue to me, so dunno. Maybe David has a better idea now :)
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-04-30 8:57 ` Vlastimil Babka
@ 2025-04-30 9:02 ` David Hildenbrand
0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-04-30 9:02 UTC (permalink / raw)
To: Vlastimil Babka, Oscar Salvador
Cc: Harry Yoo, Andrew Morton, linux-mm, linux-kernel,
Jonathan Cameron, linux-cxl
On 30.04.25 10:57, Vlastimil Babka wrote:
> On 4/30/25 10:47, Oscar Salvador wrote:
>> On Tue, Apr 08, 2025 at 08:18:32PM +0200, David Hildenbrand wrote:
>>> We could do that by grabbing the mutex in MEM_GOING_ONLINE and putting it in
>>> MEM_CANCEL_ONLINE / MEM_ONLINE when a node is going online.
>>
>> Hi guys,
>>
>> After a few busy days I have been revisiting this.
>>
>> I checked the proposal of slab_mutex spanning GOING_ONLINE <-> ONLINE,
>> and I did not see any issue.
>>
>> The only concern I had is that we might be calling some slab function from
>> {online,offline}_pages() that also takes the mutex.
>> I am not aware of any though, and quickly checking did not reveal
>> anything either.
>>
>> If there is any, it will be quickly revealed though :-).
>>
>> So, unless there is an opposition, I can move forward and see how it
>> looks.
>>
>> Thoughts?
>
> I feel a bit uneasy about it, while maintaining slab_nodes doesn't seem like
> a huge issue to me, so dunno. Maybe David has a better idea now :)
Yeah, let's avoid the slab_mutex change for now.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-04-30 9:02 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-08 8:41 [PATCH v2 0/3] Implement numa node notifier Oscar Salvador
2025-04-08 8:41 ` [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
2025-04-08 10:17 ` David Hildenbrand
2025-04-08 12:49 ` Oscar Salvador
2025-04-08 15:15 ` Harry Yoo
2025-04-08 14:18 ` Harry Yoo
2025-04-08 14:25 ` David Hildenbrand
2025-04-08 14:54 ` Harry Yoo
2025-04-08 17:55 ` Vlastimil Babka
2025-04-08 18:18 ` David Hildenbrand
2025-04-30 8:47 ` Oscar Salvador
2025-04-30 8:57 ` Vlastimil Babka
2025-04-30 9:02 ` David Hildenbrand
2025-04-08 8:41 ` [PATCH v2 2/3] mm,memory_hotplug: Implement numa node notifier Oscar Salvador
2025-04-09 13:44 ` kernel test robot
2025-04-09 16:58 ` Oscar Salvador
2025-04-08 8:41 ` [PATCH v2 3/3] mm,memory_hotplug: Rename status_change_nid parameter in memory_notify Oscar Salvador
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox