* [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot
@ 2019-05-12 5:48 Qian Cai
2019-05-13 12:41 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Qian Cai @ 2019-05-12 5:48 UTC (permalink / raw)
To: akpm
Cc: mhocko, brho, kernelfans, dave.hansen, rppt, peterz, mpe, mingo,
osalvador, luto, tglx, linux-mm, linux-kernel, Qian Cai
The linux-next commit ("x86, numa: always initialize all possible
nodes") introduced a crash below during boot for systems with a
memory-less node. This is due to CPUs that get onlined during SMP boot,
but that onlining triggers a page fault in bus_add_device() during
device registration:
error = sysfs_create_link(&bus->p->devices_kset->kobj,
bus->p is NULL. That "p" is the subsys_private struct, and it should
have been set in,
postcore_initcall(register_node_type);
but that happens in do_basic_setup() after smp_init().
The old code had set this node online via alloc_node_data(), so when it
came time to do_cpu_up() -> try_online_node(), the node was already up
and nothing happened.
Now, it attempts to online the node, which registers the node with
sysfs, but that can't happen before the 'node' subsystem is registered.
Since kernel_init() is running by a kernel thread that is in
SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs
during the early boot in __try_online_node().
Call Trace:
device_add+0x43e/0x690
device_register+0x107/0x110
__register_one_node+0x72/0x150
__try_online_node+0x8f/0xd0
try_online_node+0x2b/0x50
do_cpu_up+0x46/0xf0
cpu_up+0x13/0x20
smp_init+0x6e/0xd0
kernel_init_freeable+0xe5/0x21f
kernel_init+0xf/0x180
ret_from_fork+0x1f/0x30
Reported-by: Barret Rhoden <brho@google.com>
Signed-off-by: Qian Cai <cai@lca.pw>
---
v2: Set the node online as it have CPUs. Otherwise, those memory-less nodes will
end up being not in sysfs i.e., /sys/devices/system/node/.
mm/memory_hotplug.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b236069ff0d8..6eb2331fa826 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1037,6 +1037,18 @@ static int __try_online_node(int nid, u64 start, bool set_node_online)
if (node_online(nid))
return 0;
+ /*
+ * Here is called by cpu_up() to online a node without memory from
+ * kernel_init() which guarantees that "set_node_online" is true which
+ * will set the node online as it have CPUs but not ready to call
+ * register_one_node() as "node_subsys" has not been initialized
+ * properly yet.
+ */
+ if (system_state == SYSTEM_SCHEDULING) {
+ node_set_online(nid);
+ return 0;
+ }
+
pgdat = hotadd_new_pgdat(nid, start);
if (!pgdat) {
pr_err("Cannot online node %d due to NULL pgdat\n", nid);
--
2.20.1 (Apple Git-117)
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-12 5:48 [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot Qian Cai @ 2019-05-13 12:41 ` Michal Hocko 2019-05-13 13:43 ` Qian Cai 2019-06-21 13:17 ` Qian Cai 0 siblings, 2 replies; 21+ messages in thread From: Michal Hocko @ 2019-05-13 12:41 UTC (permalink / raw) To: Qian Cai Cc: akpm, brho, kernelfans, dave.hansen, rppt, peterz, mpe, mingo, osalvador, luto, tglx, linux-mm, linux-kernel On Sun 12-05-19 01:48:29, Qian Cai wrote: > The linux-next commit ("x86, numa: always initialize all possible > nodes") introduced a crash below during boot for systems with a > memory-less node. This is due to CPUs that get onlined during SMP boot, > but that onlining triggers a page fault in bus_add_device() during > device registration: > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > bus->p is NULL. That "p" is the subsys_private struct, and it should > have been set in, > > postcore_initcall(register_node_type); > > but that happens in do_basic_setup() after smp_init(). > > The old code had set this node online via alloc_node_data(), so when it > came time to do_cpu_up() -> try_online_node(), the node was already up > and nothing happened. > > Now, it attempts to online the node, which registers the node with > sysfs, but that can't happen before the 'node' subsystem is registered. > > Since kernel_init() is running by a kernel thread that is in > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > during the early boot in __try_online_node(). Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply drop try_online_node from do_cpu_up? Your v2 remark below suggests that we need to call node_set_online because something later on depends on that. Btw. why do we even allocate a pgdat from this path? This looks really messy. > Call Trace: > device_add+0x43e/0x690 > device_register+0x107/0x110 > __register_one_node+0x72/0x150 > __try_online_node+0x8f/0xd0 > try_online_node+0x2b/0x50 > do_cpu_up+0x46/0xf0 > cpu_up+0x13/0x20 > smp_init+0x6e/0xd0 > kernel_init_freeable+0xe5/0x21f > kernel_init+0xf/0x180 > ret_from_fork+0x1f/0x30 > > Reported-by: Barret Rhoden <brho@google.com> > Signed-off-by: Qian Cai <cai@lca.pw> > --- > > v2: Set the node online as it have CPUs. Otherwise, those memory-less nodes will > end up being not in sysfs i.e., /sys/devices/system/node/. > > mm/memory_hotplug.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index b236069ff0d8..6eb2331fa826 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1037,6 +1037,18 @@ static int __try_online_node(int nid, u64 start, bool set_node_online) > if (node_online(nid)) > return 0; > > + /* > + * Here is called by cpu_up() to online a node without memory from > + * kernel_init() which guarantees that "set_node_online" is true which > + * will set the node online as it have CPUs but not ready to call > + * register_one_node() as "node_subsys" has not been initialized > + * properly yet. > + */ > + if (system_state == SYSTEM_SCHEDULING) { > + node_set_online(nid); > + return 0; > + } > + > pgdat = hotadd_new_pgdat(nid, start); > if (!pgdat) { > pr_err("Cannot online node %d due to NULL pgdat\n", nid); > -- > 2.20.1 (Apple Git-117) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-13 12:41 ` Michal Hocko @ 2019-05-13 13:43 ` Qian Cai 2019-05-13 14:04 ` Michal Hocko 2019-06-21 13:17 ` Qian Cai 1 sibling, 1 reply; 21+ messages in thread From: Qian Cai @ 2019-05-13 13:43 UTC (permalink / raw) To: Michal Hocko Cc: akpm, brho, kernelfans, dave.hansen, rppt, peterz, mpe, mingo, osalvador, luto, tglx, linux-mm, linux-kernel On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > The linux-next commit ("x86, numa: always initialize all possible > > nodes") introduced a crash below during boot for systems with a > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > but that onlining triggers a page fault in bus_add_device() during > > device registration: > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > have been set in, > > > > postcore_initcall(register_node_type); > > > > but that happens in do_basic_setup() after smp_init(). > > > > The old code had set this node online via alloc_node_data(), so when it > > came time to do_cpu_up() -> try_online_node(), the node was already up > > and nothing happened. > > > > Now, it attempts to online the node, which registers the node with > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > Since kernel_init() is running by a kernel thread that is in > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > during the early boot in __try_online_node(). > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > we need to call node_set_online because something later on depends on > that. Btw. why do we even allocate a pgdat from this path? This looks > really messy. See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before local memory online") It looks like try_online_node() in do_cpu_up() is needed for memory hotplug which is to put its node online if offlined and then hotadd_new_pgdat() calls build_all_zonelists() to initialize the zone list. > > > Call Trace: > > device_add+0x43e/0x690 > > device_register+0x107/0x110 > > __register_one_node+0x72/0x150 > > __try_online_node+0x8f/0xd0 > > try_online_node+0x2b/0x50 > > do_cpu_up+0x46/0xf0 > > cpu_up+0x13/0x20 > > smp_init+0x6e/0xd0 > > kernel_init_freeable+0xe5/0x21f > > kernel_init+0xf/0x180 > > ret_from_fork+0x1f/0x30 > > > > Reported-by: Barret Rhoden <brho@google.com> > > Signed-off-by: Qian Cai <cai@lca.pw> > > --- > > > > v2: Set the node online as it have CPUs. Otherwise, those memory-less nodes > > will > > end up being not in sysfs i.e., /sys/devices/system/node/. > > > > mm/memory_hotplug.c | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > > index b236069ff0d8..6eb2331fa826 100644 > > --- a/mm/memory_hotplug.c > > +++ b/mm/memory_hotplug.c > > @@ -1037,6 +1037,18 @@ static int __try_online_node(int nid, u64 start, bool > > set_node_online) > > if (node_online(nid)) > > return 0; > > > > + /* > > + * Here is called by cpu_up() to online a node without memory from > > + * kernel_init() which guarantees that "set_node_online" is true > > which > > + * will set the node online as it have CPUs but not ready to call > > + * register_one_node() as "node_subsys" has not been initialized > > + * properly yet. > > + */ > > + if (system_state == SYSTEM_SCHEDULING) { > > + node_set_online(nid); > > + return 0; > > + } > > + > > pgdat = hotadd_new_pgdat(nid, start); > > if (!pgdat) { > > pr_err("Cannot online node %d due to NULL pgdat\n", nid); > > -- > > 2.20.1 (Apple Git-117) > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-13 13:43 ` Qian Cai @ 2019-05-13 14:04 ` Michal Hocko 2019-05-13 15:20 ` Qian Cai 0 siblings, 1 reply; 21+ messages in thread From: Michal Hocko @ 2019-05-13 14:04 UTC (permalink / raw) To: Qian Cai Cc: akpm, brho, kernelfans, dave.hansen, rppt, peterz, mpe, mingo, osalvador, luto, tglx, linux-mm, linux-kernel On Mon 13-05-19 09:43:59, Qian Cai wrote: > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > > The linux-next commit ("x86, numa: always initialize all possible > > > nodes") introduced a crash below during boot for systems with a > > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > > but that onlining triggers a page fault in bus_add_device() during > > > device registration: > > > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > > have been set in, > > > > > > postcore_initcall(register_node_type); > > > > > > but that happens in do_basic_setup() after smp_init(). > > > > > > The old code had set this node online via alloc_node_data(), so when it > > > came time to do_cpu_up() -> try_online_node(), the node was already up > > > and nothing happened. > > > > > > Now, it attempts to online the node, which registers the node with > > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > > > Since kernel_init() is running by a kernel thread that is in > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > > during the early boot in __try_online_node(). > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > > we need to call node_set_online because something later on depends on > > that. Btw. why do we even allocate a pgdat from this path? This looks > > really messy. > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before local > memory online") > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug > which is to put its node online if offlined and then hotadd_new_pgdat() calls > build_all_zonelists() to initialize the zone list. Well, do we still have to followthe logic that the above (unreviewed) commit has established? The hotplug code in general made a lot of ad-hoc design decisions which had to be revisited over time. If we are not allocating pgdats for newly added memory then we should really make sure to do so at a proper time and hook. I am not sure about CPU vs. memory init ordering but even then I would really prefer if we could make the init less obscure and _documented_. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-13 14:04 ` Michal Hocko @ 2019-05-13 15:20 ` Qian Cai 2019-05-13 15:31 ` Michal Hocko 0 siblings, 1 reply; 21+ messages in thread From: Qian Cai @ 2019-05-13 15:20 UTC (permalink / raw) To: Michal Hocko Cc: akpm, brho, kernelfans, dave.hansen, rppt, peterz, mpe, mingo, osalvador, luto, tglx, linux-mm, linux-kernel On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote: > On Mon 13-05-19 09:43:59, Qian Cai wrote: > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > > > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > > > The linux-next commit ("x86, numa: always initialize all possible > > > > nodes") introduced a crash below during boot for systems with a > > > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > > > but that onlining triggers a page fault in bus_add_device() during > > > > device registration: > > > > > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > > > have been set in, > > > > > > > > postcore_initcall(register_node_type); > > > > > > > > but that happens in do_basic_setup() after smp_init(). > > > > > > > > The old code had set this node online via alloc_node_data(), so when it > > > > came time to do_cpu_up() -> try_online_node(), the node was already up > > > > and nothing happened. > > > > > > > > Now, it attempts to online the node, which registers the node with > > > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > > > > > Since kernel_init() is running by a kernel thread that is in > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > > > during the early boot in __try_online_node(). > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > > > we need to call node_set_online because something later on depends on > > > that. Btw. why do we even allocate a pgdat from this path? This looks > > > really messy. > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before > > local > > memory online") > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug > > which is to put its node online if offlined and then hotadd_new_pgdat() > > calls > > build_all_zonelists() to initialize the zone list. > > Well, do we still have to followthe logic that the above (unreviewed) > commit has established? The hotplug code in general made a lot of ad-hoc > design decisions which had to be revisited over time. If we are not > allocating pgdats for newly added memory then we should really make sure > to do so at a proper time and hook. I am not sure about CPU vs. memory > init ordering but even then I would really prefer if we could make the > init less obscure and _documented_. I don't know, but I think it is a good idea to keep the existing logic rather than do a big surgery unless someone is able to confirm it is not breaking NUMA node physical hotplug. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-13 15:20 ` Qian Cai @ 2019-05-13 15:31 ` Michal Hocko 2019-05-22 7:12 ` Pingfan Liu 0 siblings, 1 reply; 21+ messages in thread From: Michal Hocko @ 2019-05-13 15:31 UTC (permalink / raw) To: Qian Cai Cc: akpm, brho, kernelfans, dave.hansen, rppt, peterz, mpe, mingo, osalvador, luto, tglx, linux-mm, linux-kernel On Mon 13-05-19 11:20:46, Qian Cai wrote: > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote: > > On Mon 13-05-19 09:43:59, Qian Cai wrote: > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > > > > The linux-next commit ("x86, numa: always initialize all possible > > > > > nodes") introduced a crash below during boot for systems with a > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > > > > but that onlining triggers a page fault in bus_add_device() during > > > > > device registration: > > > > > > > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > > > > have been set in, > > > > > > > > > > postcore_initcall(register_node_type); > > > > > > > > > > but that happens in do_basic_setup() after smp_init(). > > > > > > > > > > The old code had set this node online via alloc_node_data(), so when it > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up > > > > > and nothing happened. > > > > > > > > > > Now, it attempts to online the node, which registers the node with > > > > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > > > > > > > Since kernel_init() is running by a kernel thread that is in > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > > > > during the early boot in __try_online_node(). > > > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > > > > we need to call node_set_online because something later on depends on > > > > that. Btw. why do we even allocate a pgdat from this path? This looks > > > > really messy. > > > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before > > > local > > > memory online") > > > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug > > > which is to put its node online if offlined and then hotadd_new_pgdat() > > > calls > > > build_all_zonelists() to initialize the zone list. > > > > Well, do we still have to followthe logic that the above (unreviewed) > > commit has established? The hotplug code in general made a lot of ad-hoc > > design decisions which had to be revisited over time. If we are not > > allocating pgdats for newly added memory then we should really make sure > > to do so at a proper time and hook. I am not sure about CPU vs. memory > > init ordering but even then I would really prefer if we could make the > > init less obscure and _documented_. > > I don't know, but I think it is a good idea to keep the existing logic rather > than do a big surgery Adding more hacks just doesn't make the situation any better. > unless someone is able to confirm it is not breaking NUMA > node physical hotplug. I have a machine to test whole node offline. I am just busy to prepare a patch myself. I can have it tested though. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-13 15:31 ` Michal Hocko @ 2019-05-22 7:12 ` Pingfan Liu 2019-05-22 11:16 ` Michal Hocko 0 siblings, 1 reply; 21+ messages in thread From: Pingfan Liu @ 2019-05-22 7:12 UTC (permalink / raw) To: Michal Hocko Cc: Qian Cai, Andrew Morton, brho, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Mon 13-05-19 11:20:46, Qian Cai wrote: > > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote: > > > On Mon 13-05-19 09:43:59, Qian Cai wrote: > > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > > > > > The linux-next commit ("x86, numa: always initialize all possible > > > > > > nodes") introduced a crash below during boot for systems with a > > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > > > > > but that onlining triggers a page fault in bus_add_device() during > > > > > > device registration: > > > > > > > > > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > > > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > > > > > have been set in, > > > > > > > > > > > > postcore_initcall(register_node_type); > > > > > > > > > > > > but that happens in do_basic_setup() after smp_init(). > > > > > > > > > > > > The old code had set this node online via alloc_node_data(), so when it > > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up > > > > > > and nothing happened. > > > > > > > > > > > > Now, it attempts to online the node, which registers the node with > > > > > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > > > > > > > > > Since kernel_init() is running by a kernel thread that is in > > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > > > > > during the early boot in __try_online_node(). > > > > > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > > > > > we need to call node_set_online because something later on depends on > > > > > that. Btw. why do we even allocate a pgdat from this path? This looks > > > > > really messy. > > > > > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before > > > > local > > > > memory online") > > > > > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug > > > > which is to put its node online if offlined and then hotadd_new_pgdat() > > > > calls > > > > build_all_zonelists() to initialize the zone list. > > > > > > Well, do we still have to followthe logic that the above (unreviewed) > > > commit has established? The hotplug code in general made a lot of ad-hoc > > > design decisions which had to be revisited over time. If we are not > > > allocating pgdats for newly added memory then we should really make sure > > > to do so at a proper time and hook. I am not sure about CPU vs. memory > > > init ordering but even then I would really prefer if we could make the > > > init less obscure and _documented_. > > > > I don't know, but I think it is a good idea to keep the existing logic rather > > than do a big surgery > > Adding more hacks just doesn't make the situation any better. > > > unless someone is able to confirm it is not breaking NUMA > > node physical hotplug. > > I have a machine to test whole node offline. I am just busy to prepare a > patch myself. I can have it tested though. > I think the definition of "node online" is worth of rethinking. Before patch "x86, numa: always initialize all possible nodes", online means either cpu or memory present. After this patch, only node owing memory as present. In the commit log, I think the change's motivation should be "Not to mention that it doesn't really make much sense to consider an empty node as online because we just consider this node whenever we want to iterate nodes to use and empty node is obviously not the best candidate." But in fact, we already have for_each_node_state(nid, N_MEMORY) to cover this purpose. Furthermore, changing the definition of online may break something in the scheduler, e.g. in task_numa_migrate(), where it calls for_each_online_node. By keeping the node owning cpu as online, Michal's patch can avoid such corner case and keep things easy. Furthermore, if needed, the other patch can use for_each_node_state(nid, N_MEMORY) to replace for_each_online_node is some space. Regards, Pingfan > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-22 7:12 ` Pingfan Liu @ 2019-05-22 11:16 ` Michal Hocko 2019-05-23 3:58 ` Pingfan Liu 0 siblings, 1 reply; 21+ messages in thread From: Michal Hocko @ 2019-05-22 11:16 UTC (permalink / raw) To: Pingfan Liu Cc: Qian Cai, Andrew Morton, brho, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Wed 22-05-19 15:12:16, Pingfan Liu wrote: > On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Mon 13-05-19 11:20:46, Qian Cai wrote: > > > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote: > > > > On Mon 13-05-19 09:43:59, Qian Cai wrote: > > > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > > > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > > > > > > The linux-next commit ("x86, numa: always initialize all possible > > > > > > > nodes") introduced a crash below during boot for systems with a > > > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > > > > > > but that onlining triggers a page fault in bus_add_device() during > > > > > > > device registration: > > > > > > > > > > > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > > > > > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > > > > > > have been set in, > > > > > > > > > > > > > > postcore_initcall(register_node_type); > > > > > > > > > > > > > > but that happens in do_basic_setup() after smp_init(). > > > > > > > > > > > > > > The old code had set this node online via alloc_node_data(), so when it > > > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up > > > > > > > and nothing happened. > > > > > > > > > > > > > > Now, it attempts to online the node, which registers the node with > > > > > > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > > > > > > > > > > > Since kernel_init() is running by a kernel thread that is in > > > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > > > > > > during the early boot in __try_online_node(). > > > > > > > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > > > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > > > > > > we need to call node_set_online because something later on depends on > > > > > > that. Btw. why do we even allocate a pgdat from this path? This looks > > > > > > really messy. > > > > > > > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before > > > > > local > > > > > memory online") > > > > > > > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug > > > > > which is to put its node online if offlined and then hotadd_new_pgdat() > > > > > calls > > > > > build_all_zonelists() to initialize the zone list. > > > > > > > > Well, do we still have to followthe logic that the above (unreviewed) > > > > commit has established? The hotplug code in general made a lot of ad-hoc > > > > design decisions which had to be revisited over time. If we are not > > > > allocating pgdats for newly added memory then we should really make sure > > > > to do so at a proper time and hook. I am not sure about CPU vs. memory > > > > init ordering but even then I would really prefer if we could make the > > > > init less obscure and _documented_. > > > > > > I don't know, but I think it is a good idea to keep the existing logic rather > > > than do a big surgery > > > > Adding more hacks just doesn't make the situation any better. > > > > > unless someone is able to confirm it is not breaking NUMA > > > node physical hotplug. > > > > I have a machine to test whole node offline. I am just busy to prepare a > > patch myself. I can have it tested though. > > > I think the definition of "node online" is worth of rethinking. Before > patch "x86, numa: always initialize all possible nodes", online means > either cpu or memory present. After this patch, only node owing memory > as present. > > In the commit log, I think the change's motivation should be "Not to > mention that it doesn't really make much sense to consider an empty > node as online because we just consider this node whenever we want to > iterate nodes to use and empty node is obviously not the best > candidate." > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > cover this purpose. I do not really think we want to spread N_MEMORY outside of the core MM. It is quite confusing IMHO. . > Furthermore, changing the definition of online may > break something in the scheduler, e.g. in task_numa_migrate(), where > it calls for_each_online_node. Could you be more specific please? Why should numa balancing consider nodes without any memory? > By keeping the node owning cpu as online, Michal's patch can avoid > such corner case and keep things easy. Furthermore, if needed, the > other patch can use for_each_node_state(nid, N_MEMORY) to replace > for_each_online_node is some space. Ideally no code outside of the core MM should care about what kind of memory does the node really own. The external code should only care whether the node is online and thus usable or offline and of no interest. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-22 11:16 ` Michal Hocko @ 2019-05-23 3:58 ` Pingfan Liu 2019-05-23 4:00 ` Pingfan Liu 2019-05-28 18:20 ` Michal Hocko 0 siblings, 2 replies; 21+ messages in thread From: Pingfan Liu @ 2019-05-23 3:58 UTC (permalink / raw) To: Michal Hocko Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote: > > On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Mon 13-05-19 11:20:46, Qian Cai wrote: > > > > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote: > > > > > On Mon 13-05-19 09:43:59, Qian Cai wrote: > > > > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > > > > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > > > > > > > The linux-next commit ("x86, numa: always initialize all possible > > > > > > > > nodes") introduced a crash below during boot for systems with a > > > > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > > > > > > > but that onlining triggers a page fault in bus_add_device() during > > > > > > > > device registration: > > > > > > > > > > > > > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > > > > > > > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > > > > > > > have been set in, > > > > > > > > > > > > > > > > postcore_initcall(register_node_type); > > > > > > > > > > > > > > > > but that happens in do_basic_setup() after smp_init(). > > > > > > > > > > > > > > > > The old code had set this node online via alloc_node_data(), so when it > > > > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up > > > > > > > > and nothing happened. > > > > > > > > > > > > > > > > Now, it attempts to online the node, which registers the node with > > > > > > > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > > > > > > > > > > > > > Since kernel_init() is running by a kernel thread that is in > > > > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > > > > > > > during the early boot in __try_online_node(). > > > > > > > > > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > > > > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > > > > > > > we need to call node_set_online because something later on depends on > > > > > > > that. Btw. why do we even allocate a pgdat from this path? This looks > > > > > > > really messy. > > > > > > > > > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before > > > > > > local > > > > > > memory online") > > > > > > > > > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug > > > > > > which is to put its node online if offlined and then hotadd_new_pgdat() > > > > > > calls > > > > > > build_all_zonelists() to initialize the zone list. > > > > > > > > > > Well, do we still have to followthe logic that the above (unreviewed) > > > > > commit has established? The hotplug code in general made a lot of ad-hoc > > > > > design decisions which had to be revisited over time. If we are not > > > > > allocating pgdats for newly added memory then we should really make sure > > > > > to do so at a proper time and hook. I am not sure about CPU vs. memory > > > > > init ordering but even then I would really prefer if we could make the > > > > > init less obscure and _documented_. > > > > > > > > I don't know, but I think it is a good idea to keep the existing logic rather > > > > than do a big surgery > > > > > > Adding more hacks just doesn't make the situation any better. > > > > > > > unless someone is able to confirm it is not breaking NUMA > > > > node physical hotplug. > > > > > > I have a machine to test whole node offline. I am just busy to prepare a > > > patch myself. I can have it tested though. > > > > > I think the definition of "node online" is worth of rethinking. Before > > patch "x86, numa: always initialize all possible nodes", online means > > either cpu or memory present. After this patch, only node owing memory > > as present. > > > > In the commit log, I think the change's motivation should be "Not to > > mention that it doesn't really make much sense to consider an empty > > node as online because we just consider this node whenever we want to > > iterate nodes to use and empty node is obviously not the best > > candidate." > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > > cover this purpose. > > I do not really think we want to spread N_MEMORY outside of the core MM. > It is quite confusing IMHO. > . But it has already like this. Just git grep N_MEMORY. > > Furthermore, changing the definition of online may > > break something in the scheduler, e.g. in task_numa_migrate(), where > > it calls for_each_online_node. > > Could you be more specific please? Why should numa balancing consider > nodes without any memory? > As my understanding, the destination cpu can be on a memory less node. BTW, there are several functions in the scheduler facing the same scenario, task_numa_migrate() is an example. > > By keeping the node owning cpu as online, Michal's patch can avoid > > such corner case and keep things easy. Furthermore, if needed, the > > other patch can use for_each_node_state(nid, N_MEMORY) to replace > > for_each_online_node is some space. > > Ideally no code outside of the core MM should care about what kind of > memory does the node really own. The external code should only care > whether the node is online and thus usable or offline and of no > interest. Yes, but maybe it will pay great effort on it. Regards, Pingfan > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-23 3:58 ` Pingfan Liu @ 2019-05-23 4:00 ` Pingfan Liu 2019-05-28 18:21 ` Michal Hocko 2019-05-28 18:20 ` Michal Hocko 1 sibling, 1 reply; 21+ messages in thread From: Pingfan Liu @ 2019-05-23 4:00 UTC (permalink / raw) To: Michal Hocko Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Thu, May 23, 2019 at 11:58 AM Pingfan Liu <kernelfans@gmail.com> wrote: > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote: > > > On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > On Mon 13-05-19 11:20:46, Qian Cai wrote: > > > > > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote: > > > > > > On Mon 13-05-19 09:43:59, Qian Cai wrote: > > > > > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > > > > > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > > > > > > > > The linux-next commit ("x86, numa: always initialize all possible > > > > > > > > > nodes") introduced a crash below during boot for systems with a > > > > > > > > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > > > > > > > > but that onlining triggers a page fault in bus_add_device() during > > > > > > > > > device registration: > > > > > > > > > > > > > > > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > > > > > > > > > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > > > > > > > > have been set in, > > > > > > > > > > > > > > > > > > postcore_initcall(register_node_type); > > > > > > > > > > > > > > > > > > but that happens in do_basic_setup() after smp_init(). > > > > > > > > > > > > > > > > > > The old code had set this node online via alloc_node_data(), so when it > > > > > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up > > > > > > > > > and nothing happened. > > > > > > > > > > > > > > > > > > Now, it attempts to online the node, which registers the node with > > > > > > > > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > > > > > > > > > > > > > > > Since kernel_init() is running by a kernel thread that is in > > > > > > > > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > > > > > > > > during the early boot in __try_online_node(). > > > > > > > > > > > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > > > > > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > > > > > > > > we need to call node_set_online because something later on depends on > > > > > > > > that. Btw. why do we even allocate a pgdat from this path? This looks > > > > > > > > really messy. > > > > > > > > > > > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before > > > > > > > local > > > > > > > memory online") > > > > > > > > > > > > > > It looks like try_online_node() in do_cpu_up() is needed for memory hotplug > > > > > > > which is to put its node online if offlined and then hotadd_new_pgdat() > > > > > > > calls > > > > > > > build_all_zonelists() to initialize the zone list. > > > > > > > > > > > > Well, do we still have to followthe logic that the above (unreviewed) > > > > > > commit has established? The hotplug code in general made a lot of ad-hoc > > > > > > design decisions which had to be revisited over time. If we are not > > > > > > allocating pgdats for newly added memory then we should really make sure > > > > > > to do so at a proper time and hook. I am not sure about CPU vs. memory > > > > > > init ordering but even then I would really prefer if we could make the > > > > > > init less obscure and _documented_. > > > > > > > > > > I don't know, but I think it is a good idea to keep the existing logic rather > > > > > than do a big surgery > > > > > > > > Adding more hacks just doesn't make the situation any better. > > > > > > > > > unless someone is able to confirm it is not breaking NUMA > > > > > node physical hotplug. > > > > > > > > I have a machine to test whole node offline. I am just busy to prepare a > > > > patch myself. I can have it tested though. > > > > > > > I think the definition of "node online" is worth of rethinking. Before > > > patch "x86, numa: always initialize all possible nodes", online means > > > either cpu or memory present. After this patch, only node owing memory > > > as present. > > > > > > In the commit log, I think the change's motivation should be "Not to > > > mention that it doesn't really make much sense to consider an empty > > > node as online because we just consider this node whenever we want to > > > iterate nodes to use and empty node is obviously not the best > > > candidate." > > > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > > > cover this purpose. > > > > I do not really think we want to spread N_MEMORY outside of the core MM. > > It is quite confusing IMHO. > > . > But it has already like this. Just git grep N_MEMORY. > > > > Furthermore, changing the definition of online may > > > break something in the scheduler, e.g. in task_numa_migrate(), where > > > it calls for_each_online_node. > > > > Could you be more specific please? Why should numa balancing consider > > nodes without any memory? > > > As my understanding, the destination cpu can be on a memory less node. > BTW, there are several functions in the scheduler facing the same > scenario, task_numa_migrate() is an example. > > > > By keeping the node owning cpu as online, Michal's patch can avoid > > > such corner case and keep things easy. Furthermore, if needed, the > > > other patch can use for_each_node_state(nid, N_MEMORY) to replace > > > for_each_online_node is some space. > > > > Ideally no code outside of the core MM should care about what kind of > > memory does the node really own. The external code should only care > > whether the node is online and thus usable or offline and of no > > interest. > Yes, but maybe it will pay great effort on it. > And as a first step, we can find a way to fix the bug reported by me and the one reported by Barret > Regards, > Pingfan > > -- > > Michal Hocko > > SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-23 4:00 ` Pingfan Liu @ 2019-05-28 18:21 ` Michal Hocko 2019-05-30 13:01 ` Pingfan Liu 0 siblings, 1 reply; 21+ messages in thread From: Michal Hocko @ 2019-05-28 18:21 UTC (permalink / raw) To: Pingfan Liu Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Thu 23-05-19 12:00:46, Pingfan Liu wrote: [...] > > Yes, but maybe it will pay great effort on it. > > > And as a first step, we can find a way to fix the bug reported by me > and the one reported by Barret Can we try http://lkml.kernel.org/r/20190513140448.GJ24036@dhcp22.suse.cz for starter? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-28 18:21 ` Michal Hocko @ 2019-05-30 13:01 ` Pingfan Liu 0 siblings, 0 replies; 21+ messages in thread From: Pingfan Liu @ 2019-05-30 13:01 UTC (permalink / raw) To: Michal Hocko Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Wed, May 29, 2019 at 2:21 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 23-05-19 12:00:46, Pingfan Liu wrote: > [...] > > > Yes, but maybe it will pay great effort on it. > > > > > And as a first step, we can find a way to fix the bug reported by me > > and the one reported by Barret > > Can we try http://lkml.kernel.org/r/20190513140448.GJ24036@dhcp22.suse.cz > for starter? If it turns out that the changing of for_each_online_node() will not break functions in scheduler like task_numa_migrate(), I think it will be a good starter. On the other hand, if it does, then I think it only requires a slight adjustment of your patch to meet the aim. Thanks, Pingfan > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-23 3:58 ` Pingfan Liu 2019-05-23 4:00 ` Pingfan Liu @ 2019-05-28 18:20 ` Michal Hocko 2019-05-30 12:55 ` Pingfan Liu 1 sibling, 1 reply; 21+ messages in thread From: Michal Hocko @ 2019-05-28 18:20 UTC (permalink / raw) To: Pingfan Liu Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML [Sorry for a late reply] On Thu 23-05-19 11:58:45, Pingfan Liu wrote: > On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote: [...] > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > > > cover this purpose. > > > > I do not really think we want to spread N_MEMORY outside of the core MM. > > It is quite confusing IMHO. > > . > But it has already like this. Just git grep N_MEMORY. I might be wrong but I suspect a closer review would reveal that the use will be inconsistent or dubious so following the existing users is not the best approach. > > > Furthermore, changing the definition of online may > > > break something in the scheduler, e.g. in task_numa_migrate(), where > > > it calls for_each_online_node. > > > > Could you be more specific please? Why should numa balancing consider > > nodes without any memory? > > > As my understanding, the destination cpu can be on a memory less node. > BTW, there are several functions in the scheduler facing the same > scenario, task_numa_migrate() is an example. Even if the destination node is memoryless then any migration would fail because there is no memory. Anyway I still do not see how using online node would break anything. > > > By keeping the node owning cpu as online, Michal's patch can avoid > > > such corner case and keep things easy. Furthermore, if needed, the > > > other patch can use for_each_node_state(nid, N_MEMORY) to replace > > > for_each_online_node is some space. > > > > Ideally no code outside of the core MM should care about what kind of > > memory does the node really own. The external code should only care > > whether the node is online and thus usable or offline and of no > > interest. > > Yes, but maybe it will pay great effort on it. Even if that is the case it would be preferable because the current situation is just not sustainable wrt maintenance cost. It is just too simple to break the existing logic as this particular report outlines. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-28 18:20 ` Michal Hocko @ 2019-05-30 12:55 ` Pingfan Liu 2019-05-31 9:03 ` Michal Hocko 0 siblings, 1 reply; 21+ messages in thread From: Pingfan Liu @ 2019-05-30 12:55 UTC (permalink / raw) To: Michal Hocko Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Wed, May 29, 2019 at 2:20 AM Michal Hocko <mhocko@kernel.org> wrote: > > [Sorry for a late reply] > > On Thu 23-05-19 11:58:45, Pingfan Liu wrote: > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote: > [...] > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > > > > cover this purpose. > > > > > > I do not really think we want to spread N_MEMORY outside of the core MM. > > > It is quite confusing IMHO. > > > . > > But it has already like this. Just git grep N_MEMORY. > > I might be wrong but I suspect a closer review would reveal that the use > will be inconsistent or dubious so following the existing users is not > the best approach. > > > > > Furthermore, changing the definition of online may > > > > break something in the scheduler, e.g. in task_numa_migrate(), where > > > > it calls for_each_online_node. > > > > > > Could you be more specific please? Why should numa balancing consider > > > nodes without any memory? > > > > > As my understanding, the destination cpu can be on a memory less node. > > BTW, there are several functions in the scheduler facing the same > > scenario, task_numa_migrate() is an example. > > Even if the destination node is memoryless then any migration would fail > because there is no memory. Anyway I still do not see how using online > node would break anything. > Suppose we have nodes A, B,C, where C is memory less but has little distance to B, comparing with the one from A to B. Then if a task is running on A, but prefer to run on B due to memory footprint. task_numa_migrate() allows us to migrate the task to node C. Changing for_each_online_node will break this. Regards, Pingfan > > > > By keeping the node owning cpu as online, Michal's patch can avoid > > > > such corner case and keep things easy. Furthermore, if needed, the > > > > other patch can use for_each_node_state(nid, N_MEMORY) to replace > > > > for_each_online_node is some space. > > > > > > Ideally no code outside of the core MM should care about what kind of > > > memory does the node really own. The external code should only care > > > whether the node is online and thus usable or offline and of no > > > interest. > > > > Yes, but maybe it will pay great effort on it. > > Even if that is the case it would be preferable because the current > situation is just not sustainable wrt maintenance cost. It is just too > simple to break the existing logic as this particular report outlines. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-30 12:55 ` Pingfan Liu @ 2019-05-31 9:03 ` Michal Hocko 2019-06-03 4:17 ` Pingfan Liu 0 siblings, 1 reply; 21+ messages in thread From: Michal Hocko @ 2019-05-31 9:03 UTC (permalink / raw) To: Pingfan Liu Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Thu 30-05-19 20:55:32, Pingfan Liu wrote: > On Wed, May 29, 2019 at 2:20 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > [Sorry for a late reply] > > > > On Thu 23-05-19 11:58:45, Pingfan Liu wrote: > > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote: > > [...] > > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > > > > > cover this purpose. > > > > > > > > I do not really think we want to spread N_MEMORY outside of the core MM. > > > > It is quite confusing IMHO. > > > > . > > > But it has already like this. Just git grep N_MEMORY. > > > > I might be wrong but I suspect a closer review would reveal that the use > > will be inconsistent or dubious so following the existing users is not > > the best approach. > > > > > > > Furthermore, changing the definition of online may > > > > > break something in the scheduler, e.g. in task_numa_migrate(), where > > > > > it calls for_each_online_node. > > > > > > > > Could you be more specific please? Why should numa balancing consider > > > > nodes without any memory? > > > > > > > As my understanding, the destination cpu can be on a memory less node. > > > BTW, there are several functions in the scheduler facing the same > > > scenario, task_numa_migrate() is an example. > > > > Even if the destination node is memoryless then any migration would fail > > because there is no memory. Anyway I still do not see how using online > > node would break anything. > > > Suppose we have nodes A, B,C, where C is memory less but has little > distance to B, comparing with the one from A to B. Then if a task is > running on A, but prefer to run on B due to memory footprint. > task_numa_migrate() allows us to migrate the task to node C. Changing > for_each_online_node will break this. That would require the task to have preferred node to be C no? Or do I missunderstand the task migration logic? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-31 9:03 ` Michal Hocko @ 2019-06-03 4:17 ` Pingfan Liu 0 siblings, 0 replies; 21+ messages in thread From: Pingfan Liu @ 2019-06-03 4:17 UTC (permalink / raw) To: Michal Hocko Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Fri, May 31, 2019 at 5:03 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 30-05-19 20:55:32, Pingfan Liu wrote: > > On Wed, May 29, 2019 at 2:20 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > [Sorry for a late reply] > > > > > > On Thu 23-05-19 11:58:45, Pingfan Liu wrote: > > > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote: > > > [...] > > > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > > > > > > cover this purpose. > > > > > > > > > > I do not really think we want to spread N_MEMORY outside of the core MM. > > > > > It is quite confusing IMHO. > > > > > . > > > > But it has already like this. Just git grep N_MEMORY. > > > > > > I might be wrong but I suspect a closer review would reveal that the use > > > will be inconsistent or dubious so following the existing users is not > > > the best approach. > > > > > > > > > Furthermore, changing the definition of online may > > > > > > break something in the scheduler, e.g. in task_numa_migrate(), where > > > > > > it calls for_each_online_node. > > > > > > > > > > Could you be more specific please? Why should numa balancing consider > > > > > nodes without any memory? > > > > > > > > > As my understanding, the destination cpu can be on a memory less node. > > > > BTW, there are several functions in the scheduler facing the same > > > > scenario, task_numa_migrate() is an example. > > > > > > Even if the destination node is memoryless then any migration would fail > > > because there is no memory. Anyway I still do not see how using online > > > node would break anything. > > > > > Suppose we have nodes A, B,C, where C is memory less but has little > > distance to B, comparing with the one from A to B. Then if a task is > > running on A, but prefer to run on B due to memory footprint. > > task_numa_migrate() allows us to migrate the task to node C. Changing > > for_each_online_node will break this. > > That would require the task to have preferred node to be C no? Or do I > missunderstand the task migration logic? I think in task_numa_migrate(), the migration logic should looks like: env.dst_nid = p->numa_preferred_nid; //Here dst nid is B But later in if (env.best_cpu == -1 || (p->numa_group && p->numa_group->active_nodes > 1)) { for_each_online_node(nid) { [...] task_numa_find_cpu(&env, taskimp, groupimp); // Here is a chance to change p->numa_preferred_nid There are serveral other broken by changing for_each_online_node(), -1. show_numa_stats() -2. init_numa_topology_type(), where sched_numa_topology_type may be mistaken evaluated. -3. ... can check call to for_each_online_node() one by one in scheduler. That is my understanding of the code. Thanks, Pingfan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-05-13 12:41 ` Michal Hocko 2019-05-13 13:43 ` Qian Cai @ 2019-06-21 13:17 ` Qian Cai 2019-06-21 13:55 ` Michal Hocko 1 sibling, 1 reply; 21+ messages in thread From: Qian Cai @ 2019-06-21 13:17 UTC (permalink / raw) To: Michal Hocko Cc: akpm, brho, kernelfans, dave.hansen, rppt, peterz, mpe, mingo, osalvador, luto, tglx, linux-mm, linux-kernel Sigh... I don't see any benefit to keep the broken commit, "x86, numa: always initialize all possible nodes" for so long in linux-next that just prevent x86 NUMA machines with any memory- less node from booting. Andrew, maybe it is time to drop this patch until Michal found some time to fix it properly. On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote: > On Sun 12-05-19 01:48:29, Qian Cai wrote: > > The linux-next commit ("x86, numa: always initialize all possible > > nodes") introduced a crash below during boot for systems with a > > memory-less node. This is due to CPUs that get onlined during SMP boot, > > but that onlining triggers a page fault in bus_add_device() during > > device registration: > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj, > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should > > have been set in, > > > > postcore_initcall(register_node_type); > > > > but that happens in do_basic_setup() after smp_init(). > > > > The old code had set this node online via alloc_node_data(), so when it > > came time to do_cpu_up() -> try_online_node(), the node was already up > > and nothing happened. > > > > Now, it attempts to online the node, which registers the node with > > sysfs, but that can't happen before the 'node' subsystem is registered. > > > > Since kernel_init() is running by a kernel thread that is in > > SYSTEM_SCHEDULING state, fixed this by skipping registering with sysfs > > during the early boot in __try_online_node(). > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply > drop try_online_node from do_cpu_up? Your v2 remark below suggests that > we need to call node_set_online because something later on depends on > that. Btw. why do we even allocate a pgdat from this path? This looks > really messy. > > > Call Trace: > > device_add+0x43e/0x690 > > device_register+0x107/0x110 > > __register_one_node+0x72/0x150 > > __try_online_node+0x8f/0xd0 > > try_online_node+0x2b/0x50 > > do_cpu_up+0x46/0xf0 > > cpu_up+0x13/0x20 > > smp_init+0x6e/0xd0 > > kernel_init_freeable+0xe5/0x21f > > kernel_init+0xf/0x180 > > ret_from_fork+0x1f/0x30 > > > > Reported-by: Barret Rhoden <brho@google.com> > > Signed-off-by: Qian Cai <cai@lca.pw> > > --- > > > > v2: Set the node online as it have CPUs. Otherwise, those memory-less nodes > > will > > end up being not in sysfs i.e., /sys/devices/system/node/. > > > > mm/memory_hotplug.c | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > > index b236069ff0d8..6eb2331fa826 100644 > > --- a/mm/memory_hotplug.c > > +++ b/mm/memory_hotplug.c > > @@ -1037,6 +1037,18 @@ static int __try_online_node(int nid, u64 start, bool > > set_node_online) > > if (node_online(nid)) > > return 0; > > > > + /* > > + * Here is called by cpu_up() to online a node without memory from > > + * kernel_init() which guarantees that "set_node_online" is true > > which > > + * will set the node online as it have CPUs but not ready to call > > + * register_one_node() as "node_subsys" has not been initialized > > + * properly yet. > > + */ > > + if (system_state == SYSTEM_SCHEDULING) { > > + node_set_online(nid); > > + return 0; > > + } > > + > > pgdat = hotadd_new_pgdat(nid, start); > > if (!pgdat) { > > pr_err("Cannot online node %d due to NULL pgdat\n", nid); > > -- > > 2.20.1 (Apple Git-117) > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-06-21 13:17 ` Qian Cai @ 2019-06-21 13:55 ` Michal Hocko 2019-06-24 8:42 ` Pingfan Liu 0 siblings, 1 reply; 21+ messages in thread From: Michal Hocko @ 2019-06-21 13:55 UTC (permalink / raw) To: Qian Cai Cc: akpm, brho, kernelfans, dave.hansen, rppt, peterz, mpe, mingo, osalvador, luto, tglx, linux-mm, linux-kernel On Fri 21-06-19 09:17:58, Qian Cai wrote: > Sigh... > > I don't see any benefit to keep the broken commit, > > "x86, numa: always initialize all possible nodes" > > for so long in linux-next that just prevent x86 NUMA machines with any memory- > less node from booting. > > Andrew, maybe it is time to drop this patch until Michal found some time to fix > it properly. Yes, please drop the patch for now, Andrew. I thought I could get to this but time is just scarce. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-06-21 13:55 ` Michal Hocko @ 2019-06-24 8:42 ` Pingfan Liu 2019-06-26 13:57 ` Michal Hocko 0 siblings, 1 reply; 21+ messages in thread From: Pingfan Liu @ 2019-06-24 8:42 UTC (permalink / raw) To: Michal Hocko Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML Hi Michal, What about dropping the change of the online definition of your patch, just do the following? diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index e6dad60..9c087c3 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -749,13 +749,12 @@ static void __init init_memory_less_node(int nid) */ void __init init_cpu_to_node(void) { - int cpu; + int cpu, node; u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid); BUG_ON(cpu_to_apicid == NULL); - for_each_possible_cpu(cpu) { - int node = numa_cpu_node(cpu); + for_each_node_mask(node, numa_nodes_parsed) { if (node == NUMA_NO_NODE) continue; @@ -765,6 +764,10 @@ void __init init_cpu_to_node(void) numa_set_node(cpu, node); } + for_each_possible_cpu(cpu) { + int node = numa_cpu_node(cpu); + numa_set_node(cpu, node); + } } Thanks, Pingfan On Fri, Jun 21, 2019 at 9:55 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Fri 21-06-19 09:17:58, Qian Cai wrote: > > Sigh... > > > > I don't see any benefit to keep the broken commit, > > > > "x86, numa: always initialize all possible nodes" > > > > for so long in linux-next that just prevent x86 NUMA machines with any memory- > > less node from booting. > > > > Andrew, maybe it is time to drop this patch until Michal found some time to fix > > it properly. > > Yes, please drop the patch for now, Andrew. I thought I could get to > this but time is just scarce. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-06-24 8:42 ` Pingfan Liu @ 2019-06-26 13:57 ` Michal Hocko 2019-06-27 3:11 ` Pingfan Liu 0 siblings, 1 reply; 21+ messages in thread From: Michal Hocko @ 2019-06-26 13:57 UTC (permalink / raw) To: Pingfan Liu Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Mon 24-06-19 16:42:20, Pingfan Liu wrote: > Hi Michal, > > What about dropping the change of the online definition of your patch, > just do the following? I am sorry but I am unlikely to find some more time to look into this. I am willing to help reviewing but I will not find enough time to focus on this to fix up the patch. Are you willing to work on this and finish the patch? It is a very tricky area with side effects really hard to see in advance but going with a robust fix is definitely worth the effort. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot 2019-06-26 13:57 ` Michal Hocko @ 2019-06-27 3:11 ` Pingfan Liu 0 siblings, 0 replies; 21+ messages in thread From: Pingfan Liu @ 2019-06-27 3:11 UTC (permalink / raw) To: Michal Hocko Cc: Qian Cai, Andrew Morton, Barret Rhoden, Dave Hansen, Mike Rapoport, Peter Zijlstra, Michael Ellerman, Ingo Molnar, Oscar Salvador, Andy Lutomirski, Thomas Gleixner, linux-mm, LKML On Wed, Jun 26, 2019 at 9:57 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Mon 24-06-19 16:42:20, Pingfan Liu wrote: > > Hi Michal, > > > > What about dropping the change of the online definition of your patch, > > just do the following? > > I am sorry but I am unlikely to find some more time to look into this. I > am willing to help reviewing but I will not find enough time to focus on > this to fix up the patch. Are you willing to work on this and finish the > patch? It is a very tricky area with side effects really hard to see in > advance but going with a robust fix is definitely worth the effort. Yeah, the bug is a little trivial but hard to fix bug, and take a lot of time. It is hard to meet your original design target, based on current situation. I will have a try limited to this bug. Thanks, Pingfan > > Thanks! > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2019-06-27 3:11 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-05-12 5:48 [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot Qian Cai 2019-05-13 12:41 ` Michal Hocko 2019-05-13 13:43 ` Qian Cai 2019-05-13 14:04 ` Michal Hocko 2019-05-13 15:20 ` Qian Cai 2019-05-13 15:31 ` Michal Hocko 2019-05-22 7:12 ` Pingfan Liu 2019-05-22 11:16 ` Michal Hocko 2019-05-23 3:58 ` Pingfan Liu 2019-05-23 4:00 ` Pingfan Liu 2019-05-28 18:21 ` Michal Hocko 2019-05-30 13:01 ` Pingfan Liu 2019-05-28 18:20 ` Michal Hocko 2019-05-30 12:55 ` Pingfan Liu 2019-05-31 9:03 ` Michal Hocko 2019-06-03 4:17 ` Pingfan Liu 2019-06-21 13:17 ` Qian Cai 2019-06-21 13:55 ` Michal Hocko 2019-06-24 8:42 ` Pingfan Liu 2019-06-26 13:57 ` Michal Hocko 2019-06-27 3:11 ` Pingfan Liu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox