From: Pingfan Liu <kernelfans@gmail.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>, Andrew Morton <akpm@linux-foundation.org>,
Barret Rhoden <brho@google.com>,
Dave Hansen <dave.hansen@intel.com>,
Mike Rapoport <rppt@linux.ibm.com>,
Peter Zijlstra <peterz@infradead.org>,
Michael Ellerman <mpe@ellerman.id.au>,
Ingo Molnar <mingo@elte.hu>, Oscar Salvador <osalvador@suse.de>,
Andy Lutomirski <luto@kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot
Date: Thu, 23 May 2019 12:00:46 +0800
Message-ID: <CAFgQCTvKZU1B0e4Bg3hQedMJ4Oq2uiOshnsBQCjKinmrGdKcYg@mail.gmail.com>
In-Reply-To: <CAFgQCTuKVif9gPTsbNdAqLGQyQpQ+gC2D1BQT99d0yDYHj4_mA@mail.gmail.com>
On Thu, May 23, 2019 at 11:58 AM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> On Wed, May 22, 2019 at 7:16 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> > > On Mon, May 13, 2019 at 11:31 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Mon 13-05-19 11:20:46, Qian Cai wrote:
> > > > > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote:
> > > > > > On Mon 13-05-19 09:43:59, Qian Cai wrote:
> > > > > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > > > > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > > > > > > > The linux-next commit ("x86, numa: always initialize all possible
> > > > > > > > > > nodes") introduced the crash below during boot on systems with a
> > > > > > > > > > memory-less node. CPUs get onlined during SMP boot, and that
> > > > > > > > > > onlining triggers a page fault in bus_add_device() during device
> > > > > > > > > > registration:
> > > > > > > > >
> > > > > > > > > error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > > > > > > >
> > > > > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > > > > > > > have been set in,
> > > > > > > > >
> > > > > > > > > postcore_initcall(register_node_type);
> > > > > > > > >
> > > > > > > > > but that happens in do_basic_setup() after smp_init().
> > > > > > > > >
> > > > > > > > > The old code had set this node online via alloc_node_data(), so when it
> > > > > > > > > came time to do_cpu_up() -> try_online_node(), the node was already up
> > > > > > > > > and nothing happened.
> > > > > > > > >
> > > > > > > > > Now, it attempts to online the node, which registers the node with
> > > > > > > > > sysfs, but that can't happen before the 'node' subsystem is registered.
> > > > > > > > >
> > > > > > > > > > Since kernel_init() is run by a kernel thread while the system is
> > > > > > > > > > in the SYSTEM_SCHEDULING state, fix this by skipping sysfs
> > > > > > > > > > registration during early boot in __try_online_node().
> > > > > > > >
> > > > > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we simply
> > > > > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests that
> > > > > > > > we need to call node_set_online because something later on depends on
> > > > > > > > that. Btw. why do we even allocate a pgdat from this path? This looks
> > > > > > > > really messy.
> > > > > > >
> > > > > > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online
> > > > > > > > before local memory online")
> > > > > > > >
> > > > > > > > It looks like try_online_node() in do_cpu_up() is needed for memory
> > > > > > > > hotplug: it brings the node online if it is offline, and then
> > > > > > > > hotadd_new_pgdat() calls build_all_zonelists() to initialize the
> > > > > > > > zone list.
> > > > > >
> > > > > > > Well, do we still have to follow the logic that the above (unreviewed)
> > > > > > commit has established? The hotplug code in general made a lot of ad-hoc
> > > > > > design decisions which had to be revisited over time. If we are not
> > > > > > allocating pgdats for newly added memory then we should really make sure
> > > > > > to do so at a proper time and hook. I am not sure about CPU vs. memory
> > > > > > init ordering but even then I would really prefer if we could make the
> > > > > > init less obscure and _documented_.
> > > > >
> > > > > I don't know, but I think it is a good idea to keep the existing logic rather
> > > > > than do a big surgery
> > > >
> > > > Adding more hacks just doesn't make the situation any better.
> > > >
> > > > > unless someone is able to confirm it is not breaking NUMA
> > > > > node physical hotplug.
> > > >
> > > > > I have a machine to test whole node offline. I am just too busy to
> > > > > prepare a patch myself. I can have it tested though.
> > > >
> > > > I think the definition of "node online" is worth rethinking. Before
> > > > the patch "x86, numa: always initialize all possible nodes", online
> > > > meant that either a cpu or memory was present. After this patch, only
> > > > a node owning memory counts as online.
> > >
> > > In the commit log, I think the change's motivation should be "Not to
> > > mention that it doesn't really make much sense to consider an empty
> > > node as online because we just consider this node whenever we want to
> > > iterate nodes to use and empty node is obviously not the best
> > > candidate."
> > >
> > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > cover this purpose.
> >
> > I do not really think we want to spread N_MEMORY outside of the core MM.
> > It is quite confusing IMHO.
> > .
> But it has already like this. Just git grep N_MEMORY.
>
> > > Furthermore, changing the definition of online may
> > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > it calls for_each_online_node.
> >
> > Could you be more specific please? Why should numa balancing consider
> > nodes without any memory?
> >
> As I understand it, the destination cpu can be on a memory-less node.
> BTW, several functions in the scheduler face the same scenario;
> task_numa_migrate() is one example.
>
> > > By keeping a node that owns a cpu online, Michal's patch can avoid
> > > such corner cases and keep things simple. Furthermore, if needed,
> > > another patch can use for_each_node_state(nid, N_MEMORY) to replace
> > > for_each_online_node in some places.
> >
> > Ideally no code outside of the core MM should care about what kind of
> > memory the node really owns. The external code should only care
> > whether the node is online and thus usable or offline and of no
> > interest.
> Yes, but that may take a great deal of effort.
>
And as a first step, we can find a way to fix the bug reported by me
and the one reported by Barret.
> Regards,
> Pingfan
> > --
> > Michal Hocko
> > SUSE Labs
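
The difference between iterating over online nodes and iterating with
for_each_node_state(nid, N_MEMORY), as discussed in the thread, can be
illustrated with a small userspace model. This is only a sketch: the
enum, the bitmasks, the for_each_model_node_state() macro and the helper
names are hypothetical stand-ins and not the kernel's nodemask API, and
the three-node layout (node 2 has CPUs but no memory) is made up for
illustration. It assumes the older meaning of "online", i.e. a node with
either a cpu or memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; these names are NOT the kernel's definitions. */
enum model_node_state { STATE_ONLINE, STATE_MEMORY, STATE_CPU, NR_STATES };

#define MAX_NODES 4

/* One bitmask per state: bit nid set means node nid is in that state. */
static const unsigned int model_node_states[NR_STATES] = {
	[STATE_ONLINE] = 0x7,	/* nodes 0, 1, 2 are online */
	[STATE_MEMORY] = 0x3,	/* only nodes 0 and 1 own memory */
	[STATE_CPU]    = 0x5,	/* nodes 0 and 2 own CPUs */
};

static bool node_in_state(int nid, enum model_node_state s)
{
	assert(s < NR_STATES);
	if (nid < 0 || nid >= MAX_NODES)
		return false;
	return ((model_node_states[s] >> nid) & 1u) != 0;
}

/* Rough analogue of for_each_node_state(nid, state) for this model. */
#define for_each_model_node_state(nid, s)			\
	for ((nid) = 0; (nid) < MAX_NODES; (nid)++)		\
		if (node_in_state((nid), (s)))

/* Count how many nodes an iteration over the given state would visit. */
static int count_nodes(enum model_node_state s)
{
	int nid, n = 0;

	for_each_model_node_state(nid, s)
		n++;
	return n;
}

/*
 * Walk the online nodes and return the first one that has CPUs but no
 * memory, or -1 if there is none. A walk over STATE_MEMORY alone would
 * never visit such a node.
 */
static int first_memoryless_cpu_node(void)
{
	int nid;

	for_each_model_node_state(nid, STATE_ONLINE)
		if (node_in_state(nid, STATE_CPU) &&
		    !node_in_state(nid, STATE_MEMORY))
			return nid;
	return -1;
}
```

In this model an online-node walk visits three nodes while a
memory-state walk visits two, and first_memoryless_cpu_node() finds
node 2. That is the kind of node a scheduler-side loop over online nodes
would stop seeing if "online" came to mean "owns memory".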