On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko wrote: > > On Tue 04-12-18 16:20:32, Pingfan Liu wrote: > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko wrote: > > > > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote: > > > > During my test on some AMD machine, with kexec -l nr_cpus=x option, the > > > > kernel failed to bootup, because some node's data struct can not be allocated, > > > > e.g, on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But > > > > device->numa_node info is used as preferred_nid param for > > > > __alloc_pages_nodemask(), which causes NULL reference > > > > ac->zonelist = node_zonelist(preferred_nid, gfp_mask); > > > > This patch tries to fix the issue by falling back to the first online node, > > > > when encountering such corner case. > > > > > > We have seen similar issues already and the bug was usually that the > > > zonelists were not initialized yet or the node is completely bogus. > > > Zonelists should be initialized by build_all_zonelists quite early so I > > > am wondering whether the later is the case. What is the actual node > > > number the device is associated with? > > > > > The device's node num is 2. And in my case, I used nr_cpus param. Due > > to init_cpu_to_node() initialize all the possible node. It is hard > > for me to figure out without this param, how zonelists is accessed > > before page allocator works. > > I believe we should focus on this. Why does the node have no zonelist > even though all zonelists should be initialized already? Maybe this is > nr_cpus pecularity and we do not initialize all the existing numa nodes. > Or maybe the device is associated to a non-existing node with that > setup. A full dmesg might help us here. > Requiring the machine again, and I got the following without nr_cpus option [root@dell-per7425-03 ~]# cd /sys/devices/system/node/ [root@dell-per7425-03 node]# ls has_cpu has_memory has_normal_memory node0 node1 node2 node3 node4 node5 node6 node7 online possible power uevent [root@dell-per7425-03 node]# cat has_cpu 0-7 [root@dell-per7425-03 node]# cat has_memory 1,5 [root@dell-per7425-03 node]# cat online 0-7 [root@dell-per7425-03 node]# cat possible 0-7 And lscpu shows the following numa-cpu info: NUMA node0 CPU(s): 0,8,16,24 NUMA node1 CPU(s): 2,10,18,26 NUMA node2 CPU(s): 4,12,20,28 NUMA node3 CPU(s): 6,14,22,30 NUMA node4 CPU(s): 1,9,17,25 NUMA node5 CPU(s): 3,11,19,27 NUMA node6 CPU(s): 5,13,21,29 NUMA node7 CPU(s): 7,15,23,31 For the full panic message (I masked some hostname info with xx), please see the attachment. In a short word, it seems a problem with nr_cpus, if without this option, the kernel can bootup correctly. > > > Your patch is not correct btw, because we want to fallback into the node in > > > the distance order rather into the first online node. > > > -- > > What about this: > > +extern int find_next_best_node(int node, nodemask_t *used_node_mask); > > + > > /* > > * We get the zone list from the current node and the gfp_mask. > > * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones. > > @@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags) > > */ > > static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > > { > > + if (unlikely(!node_online(nid))) { > > + nodemask_t used_mask; > > + nodes_complement(used_mask, node_online_map); > > + nid = find_next_best_node(nid, &used_mask); > > + } > > return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > > } > > > > I just finished the compiling, not test it yet, since the machine is > > not on hand yet. It needs some time to get it again. > > This is clearly a no-go. nodemask_t can be giant and you cannot have it > on the stack for allocation paths which might be called from a deep What about the static variable static inline struct zonelist *node_zonelist(int nid, gfp_t flags) { + if (unlikely(!node_online(nid))) { + WARN_ONCE(1, "Try to alloc mem from not online node\n"); + nid = find_best_node(nid); + } return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); } +int find_best_node(int node) +{ + static nodemask_t used_mask; + static DEFINE_SPINLOCK(lock); + static int best_neigh[MAX_NUMNODES] = {0}; + int nid; + + spin_lock(&lock); + if (likely( best_neigh[node] != 0)) { + nid = best_neigh[node]; + goto unlock; + } + nodes_complement(used_mask, node_online_map); + nid = find_next_best_node(node, &used_mask); + best_neigh[node] = nid; + +unlock: + spin_unlock(&lock); + return nid; +} > stack already. Also this is called from the allocator hot paths and each > branch counts. > I am puzzling about this. Does unlikely work for it? Thanks, Pingfan