* [PATCH v6 1/3] powerpc/numa: Set numa_node for all possible cpus
2020-08-18 8:11 [PATCH v6 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
@ 2020-08-18 8:11 ` Srikar Dronamraju
2020-08-18 8:11 ` [PATCH v6 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Srikar Dronamraju @ 2020-08-18 8:11 UTC (permalink / raw)
To: Michael Ellerman
Cc: linuxppc-dev, Srikar Dronamraju, linux-mm, Michal Hocko,
Mel Gorman, Vlastimil Babka, Christopher Lameter, Andrew Morton,
Linus Torvalds, Gautham R Shenoy, Satheesh Rajendran,
David Hildenbrand, Aneesh Kumar K V
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not have any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once it is present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.
To ensure a cpuless, memoryless dummy node is not online, powerpc needs
to make sure cpu_to_node() for all possible but not present cpus is set to a
proper node.
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/20200501031128.19584-1-srikar@linux.vnet.ibm.com/t/#u
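For readers who want to see the effect of the change outside the kernel, here is a minimal userspace sketch of the idea described in the commit message. It is not the kernel implementation: `struct model_topology`, `model_setup_cpu()` and `model_setup_all_possible()` are hypothetical stand-ins for `cpu_present()`, `first_online_node` and `numa_setup_cpu()`, and the firmware lookup is reduced to a plain array.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Hypothetical userspace stand-ins for kernel state (assumption, not kernel API). */
struct model_topology {
	bool present[MODEL_NR_CPUS];     /* models cpu_present(cpu) */
	int firmware_nid[MODEL_NR_CPUS]; /* node reported for present cpus */
	int cpu_node[MODEL_NR_CPUS];     /* models what cpu_to_node(cpu) returns */
	int first_online_node;           /* models first_online_node */
};

/*
 * Models the early return added to numa_setup_cpu(): a cpu that is
 * possible but not present is parked on the first online node instead
 * of being left on node 0.
 */
int model_setup_cpu(struct model_topology *t, int cpu)
{
	int nid;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	if (!t->present[cpu])
		nid = t->first_online_node;
	else
		nid = t->firmware_nid[cpu];

	t->cpu_node[cpu] = nid;
	return nid;
}

/* Models the switch from for_each_present_cpu() to for_each_possible_cpu(). */
void model_setup_all_possible(struct model_topology *t)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_setup_cpu(t, cpu);
}
```

With the first four cpus present on node 3 and `first_online_node` set to 2, every non-present cpu ends up on node 2 after `model_setup_all_possible()`, so no cpu is left pointing at a node 0 that may later be taken offline.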
arch/powerpc/mm/numa.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 1f61fa2148b5..72f6cca1332c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -507,6 +507,11 @@ static int numa_setup_cpu(unsigned long lcpu)
int fcpu = cpu_first_thread_sibling(lcpu);
int nid = NUMA_NO_NODE;
+ if (!cpu_present(lcpu)) {
+ set_cpu_numa_node(lcpu, first_online_node);
+ return first_online_node;
+ }
+
/*
* If a valid cpu-to-node mapping is already available, use it
* directly instead of querying the firmware, since it represents
@@ -935,8 +940,17 @@ void __init mem_topology_setup(void)
reset_numa_cpu_lookup_table();
- for_each_present_cpu(cpu)
+ for_each_possible_cpu(cpu) {
+ /*
+ * Powerpc with CONFIG_NUMA always used to have a node 0,
+ * even if it was memoryless or cpuless. For all cpus that
+ * are possible but not present, cpu_to_node() would point
+ * to node 0. To remove a cpuless, memoryless dummy node,
+ * powerpc need to make sure all possible but not present
+ * cpu_to_node are set to a proper node.
+ */
numa_setup_cpu(cpu);
+ }
}
void __init initmem_init(void)
--
2.18.1
* [PATCH v6 2/3] powerpc/numa: Prefer node id queried from vphn
2020-08-18 8:11 [PATCH v6 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
2020-08-18 8:11 ` [PATCH v6 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
@ 2020-08-18 8:11 ` Srikar Dronamraju
2020-08-18 8:11 ` [PATCH v6 3/3] powerpc/numa: Offline memoryless cpuless node 0 Srikar Dronamraju
2020-09-17 11:27 ` [PATCH v6 0/3] " Michael Ellerman
3 siblings, 0 replies; 5+ messages in thread
From: Srikar Dronamraju @ 2020-08-18 8:11 UTC (permalink / raw)
To: Michael Ellerman
Cc: linuxppc-dev, Srikar Dronamraju, linux-mm, Michal Hocko,
Mel Gorman, Vlastimil Babka, Christopher Lameter, Andrew Morton,
Linus Torvalds, Gautham R Shenoy, Satheesh Rajendran,
David Hildenbrand, Aneesh Kumar K V
The node id queried from the static device tree may not
be correct. For example, it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn, and fall back on the device tree
based node id if the vphn query fails.
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u
Changelog v1:->v2:
- Rebased to v5.7-rc3
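The selection order described in the commit message can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions: `model_pick_nid()` and `MODEL_VPHN_NO_NODE` are hypothetical names, and the two integer arguments stand in for the results of `vphn_get_nid()` and `of_node_to_nid_single()` rather than calling any kernel code.

```c
#include <assert.h>

/* Hypothetical stand-in for NUMA_NO_NODE (assumption for this sketch). */
#define MODEL_VPHN_NO_NODE (-1)

/*
 * Prefer the node id obtained from the vphn query; use the device tree
 * based node id only when the vphn query did not yield a node.
 *
 * vphn_nid: models the result of vphn_get_nid()
 * dt_nid:   models the result of of_node_to_nid_single()
 */
int model_pick_nid(int vphn_nid, int dt_nid)
{
	assert(vphn_nid >= MODEL_VPHN_NO_NODE);

	if (vphn_nid != MODEL_VPHN_NO_NODE)
		return vphn_nid;

	return dt_nid;
}
```

For example, `model_pick_nid(2, 0)` returns 2 (the device tree value 0 is ignored), while `model_pick_nid(MODEL_VPHN_NO_NODE, 1)` falls back to 1.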
arch/powerpc/mm/numa.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 72f6cca1332c..10c5064eeb88 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -728,21 +728,22 @@ static int __init parse_numa_properties(void)
*/
for_each_present_cpu(i) {
struct device_node *cpu;
- int nid;
-
- cpu = of_get_cpu_node(i, NULL);
- BUG_ON(!cpu);
- nid = of_node_to_nid_single(cpu);
- of_node_put(cpu);
+ int nid = vphn_get_nid(i);
/*
* Don't fall back to default_nid yet -- we will plug
* cpus into nodes once the memory scan has discovered
* the topology.
*/
- if (nid < 0)
- continue;
- node_set_online(nid);
+ if (nid == NUMA_NO_NODE) {
+ cpu = of_get_cpu_node(i, NULL);
+ BUG_ON(!cpu);
+ nid = of_node_to_nid_single(cpu);
+ of_node_put(cpu);
+ }
+
+ if (likely(nid > 0))
+ node_set_online(nid);
}
get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
--
2.18.1
* [PATCH v6 3/3] powerpc/numa: Offline memoryless cpuless node 0
2020-08-18 8:11 [PATCH v6 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
2020-08-18 8:11 ` [PATCH v6 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
2020-08-18 8:11 ` [PATCH v6 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
@ 2020-08-18 8:11 ` Srikar Dronamraju
2020-09-17 11:27 ` [PATCH v6 0/3] " Michael Ellerman
3 siblings, 0 replies; 5+ messages in thread
From: Srikar Dronamraju @ 2020-08-18 8:11 UTC (permalink / raw)
To: Michael Ellerman
Cc: linuxppc-dev, Srikar Dronamraju, linux-mm, Michal Hocko,
Mel Gorman, Vlastimil Babka, Christopher Lameter, Andrew Morton,
Linus Torvalds, Gautham R Shenoy, David Hildenbrand,
Aneesh Kumar K V
Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However, in practice,
there are systems which have node 0 as memoryless and cpuless.
This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node, which is cpuless and
memoryless, can confuse users/scripts looking at the output of lscpu /
numactl.
By marking node 0 as offline, let's stop assuming that node 0 is
always online. If node 0 has CPUs or memory that are online, node 0 will
again be set as online.
v5.8
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node 0 2
0: 10 20
2: 20 10
proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31
v5.8 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node 2
2: 10
proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31
Example of a node with online CPUs/memory on node 0.
(Same o/p with and without patch)
numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 32482 MB
node 0 free: 22994 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 20 40 40
1: 20 10 40 40
2: 40 40 10 20
3: 40 40 20 10
Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without these 2 commits, a Powerpc system might crash.
Problems caused by the dummy node:
1. User space applications like numactl and lscpu that parse sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that the system was
not able to use all the resources (i.e. missing resources) or that the system
was not set up correctly.
2. The existence of the dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump.
3. When the dummy node is present, single node non-NUMA systems end up showing
up as NUMA systems and numa_balancing gets enabled. This means we take
the hit from the unnecessary numa hinting faults.
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog:
v5->v6:
Moved fix from arch independent code to powerpc specific
(Michal Hocko, Christopher Lameter)
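The behaviour described in the commit message can be modelled in userspace as follows. This is a rough sketch, not the kernel code: `struct model_node`, `model_online_mask()`, `model_count_online()` and `model_numa_balancing_enabled()` are hypothetical names, the online map is reduced to a 32-bit mask, and the rule "numa_balancing is enabled when more than one node is online" is an assumption inferred from the numactl output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_NODES 32

/* Hypothetical per-node summary (assumption, not a kernel structure). */
struct model_node {
	int nr_cpus;          /* online cpus on this node */
	unsigned long mem_mb; /* online memory on this node, in MB */
};

/*
 * Start from "node 0 is online" (what generic code assumes at boot),
 * force node 0 offline as the patch does, then bring online every node
 * that has cpus or memory. Node 0 comes back only if it has resources.
 */
uint32_t model_online_mask(const struct model_node *nodes, int nr_nodes)
{
	uint32_t online = UINT32_C(1) << 0;

	assert(nr_nodes >= 0 && nr_nodes <= MODEL_MAX_NODES);

	online &= ~(UINT32_C(1) << 0); /* models node_set_offline(0) */

	for (int nid = 0; nid < nr_nodes; nid++) {
		if (nodes[nid].nr_cpus > 0 || nodes[nid].mem_mb > 0)
			online |= UINT32_C(1) << nid;
	}
	return online;
}

/* Counts the set bits in the online mask. */
int model_count_online(uint32_t online)
{
	int count = 0;

	for (; online; online >>= 1)
		count += (int)(online & 1u);
	return count;
}

/* Assumed rule for this sketch: balancing only makes sense with >1 node. */
bool model_numa_balancing_enabled(uint32_t online)
{
	return model_count_online(online) > 1;
}
```

Feeding in the first system from the commit message (nodes 0 and 1 empty, node 2 with 8 cpus and 32625 MB) yields a mask with only bit 2 set and balancing disabled; a system where node 0 has cpus and memory keeps node 0 online, matching the "same o/p with and without patch" example.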
arch/powerpc/mm/numa.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 10c5064eeb88..0d72a7d4360e 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -924,6 +924,16 @@ void __init mem_topology_setup(void)
{
int cpu;
+ /*
+ * Linux/mm assumes node 0 to be online at boot. However this is not
+ * true on PowerPC, where node 0 is similar to any other node, it
+ * could be cpuless, memoryless node. So force node 0 to be offline
+ * for now. This will prevent cpuless, memoryless node 0 showing up
+ * unnecessarily as online. If a node has cpus or memory that need
+ * to be online, then node will anyway be marked online.
+ */
+ node_set_offline(0);
+
if (parse_numa_properties())
setup_nonnuma();
--
2.18.1
* Re: [PATCH v6 0/3] Offline memoryless cpuless node 0
2020-08-18 8:11 [PATCH v6 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
` (2 preceding siblings ...)
2020-08-18 8:11 ` [PATCH v6 3/3] powerpc/numa: Offline memoryless cpuless node 0 Srikar Dronamraju
@ 2020-09-17 11:27 ` Michael Ellerman
3 siblings, 0 replies; 5+ messages in thread
From: Michael Ellerman @ 2020-09-17 11:27 UTC (permalink / raw)
To: Michael Ellerman, Srikar Dronamraju
Cc: Andrew Morton, David Hildenbrand, Christopher Lameter,
linuxppc-dev, Linus Torvalds, Vlastimil Babka,
Satheesh Rajendran, Gautham R Shenoy, linux-mm, Aneesh Kumar K V,
Michal Hocko, Mel Gorman
On Tue, 18 Aug 2020 13:41:01 +0530, Srikar Dronamraju wrote:
> Changelog v5:->v6:
> - Now the fix is Powerpc specific.
> (David Hildenbrand, Michal Hocko, Christopher Lameter)
> - rebased to v5.8
> link v5: https://lore.kernel.org/linuxppc-dev/20200624092846.9194-1-srikar@linux.vnet.ibm.com/t/#u
>
> Changelog v4:->v5:
> - rebased to v5.8
> link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u
>
> [...]
Applied to powerpc/next.
[1/3] powerpc/numa: Set numa_node for all possible cpus
https://git.kernel.org/powerpc/c/a874f1005ef5dfe53dfd8cda59a6600e89986ecd
[2/3] powerpc/numa: Prefer node id queried from vphn
https://git.kernel.org/powerpc/c/6398eaa268168b528dd1d3d0e70e61e9c13bea23
[3/3] powerpc/numa: Offline memoryless cpuless node 0
https://git.kernel.org/powerpc/c/e75130f20b1f48e04ccc806aea01f0a361f9cb6b
cheers