* [Patch: 000/006] pgdat allocation for new node add
@ 2006-04-20 10:03 Yasunori Goto
2006-04-20 10:10 ` [Patch: 001/006] pgdat allocation for new node add (specify node id) Yasunori Goto
` (5 more replies)
0 siblings, 6 replies; 11+ messages in thread
From: Yasunori Goto @ 2006-04-20 10:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm, Yasunori Goto
Hello.
These patches are part of the node hot-add series, v4.
When a new node is added, this patch set allocates and initializes a new pgdat for it.
The set includes:
- specifying the node id in add_memory().
- starting kswapd for the new node.
- allocating a pgdat and registering its address in node_data[].
This set includes the node_data[] updater for generic archs.
Ia64 keeps copies of node_data[] on each node,
but this patch set doesn't include the patches to update them.
I'll post them later.
This patch set is against 2.6.17-rc1-mm3.
Please apply.
------------------------------------------------------------
Changes since v4 of node hot-add:
- the generic pgdat allocation part is picked up.
- updated for 2.6.17-rc1-mm3.
The v4 posting is here:
<description>
http://marc.theaimsgroup.com/?l=linux-mm&m=114258404023573&w=2
<patches>
http://marc.theaimsgroup.com/?l=linux-mm&w=2&r=1&s=memory+hotplug+node+v.4.&q=b
--
Yasunori Goto
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [Patch: 001/006] pgdat allocation for new node add (specify node id)
2006-04-20 10:03 [Patch: 000/006] pgdat allocation for new node add Yasunori Goto
@ 2006-04-20 10:10 ` Yasunori Goto
2006-04-20 22:49 ` Andrew Morton
2006-04-20 10:10 ` [Patch: 002/006] pgdat allocation for new node add (get node id by acpi) Yasunori Goto
` (4 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
This patch renames the old add_memory() to arch_add_memory()
and uses the node id to get the node's pgdat via NODE_DATA().
Note: Powerpc's old add_memory() is defined as __devinit. However,
add_memory() is usually called only after bootup,
so I suppose the annotation may be redundant. But I'm not familiar with powerpc,
so I keep it. (__meminit would be better, at least.)
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
arch/i386/mm/init.c | 2 +-
arch/ia64/mm/init.c | 4 ++--
arch/powerpc/mm/mem.c | 9 ++++++---
arch/x86_64/mm/init.c | 6 +++---
drivers/acpi/acpi_memhotplug.c | 3 ++-
drivers/base/memory.c | 4 +++-
include/linux/memory_hotplug.h | 13 ++++++++++++-
mm/memory_hotplug.c | 10 ++++++++++
8 files changed, 39 insertions(+), 12 deletions(-)
Index: pgdat11/arch/i386/mm/init.c
===================================================================
--- pgdat11.orig/arch/i386/mm/init.c 2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/arch/i386/mm/init.c 2006-04-20 16:08:21.000000000 +0900
@@ -654,7 +654,7 @@ void __init mem_init(void)
*/
#ifdef CONFIG_MEMORY_HOTPLUG
#ifndef CONFIG_NEED_MULTIPLE_NODES
-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size)
{
struct pglist_data *pgdata = &contig_page_data;
struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
Index: pgdat11/arch/ia64/mm/init.c
===================================================================
--- pgdat11.orig/arch/ia64/mm/init.c 2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/arch/ia64/mm/init.c 2006-04-20 16:04:14.000000000 +0900
@@ -652,7 +652,7 @@ void online_page(struct page *page)
num_physpages++;
}
-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size)
{
pg_data_t *pgdat;
struct zone *zone;
@@ -660,7 +660,7 @@ int add_memory(u64 start, u64 size)
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
- pgdat = NODE_DATA(0);
+ pgdat = NODE_DATA(nid);
zone = pgdat->node_zones + ZONE_NORMAL;
ret = __add_pages(zone, start_pfn, nr_pages);
Index: pgdat11/arch/powerpc/mm/mem.c
===================================================================
--- pgdat11.orig/arch/powerpc/mm/mem.c 2006-04-20 10:59:54.000000000 +0900
+++ pgdat11/arch/powerpc/mm/mem.c 2006-04-20 16:06:58.000000000 +0900
@@ -114,15 +114,18 @@ void online_page(struct page *page)
num_physpages++;
}
-int __devinit add_memory(u64 start, u64 size)
+int memory_add_physaddr_to_nid(u64 start)
+{
+ return hot_add_scn_to_nid(start);
+}
+
+int __devinit arch_add_memory(int nid, u64 start, u64 size)
{
struct pglist_data *pgdata;
struct zone *zone;
- int nid;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- nid = hot_add_scn_to_nid(start);
pgdata = NODE_DATA(nid);
start = (unsigned long)__va(start);
Index: pgdat11/arch/x86_64/mm/init.c
===================================================================
--- pgdat11.orig/arch/x86_64/mm/init.c 2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/arch/x86_64/mm/init.c 2006-04-20 16:10:38.000000000 +0900
@@ -552,9 +552,9 @@ int __add_pages(struct zone *z, unsigned
* Memory is added always to NORMAL zone. This means you will never get
* additional DMA/DMA32 memory.
*/
-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size)
{
- struct pglist_data *pgdat = NODE_DATA(0);
+ struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones + MAX_NR_ZONES-2;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -571,7 +571,7 @@ error:
printk("%s: Problem encountered in __add_pages!\n", __func__);
return ret;
}
-EXPORT_SYMBOL_GPL(add_memory);
+EXPORT_SYMBOL_GPL(arch_add_memory);
int remove_memory(u64 start, u64 size)
{
Index: pgdat11/drivers/acpi/acpi_memhotplug.c
===================================================================
--- pgdat11.orig/drivers/acpi/acpi_memhotplug.c 2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/drivers/acpi/acpi_memhotplug.c 2006-04-20 16:35:24.000000000 +0900
@@ -215,6 +215,7 @@ static int acpi_memory_enable_device(str
{
int result, num_enabled = 0;
struct acpi_memory_info *info;
+ int node = 0;
ACPI_FUNCTION_TRACE("acpi_memory_enable_device");
@@ -244,7 +245,7 @@ static int acpi_memory_enable_device(str
continue;
}
- result = add_memory(info->start_addr, info->length);
+ result = add_memory(node, info->start_addr, info->length);
if (result)
continue;
info->enabled = 1;
Index: pgdat11/include/linux/memory_hotplug.h
===================================================================
--- pgdat11.orig/include/linux/memory_hotplug.h 2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/include/linux/memory_hotplug.h 2006-04-20 16:35:23.000000000 +0900
@@ -63,6 +63,16 @@ extern int online_pages(unsigned long, u
/* reasonably generic interface to expand the physical pages in a zone */
extern int __add_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
+
+#if defined(CONFIG_NUMA)
+extern int memory_add_physaddr_to_nid(u64 start);
+#else
+static inline int memory_add_physaddr_to_nid(u64 start)
+{
+ return 0;
+}
+#endif
+
#else /* ! CONFIG_MEMORY_HOTPLUG */
/*
* Stub functions for when hotplug is off
@@ -99,7 +109,8 @@ static inline int __remove_pages(struct
return -ENOSYS;
}
-extern int add_memory(u64 start, u64 size);
+extern int add_memory(int nid, u64 start, u64 size);
+extern int arch_add_memory(int nid, u64 start, u64 size);
extern int remove_memory(u64 start, u64 size);
#endif /* __LINUX_MEMORY_HOTPLUG_H */
Index: pgdat11/mm/memory_hotplug.c
===================================================================
--- pgdat11.orig/mm/memory_hotplug.c 2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/mm/memory_hotplug.c 2006-04-20 16:35:53.000000000 +0900
@@ -159,3 +159,13 @@ int online_pages(unsigned long pfn, unsi
return 0;
}
+
+int add_memory(int nid, u64 start, u64 size)
+{
+ int ret;
+
+ /* call arch's memory hotadd */
+ ret = arch_add_memory(nid, start, size;
+
+ return ret;
+}
Index: pgdat11/drivers/base/memory.c
===================================================================
--- pgdat11.orig/drivers/base/memory.c 2006-04-20 10:59:54.000000000 +0900
+++ pgdat11/drivers/base/memory.c 2006-04-20 11:00:09.000000000 +0900
@@ -306,11 +306,13 @@ static ssize_t
memory_probe_store(struct class *class, const char *buf, size_t count)
{
u64 phys_addr;
+ int nid;
int ret;
phys_addr = simple_strtoull(buf, NULL, 0);
- ret = add_memory(phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+ nid = memory_add_physaddr_to_nid(phys_addr);
+ ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
if (ret)
count = ret;
--
Yasunori Goto
* [Patch: 002/006] pgdat allocation for new node add (get node id by acpi)
2006-04-20 10:03 [Patch: 000/006] pgdat allocation for new node add Yasunori Goto
2006-04-20 10:10 ` [Patch: 001/006] pgdat allocation for new node add (specify node id) Yasunori Goto
@ 2006-04-20 10:10 ` Yasunori Goto
2006-04-20 10:10 ` [Patch: 003/006] pgdat allocation for new node add (generic alloc node_data) Yasunori Goto
` (3 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
This patch finds the node id from the ACPI handle of the memory device in the DSDT.
The _PXM for the new node can be obtained with acpi_get_pxm()
using the new memory's handle,
so the node id can then be found via pxm_to_nid_map[].
This patch is simpler than in v2 of the node hot-add patches.
The old add_memory() function had no node id parameter,
so the kernel had to find the handle again from the physical address via the DSDT.
Since v3, the node id is just passed to add_memory().
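The lookup described above can be sketched in plain userspace C. This is only an illustration under assumptions: `fake_handle`, `fake_get_pxm()`, `pxm_to_node[]` and `fake_get_node()` are hypothetical stand-ins and not ACPI interfaces. Only the shape follows the patch: read _PXM from the handle, then map it to a node id, returning -1 on failure.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PXM 8

/* Hypothetical stand-in for an ACPI device handle carrying a _PXM value. */
struct fake_handle {
	int has_pxm;	/* non-zero if the device has a _PXM object */
	int pxm;	/* proximity domain reported by firmware */
};

/* Hypothetical proximity-domain -> node id table; -1 means "not mapped". */
static int pxm_to_node[MAX_PXM] = { 0, 1, -1, 2, -1, -1, -1, -1 };

/* Plays the role of acpi_get_pxm(): negative result when there is no _PXM. */
static int fake_get_pxm(const struct fake_handle *h)
{
	if (h == NULL || !h->has_pxm)
		return -1;
	return h->pxm;
}

/* Same shape as acpi_get_node() in the patch: handle -> _PXM -> node id, or -1. */
static int fake_get_node(const struct fake_handle *h)
{
	int pxm = fake_get_pxm(h);

	if (pxm < 0 || pxm >= MAX_PXM)
		return -1;
	assert(pxm_to_node[pxm] >= -1);
	return pxm_to_node[pxm];
}
```

A caller that gets -1 back has no valid node id and should not pass it on as one.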
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
drivers/acpi/acpi_memhotplug.c | 3 ++-
drivers/acpi/numa.c | 15 +++++++++++++++
include/linux/acpi.h | 6 ++++++
3 files changed, 23 insertions(+), 1 deletion(-)
Index: pgdat11/drivers/acpi/acpi_memhotplug.c
===================================================================
--- pgdat11.orig/drivers/acpi/acpi_memhotplug.c 2006-04-20 11:00:09.000000000 +0900
+++ pgdat11/drivers/acpi/acpi_memhotplug.c 2006-04-20 11:00:17.000000000 +0900
@@ -215,7 +215,7 @@ static int acpi_memory_enable_device(str
{
int result, num_enabled = 0;
struct acpi_memory_info *info;
- int node = 0;
+ int node;
ACPI_FUNCTION_TRACE("acpi_memory_enable_device");
@@ -227,6 +227,7 @@ static int acpi_memory_enable_device(str
return result;
}
+ node = acpi_get_node(mem_device->handle);
/*
* Tell the VM there is more memory here...
* Note: Assume that this function returns zero on success
Index: pgdat11/drivers/acpi/numa.c
===================================================================
--- pgdat11.orig/drivers/acpi/numa.c 2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/drivers/acpi/numa.c 2006-04-20 11:00:17.000000000 +0900
@@ -256,3 +256,18 @@ int acpi_get_pxm(acpi_handle h)
}
EXPORT_SYMBOL(acpi_get_pxm);
+
+int acpi_get_node(acpi_handle *handle)
+{
+ int pxm, node = -1;
+
+ ACPI_FUNCTION_TRACE("acpi_get_node");
+
+ pxm = acpi_get_pxm(handle);
+ if (pxm >= 0)
+ node = acpi_map_pxm_to_node(pxm);
+
+ return_VALUE(node);
+}
+
+EXPORT_SYMBOL(acpi_get_node);
Index: pgdat11/include/linux/acpi.h
===================================================================
--- pgdat11.orig/include/linux/acpi.h 2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/include/linux/acpi.h 2006-04-20 11:00:17.000000000 +0900
@@ -529,12 +529,18 @@ static inline void acpi_set_cstate_limit
#ifdef CONFIG_ACPI_NUMA
int acpi_get_pxm(acpi_handle handle);
+int acpi_get_node(acpi_handle *handle);
#else
static inline int acpi_get_pxm(acpi_handle handle)
{
return 0;
}
+static inline int acpi_get_node(acpi_handle *handle)
+{
+ return 0;
+}
#endif
+extern int acpi_paddr_to_node(u64 start_addr, u64 size);
extern int pnpacpi_disabled;
--
Yasunori Goto
* [Patch: 003/006] pgdat allocation for new node add (generic alloc node_data)
2006-04-20 10:03 [Patch: 000/006] pgdat allocation for new node add Yasunori Goto
2006-04-20 10:10 ` [Patch: 001/006] pgdat allocation for new node add (specify node id) Yasunori Goto
2006-04-20 10:10 ` [Patch: 002/006] pgdat allocation for new node add (get node id by acpi) Yasunori Goto
@ 2006-04-20 10:10 ` Yasunori Goto
2006-04-20 23:01 ` Andrew Morton
2006-04-20 10:10 ` [Patch: 004/006] pgdat allocation for new node add (refresh node_data[]) Yasunori Goto
` (2 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
For node hotplug, we basically have to allocate a new pgdat.
But there are several styles of pgdat implementation.
1. Allocate only the pgdat.
This style allocates only the pgdat area,
and its address is recorded in node_data[].
It is the most common style.
2. Static array of pgdats.
In this case, all pgdats live in a static array.
Some archs use this style.
3. Allocate not only the pgdat, but also per-node data.
To increase performance, each node has a copy of some data as
per-node data, so this area must be allocated too.
Ia64 uses this style. Ia64 keeps a copy of the node_data[] array
in each node's per-node data to increase performance.
This series of patches treats (1) as the generic arch case.
Generic archs can use the generic functions. (2) and (3) should have
their own if necessary.
This patch defines the pgdat allocator.
Updating the NODE_DATA() macro is in another patch.
( I'll post another patch for (3).
I don't know of any arch using (2) that can use memory hotplug,
so there is no patch for (2). )
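As a rough illustration of style (1), here is a userspace sketch. All names prefixed `sketch_` are hypothetical, the `pg_data_t` here is a heavily reduced stand-in, and calloc() stands in for kzalloc(); it shows only the idea of a zeroed allocation whose address is published in node_data[].

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NUMNODES 4

/* Hypothetical, heavily reduced pgdat; the real pg_data_t is much larger. */
typedef struct pglist_data {
	int node_id;
	unsigned long node_present_pages;
} pg_data_t;

/* Style (1): node_data[] holds only the address of each node's pgdat. */
static pg_data_t *node_data[MAX_NUMNODES];

/* Zeroed allocation, standing in for kzalloc(sizeof(pg_data_t), GFP_KERNEL). */
static pg_data_t *sketch_alloc_nodedata(int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	return calloc(1, sizeof(pg_data_t));
}

/* Only needed on the error path of node hot-add, as in the patch. */
static void sketch_free_nodedata(pg_data_t *pgdat)
{
	free(pgdat);
}

/* Publishing the pointer is what makes a NODE_DATA(nid)-style lookup work. */
static void sketch_refresh_nodedata(int nid, pg_data_t *pgdat)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	node_data[nid] = pgdat;
}
```

Styles (2) and (3) would need different allocators, which is why the patch routes callers through arch_alloc_nodedata() and arch_free_nodedata().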
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memory_hotplug.h | 55 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 55 insertions(+)
Index: pgdat11/include/linux/memory_hotplug.h
===================================================================
--- pgdat11.orig/include/linux/memory_hotplug.h 2006-04-20 11:00:09.000000000 +0900
+++ pgdat11/include/linux/memory_hotplug.h 2006-04-20 11:00:23.000000000 +0900
@@ -73,6 +73,61 @@ static inline int memory_add_physaddr_to
}
#endif
+#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
+/*
+ * To support node-hotadd, we have to allocate a new pgdat.
+ *
+ * If an arch has a generic style NODE_DATA(),
+ * node_data[nid] = kzalloc() works well. But it depends on each arch.
+ *
+ * In general, generic_alloc_nodedata() is used.
+ * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
+ *
+ */
+static inline pg_data_t * arch_alloc_nodedata(int nid)
+{
+ return NULL;
+}
+static inline void arch_free_nodedata(pg_data_t *pgdat)
+{
+}
+
+#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
+#define arch_alloc_nodedata(nid) generic_alloc_nodedata(nid)
+#define arch_free_nodedata(pgdat) generic_free_nodedata(pgdat)
+
+#ifdef CONFIG_NUMA
+/*
+ * If CONFIG_HAVE_ARCH_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
+ * XXX: kmalloc_node() can't be used to get the new node's memory at this time,
+ * because the pgdat for the new node is not allocated/initialized yet itself.
+ * To use the new node's memory, more consideration will be necessary.
+ */
+#define generic_alloc_nodedata(nid) \
+({ \
+ (pg_data_t *)kzalloc(sizeof(pg_data_t), GFP_KERNEL); \
+})
+/*
+ * This definition is just for error path in node hotadd.
+ * For node hotremove, we have to replace this.
+ */
+#define generic_free_nodedata(pgdat) kfree(pgdat)
+
+#else /* !CONFIG_NUMA */
+
+/* never called */
+static inline pg_data_t *generic_alloc_nodedata(int nid)
+{
+ BUG();
+ return NULL;
+}
+static inline void generic_free_nodedata(pg_data_t *pgdat)
+{
+}
+#endif /* CONFIG_NUMA */
+#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
#else /* ! CONFIG_MEMORY_HOTPLUG */
/*
* Stub functions for when hotplug is off
--
Yasunori Goto
* [Patch: 004/006] pgdat allocation for new node add (refresh node_data[])
2006-04-20 10:03 [Patch: 000/006] pgdat allocation for new node add Yasunori Goto
` (2 preceding siblings ...)
2006-04-20 10:10 ` [Patch: 003/006] pgdat allocation for new node add (generic alloc node_data) Yasunori Goto
@ 2006-04-20 10:10 ` Yasunori Goto
2006-04-20 10:10 ` [Patch: 005/006] pgdat allocation for new node add (export kswapd start func) Yasunori Goto
2006-04-20 10:10 ` [Patch: 006/006] pgdat allocation for new node add (call pgdat allocation) Yasunori Goto
5 siblings, 0 replies; 11+ messages in thread
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
This function refreshes NODE_DATA() for generic archs.
In this case, NODE_DATA(nid) == node_data[nid].
node_data[] is an array of pgdat addresses,
so the refresh is quite simple.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
arch/ia64/Kconfig | 4 ++++
include/linux/memory_hotplug.h | 12 ++++++++++++
2 files changed, 16 insertions(+)
Index: pgdat11/include/linux/memory_hotplug.h
===================================================================
--- pgdat11.orig/include/linux/memory_hotplug.h 2006-04-20 11:00:23.000000000 +0900
+++ pgdat11/include/linux/memory_hotplug.h 2006-04-20 11:00:28.000000000 +0900
@@ -91,6 +91,9 @@ static inline pg_data_t * arch_alloc_nod
static inline void arch_free_nodedata(pg_data_t *pgdat)
{
}
+static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+}
#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
@@ -114,6 +117,12 @@ static inline void arch_free_nodedata(pg
*/
#define generic_free_nodedata(pgdat) kfree(pgdat)
+extern pg_data_t *node_data[];
+static inline void generic_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+ node_data[nid] = pgdat;
+}
+
#else /* !CONFIG_NUMA */
/* never called */
@@ -125,6 +134,9 @@ static inline pg_data_t *generic_alloc_n
static inline void generic_free_nodedata(pg_data_t *pgdat)
{
}
+static inline void generic_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+}
#endif /* CONFIG_NUMA */
#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
Index: pgdat11/arch/ia64/Kconfig
===================================================================
--- pgdat11.orig/arch/ia64/Kconfig 2006-04-20 11:00:04.000000000 +0900
+++ pgdat11/arch/ia64/Kconfig 2006-04-20 11:00:28.000000000 +0900
@@ -374,6 +374,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
def_bool y
depends on NEED_MULTIPLE_NODES
+config HAVE_ARCH_NODEDATA_EXTENSION
+ def_bool y
+ depends on NUMA
+
config IA32_SUPPORT
bool "Support for Linux/x86 binaries"
help
--
Yasunori Goto
* [Patch: 005/006] pgdat allocation for new node add (export kswapd start func)
2006-04-20 10:03 [Patch: 000/006] pgdat allocation for new node add Yasunori Goto
` (3 preceding siblings ...)
2006-04-20 10:10 ` [Patch: 004/006] pgdat allocation for new node add (refresh node_data[]) Yasunori Goto
@ 2006-04-20 10:10 ` Yasunori Goto
2006-04-20 10:10 ` [Patch: 006/006] pgdat allocation for new node add (call pgdat allocation) Yasunori Goto
5 siblings, 0 replies; 11+ messages in thread
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
When a node is hot-added, kswapd for the node should be started.
This patch exports the kswapd start function as kswapd_run() for use in add_memory().
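The "start once per node" behaviour can be sketched in userspace C as follows. This is an assumption-laden mock: `node_state`, `mock_spawn()` and `sketch_kswapd_run()` are hypothetical names, and no real thread is created; the `worker` field merely plays the role of pgdat->kswapd.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES 4

/* Hypothetical per-node state; "worker" plays the role of pgdat->kswapd. */
struct node_state {
	const char *worker;
};

static struct node_state nodes[MAX_NUMNODES];
static int spawn_calls;		/* how many times a worker was created */
static int spawn_fails;		/* test knob: make the next spawn fail */

/* Stand-in for thread creation; returns NULL on failure. */
static const char *mock_spawn(int nid)
{
	(void)nid;
	if (spawn_fails)
		return NULL;
	spawn_calls++;
	return "running";
}

/* Start the node's worker once; calling it again for the same node is a no-op. */
static int sketch_kswapd_run(int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);

	if (nodes[nid].worker)
		return 0;	/* already running */

	nodes[nid].worker = mock_spawn(nid);
	if (!nodes[nid].worker)
		return -1;	/* the caller can roll back the hot-add */
	return 0;
}
```

Because the function is safe to call repeatedly, the same entry point can serve both the boot-time loop over online nodes and the hot-add path.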
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/swap.h | 2 ++
mm/vmscan.c | 35 ++++++++++++++++++++++++++---------
2 files changed, 28 insertions(+), 9 deletions(-)
Index: pgdat11/mm/vmscan.c
===================================================================
--- pgdat11.orig/mm/vmscan.c 2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/mm/vmscan.c 2006-04-20 11:00:33.000000000 +0900
@@ -35,6 +35,7 @@
#include <linux/notifier.h>
#include <linux/rwsem.h>
#include <linux/delay.h>
+#include <linux/kthread.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1353,20 +1354,36 @@ static int __devinit cpu_callback(struct
}
#endif /* CONFIG_HOTPLUG_CPU */
+/*
+ * This kswapd start function will be called by init and node-hot-add.
+ * On node-hot-add, kswapd will be moved to proper cpus if cpus are hot-added.
+ */
+int kswapd_run(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int ret = 0;
+
+ if (pgdat->kswapd)
+ return 0;
+
+ pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
+ if (pgdat->kswapd == ERR_PTR(-ENOMEM)) {
+ /* failure at boot is fatal */
+ BUG_ON(system_state == SYSTEM_BOOTING);
+ printk("failed to run kswapd on node %d\n",nid);
+ ret = -1;
+ }
+ return ret;
+}
+
static int __init kswapd_init(void)
{
- pg_data_t *pgdat;
+ int nid;
swap_setup();
- for_each_online_pgdat(pgdat) {
- pid_t pid;
+ for_each_online_node(nid)
+ kswapd_run(nid);
- pid = kernel_thread(kswapd, pgdat, CLONE_KERNEL);
- BUG_ON(pid < 0);
- read_lock(&tasklist_lock);
- pgdat->kswapd = find_task_by_pid(pid);
- read_unlock(&tasklist_lock);
- }
total_memory = nr_free_pagecache_pages();
hotcpu_notifier(cpu_callback, 0);
return 0;
Index: pgdat11/include/linux/swap.h
===================================================================
--- pgdat11.orig/include/linux/swap.h 2006-04-20 11:00:07.000000000 +0900
+++ pgdat11/include/linux/swap.h 2006-04-20 11:00:33.000000000 +0900
@@ -212,6 +212,8 @@ static inline int zone_reclaim(struct zo
}
#endif
+extern int kswapd_run(int nid);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
--
Yasunori Goto
* [Patch: 006/006] pgdat allocation for new node add (call pgdat allocation)
2006-04-20 10:03 [Patch: 000/006] pgdat allocation for new node add Yasunori Goto
` (4 preceding siblings ...)
2006-04-20 10:10 ` [Patch: 005/006] pgdat allocation for new node add (export kswapd start func) Yasunori Goto
@ 2006-04-20 10:10 ` Yasunori Goto
5 siblings, 0 replies; 11+ messages in thread
From: Yasunori Goto @ 2006-04-20 10:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm
This patch adds node-hot-add support to add_memory().
Node hot-add uses this sequence:
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
because its update phase is different.)
Note:
To share as much common code as possible,
there are 2 changes from v2.
- The old add_memory(), which is defined by each arch,
is renamed to arch_add_memory(). The new add_memory() becomes
the common-code caller of the arch-dependent function.
- This patch changes add_memory()'s interface
From: add_memory(start, size)
To  : add_memory(nid, start, size).
The old interface caused similar code for finding the node id from
the physical address to be duplicated inside the old add_memory() on each arch.
In addition, the acpi memory hotplug driver can find the node id more easily.
In v2, it had to walk the DSDT's _CRS, matching the physical address to
get the handle of the memory device, and then get _PXM and the node id,
because the input was just a physical address.
However, in v3, the acpi driver can use the handle to get _PXM and the node id
for the new memory device. It can pass just the node id to add_memory().
The fix for the interface of arch_add_memory() is in the next patch.
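The allocate, publish, add, online sequence and its rollback can be sketched in userspace C. This is a simplified mock under assumptions: `sketch_add_memory()`, `mock_arch_add_memory()`, `node_online_map[]` and `arch_add_result` are hypothetical names, calloc() stands in for the pgdat allocator, and zone initialization, sysfs and kswapd are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_NUMNODES 4

typedef unsigned long long u64;
typedef struct pglist_data { int node_id; } pg_data_t;

static pg_data_t *node_data[MAX_NUMNODES];
static int node_online_map[MAX_NUMNODES];
static int arch_add_result;	/* test knob: value the mock arch hook returns */

/* Stand-in for the arch-dependent hot-add function. */
static int mock_arch_add_memory(int nid, u64 start, u64 size)
{
	(void)nid; (void)start; (void)size;
	return arch_add_result;
}

static int sketch_add_memory(int nid, u64 start, u64 size)
{
	pg_data_t *pgdat = NULL;
	int new_pgdat = 0;
	int ret;

	assert(nid >= 0 && nid < MAX_NUMNODES);

	if (!node_online_map[nid]) {
		pgdat = calloc(1, sizeof(*pgdat));	/* allocate pgdat */
		if (!pgdat)
			return -ENOMEM;
		pgdat->node_id = nid;
		node_data[nid] = pgdat;			/* refresh node_data[] */
		new_pgdat = 1;
	}

	ret = mock_arch_add_memory(nid, start, size);	/* add memory */
	if (ret < 0)
		goto error;

	node_online_map[nid] = 1;			/* set node online */
	return ret;

error:
	/* Undo only what this call created; an existing node is left alone. */
	if (new_pgdat) {
		node_data[nid] = NULL;
		free(pgdat);
	}
	return ret;
}
```

Tracking `new_pgdat` separately means a failure while adding memory to an already online node never frees that node's pgdat.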
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memory_hotplug.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 52 insertions(+)
Index: pgdat11/mm/memory_hotplug.c
===================================================================
--- pgdat11.orig/mm/memory_hotplug.c 2006-04-20 16:36:38.000000000 +0900
+++ pgdat11/mm/memory_hotplug.c 2006-04-20 17:09:39.000000000 +0900
@@ -160,12 +160,64 @@ int online_pages(unsigned long pfn, unsi
return 0;
}
+static pg_data_t *hotadd_new_pgdat(int nid, u64 start)
+{
+ struct pglist_data *pgdat;
+ unsigned long zones_size[MAX_NR_ZONES] = {0};
+ unsigned long zholes_size[MAX_NR_ZONES] = {0};
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+
+ pgdat = arch_alloc_nodedata(nid);
+ if (!pgdat)
+ return NULL;
+
+ arch_refresh_nodedata(nid, pgdat);
+
+ /* we can use NODE_DATA(nid) from here */
+
+ /* init node's zones as empty zones, we don't have any present pages.*/
+ free_area_init_node(nid, pgdat, zones_size, start_pfn, zholes_size);
+
+ return pgdat;
+}
+
+static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
+{
+ arch_refresh_nodedata(nid, NULL);
+ arch_free_nodedata(pgdat);
+ return;
+}
+
int add_memory(int nid, u64 start, u64 size)
{
+ pg_data_t *pgdat = NULL;
+ int new_pgdat = 0;
int ret;
+ if (!node_online(nid)) {
+ pgdat = hotadd_new_pgdat(nid, start);
+ if (!pgdat)
+ return -ENOMEM;
+ new_pgdat = 1;
+ ret = kswapd_run(nid);
+ if (ret)
+ goto error;
+ }
+
/* call arch's memory hotadd */
ret = arch_add_memory(nid, start, size);
+ if (ret < 0)
+ goto error;
+
+ /* we online node here. we have no error path from here. */
+ node_set_online(nid);
+
+ return ret;
+error:
+ /* rollback pgdat allocation and others */
+ if (new_pgdat)
+ rollback_node_hotadd(nid, pgdat);
+
return ret;
}
--
Yasunori Goto
* Re: [Patch: 001/006] pgdat allocation for new node add (specify node id)
2006-04-20 10:10 ` [Patch: 001/006] pgdat allocation for new node add (specify node id) Yasunori Goto
@ 2006-04-20 22:49 ` Andrew Morton
2006-04-20 23:38 ` Yasunori Goto
0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2006-04-20 22:49 UTC (permalink / raw)
To: Yasunori Goto; +Cc: linux-kernel, linux-mm
Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> +int add_memory(int nid, u64 start, u64 size)
> +{
> + int ret;
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, start, size;
> +
> + return ret;
> +}
So this patch is missing a ), but your later patch which touches this code
actually has the ). Which tells me that this isn't the correct version of
this patch.
I'll fix that all up, but I would ask you to carefully verify that the
patches which I merged are the ones which you meant to send, thanks.
* Re: [Patch: 003/006] pgdat allocation for new node add (generic alloc node_data)
2006-04-20 10:10 ` [Patch: 003/006] pgdat allocation for new node add (generic alloc node_data) Yasunori Goto
@ 2006-04-20 23:01 ` Andrew Morton
2006-04-21 0:23 ` Yasunori Goto
0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2006-04-20 23:01 UTC (permalink / raw)
To: Yasunori Goto; +Cc: linux-kernel, linux-mm
Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> +#define generic_alloc_nodedata(nid) \
> +({ \
> + (pg_data_t *)kzalloc(sizeof(pg_data_t), GFP_KERNEL); \
> +})
In general, library functions which perform memory allocation should not
make assumptions about which gfp_t they are allowed to use.
So this really should be `generic_alloc_nodedata(nid, gfp_mask)'.
However, it's very desirable that memory allocations use GFP_KERNEL rather
than, say, GFP_ATOMIC. So your interface here _forces_ callers to be in a
state where GFP_KERNEL is legal, which is good discipline.
Although if that turns out to be a problem, we can expect to see a sad
little patch from someone which tries to change this to GFP_ATOMIC, which
makes everything worse - even those callers who _can_ use GFP_KERNEL.
(In practice, NUMA developers seem to never test with sufficient
CONFIG_DEBUG_* flags enabled, and with CONFIG_PREEMPT, so they happily
don't get to discover their sleep-in-spinlock bugs anyway).
Anyway, on balance, I think it'd be best to convert this API to take a
gfp_t as well.
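A minimal sketch of what such an interface could look like, in userspace C. The names here are assumptions for illustration only: `gfp_t` is a local stand-in typedef, the `SKETCH_GFP_*` values are made up, and `mock_kzalloc()` wraps calloc() while recording the flags it was given.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int gfp_t;		/* stand-in for the kernel type */
#define SKETCH_GFP_KERNEL 0x1u		/* hypothetical flag values */
#define SKETCH_GFP_ATOMIC 0x2u

typedef struct pglist_data { int node_id; } pg_data_t;

static gfp_t last_gfp_mask;	/* records what the caller asked for */

/* Stand-in for kzalloc(): zeroed allocation plus the caller's flags. */
static void *mock_kzalloc(size_t size, gfp_t flags)
{
	last_gfp_mask = flags;
	return calloc(1, size);
}

/* The allocator no longer assumes a context; the caller states it. */
static pg_data_t *sketch_alloc_nodedata(int nid, gfp_t gfp_mask)
{
	assert(nid >= 0);
	return mock_kzalloc(sizeof(pg_data_t), gfp_mask);
}
```

With this shape, a caller that is allowed to sleep keeps passing the GFP_KERNEL equivalent, and only a caller that cannot sleep has to pick something else.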
* Re: [Patch: 001/006] pgdat allocation for new node add (specify node id)
2006-04-20 22:49 ` Andrew Morton
@ 2006-04-20 23:38 ` Yasunori Goto
0 siblings, 0 replies; 11+ messages in thread
From: Yasunori Goto @ 2006-04-20 23:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm
> Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
> >
> > +int add_memory(int nid, u64 start, u64 size)
> > +{
> > + int ret;
> > +
> > + /* call arch's memory hotadd */
> > + ret = arch_add_memory(nid, start, size;
> > +
> > + return ret;
> > +}
>
> So this patch is missing a ), but your later patch which touches this code
> actually has the ). Which tells me that this isn't the correct version of
> this patch.
>
> I'll fix that all up, but I would ask you to carefully verify that the
> patches which I merged are the ones which you meant to send, thanks.
Oops. I thought I had fixed it, but I made a mistake.
Sorry.
--
Yasunori Goto
* Re: [Patch: 003/006] pgdat allocation for new node add (generic alloc node_data)
2006-04-20 23:01 ` Andrew Morton
@ 2006-04-21 0:23 ` Yasunori Goto
0 siblings, 0 replies; 11+ messages in thread
From: Yasunori Goto @ 2006-04-21 0:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm
> Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
> >
> > +#define generic_alloc_nodedata(nid) \
> > +({ \
> > + (pg_data_t *)kzalloc(sizeof(pg_data_t), GFP_KERNEL); \
> > +})
>
> In general, library functions which perform memory allocation should not
> make assumptions about which gfp_t they are allowed to use.
>
> So this really should be `generic_alloc_nodedata(nid, gfp_mask)'.
>
> However, it's very desirable that memory allocations use GFP_KERNEL rather
> than, say, GFP_ATOMIC. So your interface here _forces_ callers to be in a
> state where GFP_KERNEL is legal, which is good discipline.
>
> Although if that turns out to be a problem, we can expect to see a sad
> little patch from someone which tries to change this to GFP_ATOMIC, which
> makes everything worse - even those callers who _can_ use GFP_KERNEL.
>
> (In practice, NUMA developers seem to never test with sufficient
> CONFIG_DEBUG_* flags enabled, and with CONFIG_PREEMPT, so they happily
> don't get to discover their sleep-in-spinlock bugs anyway).
>
> Anyway, on balance, I think it'd be best to convert this API to take a
> gfp_t as well.
To tell the truth, I would prefer making a new interface that allocates from the newly added
memory over using the normal kzalloc(),
because this kzalloc() will allocate another node's memory
while the new memory's initialization is incomplete.
In addition, memory hot-add may be requested at a time when memory
is already exhausted.
This kzalloc() might be the "finishing blow". (A pgdat is not small.
In particular, ia64 needs a copy of the node data.)
The user will probably find this very strange:
"I added new memory, but the OOM killer was invoked because of it. Why????"
So I think an alloc_hot_added_memory() is desirable, which can allocate from the newly
added memory until its initialization completes.
Thanks for your advice.
--
Yasunori Goto
end of thread, other threads:[~2006-04-21 0:23 UTC | newest]
Thread overview: 11+ messages
2006-04-20 10:03 [Patch: 000/006] pgdat allocation for new node add Yasunori Goto
2006-04-20 10:10 ` [Patch: 001/006] pgdat allocation for new node add (specify node id) Yasunori Goto
2006-04-20 22:49 ` Andrew Morton
2006-04-20 23:38 ` Yasunori Goto
2006-04-20 10:10 ` [Patch: 002/006] pgdat allocation for new node add (get node id by acpi) Yasunori Goto
2006-04-20 10:10 ` [Patch: 003/006] pgdat allocation for new node add (generic alloc node_data) Yasunori Goto
2006-04-20 23:01 ` Andrew Morton
2006-04-21 0:23 ` Yasunori Goto
2006-04-20 10:10 ` [Patch: 004/006] pgdat allocation for new node add (refresh node_data[]) Yasunori Goto
2006-04-20 10:10 ` [Patch: 005/006] pgdat allocation for new node add (export kswapd start func) Yasunori Goto
2006-04-20 10:10 ` [Patch: 006/006] pgdat allocation for new node add (call pgdat allocation) Yasunori Goto