* [Patch] memory unplug v3 [1/4] page isolation
2007-05-22 6:58 [Patch] memory unplug v3 [0/4] KAMEZAWA Hiroyuki
@ 2007-05-22 7:01 ` KAMEZAWA Hiroyuki
2007-05-22 10:19 ` Mel Gorman
2007-05-22 18:38 ` Christoph Lameter
2007-05-22 7:04 ` [Patch] memory unplug v3 [2/4] migration by kernel KAMEZAWA Hiroyuki
` (3 subsequent siblings)
4 siblings, 2 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-22 7:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter
Patch for isolating pages.
'Isolate' means making pages free and never allocated.
This feature helps make a range of pages unused.
This patch is based on Mel's page grouping method.
This patch adds MIGRATE_ISOLATE to the migrate types. As a result,
- MIGRATE_TYPES increases.
- the bitmap for the migratetype is enlarged.
If isolate_pages(start, end) is called,
- the migratetype of the range is changed to MIGRATE_ISOLATE if
its current type is MIGRATE_MOVABLE or MIGRATE_RESERVE.
- MIGRATE_ISOLATE is not on the migratetype fallback list.
Then, pages of this migratetype will not be allocated even if they are free.
Now, isolate_pages() can only handle ranges aligned to MAX_ORDER.
This can be adjusted if necessary...maybe.
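
For illustration, a rough sketch of how a caller (e.g. a hot-remove path)
might use this interface; the function name below is hypothetical and the
migration of in-use pages is left out:

#include <linux/mm.h>
#include <linux/page-isolation.h>

/* hypothetical caller: try to make [start_pfn, end_pfn) unused */
static int try_make_range_unused(unsigned long start_pfn, unsigned long end_pfn)
{
	int ret;

	/* mark the blocks MIGRATE_ISOLATE so freed pages are never reused */
	ret = isolate_pages(start_pfn, end_pfn);
	if (ret)
		return ret;	/* -EBUSY: a block was not MOVABLE/RESERVE */

	/* ... migrate the still-used pages out of the range here ... */

	/* note: this returns non-zero if some page is NOT isolated yet */
	if (test_pages_isolated(start_pfn, end_pfn)) {
		free_isolated_pages(start_pfn, end_pfn); /* give range back */
		return -EBUSY;
	}
	return 0;	/* range is free and will never be allocated from */
}
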
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Index: devel-2.6.22-rc1-mm1/include/linux/mmzone.h
===================================================================
--- devel-2.6.22-rc1-mm1.orig/include/linux/mmzone.h 2007-05-22 14:30:43.000000000 +0900
+++ devel-2.6.22-rc1-mm1/include/linux/mmzone.h 2007-05-22 15:12:28.000000000 +0900
@@ -35,11 +35,12 @@
*/
#define PAGE_ALLOC_COSTLY_ORDER 3
-#define MIGRATE_UNMOVABLE 0
-#define MIGRATE_RECLAIMABLE 1
-#define MIGRATE_MOVABLE 2
-#define MIGRATE_RESERVE 3
-#define MIGRATE_TYPES 4
+#define MIGRATE_UNMOVABLE 0 /* not reclaimable pages */
+#define MIGRATE_RECLAIMABLE 1 /* shrink_xxx routine can reap this */
+#define MIGRATE_MOVABLE 2 /* migrate_page can migrate this */
+#define MIGRATE_RESERVE 3 /* no type yet */
+#define MIGRATE_ISOLATE 4 /* never allocated from */
+#define MIGRATE_TYPES 5
#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
Index: devel-2.6.22-rc1-mm1/include/linux/pageblock-flags.h
===================================================================
--- devel-2.6.22-rc1-mm1.orig/include/linux/pageblock-flags.h 2007-05-22 14:30:43.000000000 +0900
+++ devel-2.6.22-rc1-mm1/include/linux/pageblock-flags.h 2007-05-22 15:12:28.000000000 +0900
@@ -31,7 +31,7 @@
/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
- PB_range(PB_migrate, 2), /* 2 bits required for migrate types */
+ PB_range(PB_migrate, 3), /* 3 bits required for migrate types */
NR_PAGEBLOCK_BITS
};
Index: devel-2.6.22-rc1-mm1/mm/page_alloc.c
===================================================================
--- devel-2.6.22-rc1-mm1.orig/mm/page_alloc.c 2007-05-22 14:30:43.000000000 +0900
+++ devel-2.6.22-rc1-mm1/mm/page_alloc.c 2007-05-22 15:12:28.000000000 +0900
@@ -41,6 +41,7 @@
#include <linux/pfn.h>
#include <linux/backing-dev.h>
#include <linux/fault-inject.h>
+#include <linux/page-isolation.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1056,6 +1057,7 @@
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
+ unsigned long migrate_type;
if (PageAnon(page))
page->mapping = NULL;
@@ -1064,6 +1066,12 @@
if (!PageHighMem(page))
debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
+
+ migrate_type = get_pageblock_migratetype(page);
+ if (migrate_type == MIGRATE_ISOLATE) {
+ __free_pages_ok(page, 0);
+ return;
+ }
arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);
@@ -1071,7 +1079,7 @@
local_irq_save(flags);
__count_vm_event(PGFREE);
list_add(&page->lru, &pcp->list);
- set_page_private(page, get_pageblock_migratetype(page));
+ set_page_private(page, migrate_type);
pcp->count++;
if (pcp->count >= pcp->high) {
free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
@@ -4389,3 +4397,53 @@
else
__clear_bit(bitidx + start_bitidx, bitmap);
}
+
+/*
+ * set/clear page block's type to be ISOLATE.
+ * page allocator never allocates memory from an ISOLATE block.
+ */
+
+int is_page_isolated(struct page *page)
+{
+ if ((page_count(page) == 0) &&
+ (get_pageblock_migratetype(page) == MIGRATE_ISOLATE))
+ return 1;
+ return 0;
+}
+
+int set_migratetype_isolate(struct page *page)
+{
+ struct zone *zone;
+ unsigned long flags;
+ int migrate_type;
+ int ret = -EBUSY;
+
+ zone = page_zone(page);
+ spin_lock_irqsave(&zone->lock, flags);
+ migrate_type = get_pageblock_migratetype(page);
+ if ((migrate_type != MIGRATE_MOVABLE) &&
+ (migrate_type != MIGRATE_RESERVE))
+ goto out;
+ set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+ move_freepages_block(zone, page, MIGRATE_ISOLATE);
+ ret = 0;
+out:
+ spin_unlock_irqrestore(&zone->lock, flags);
+ if (!ret)
+ drain_all_local_pages();
+ return ret;
+}
+
+void clear_migratetype_isolate(struct page *page)
+{
+ struct zone *zone;
+ unsigned long flags;
+ zone = page_zone(page);
+ spin_lock_irqsave(&zone->lock, flags);
+ if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
+ goto out;
+ set_pageblock_migratetype(page, MIGRATE_RESERVE);
+ move_freepages_block(zone, page, MIGRATE_RESERVE);
+out:
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
Index: devel-2.6.22-rc1-mm1/mm/page_isolation.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ devel-2.6.22-rc1-mm1/mm/page_isolation.c 2007-05-22 15:12:28.000000000 +0900
@@ -0,0 +1,67 @@
+/*
+ * linux/mm/page_isolation.c
+ */
+
+#include <stddef.h>
+#include <linux/mm.h>
+#include <linux/page-isolation.h>
+
+#define ROUND_DOWN(x,y) ((x) & ~((y) - 1))
+#define ROUND_UP(x,y) (((x) + (y) -1) & ~((y) - 1))
+int
+isolate_pages(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn, start_pfn_aligned, end_pfn_aligned;
+ unsigned long undo_pfn;
+
+ start_pfn_aligned = ROUND_DOWN(start_pfn, NR_PAGES_ISOLATION_BLOCK);
+ end_pfn_aligned = ROUND_UP(end_pfn, NR_PAGES_ISOLATION_BLOCK);
+
+ for (pfn = start_pfn_aligned;
+ pfn < end_pfn_aligned;
+ pfn += NR_PAGES_ISOLATION_BLOCK)
+ if (set_migratetype_isolate(pfn_to_page(pfn))) {
+ undo_pfn = pfn;
+ goto undo;
+ }
+ return 0;
+undo:
+ for (pfn = start_pfn_aligned;
+ pfn <= undo_pfn;
+ pfn += NR_PAGES_ISOLATION_BLOCK)
+ clear_migratetype_isolate(pfn_to_page(pfn));
+
+ return -EBUSY;
+}
+
+
+int
+free_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn, start_pfn_aligned, end_pfn_aligned;
+ start_pfn_aligned = ROUND_DOWN(start_pfn, NR_PAGES_ISOLATION_BLOCK);
+ end_pfn_aligned = ROUND_UP(end_pfn, NR_PAGES_ISOLATION_BLOCK);
+
+ for (pfn = start_pfn_aligned;
+ pfn < end_pfn_aligned;
+ pfn += MAX_ORDER_NR_PAGES)
+ clear_migratetype_isolate(pfn_to_page(pfn));
+ return 0;
+}
+
+int
+test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ int ret = 0;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ if (!is_page_isolated(pfn_to_page(pfn))) {
+ ret = 1;
+ break;
+ }
+ }
+ return ret;
+}
Index: devel-2.6.22-rc1-mm1/include/linux/page-isolation.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ devel-2.6.22-rc1-mm1/include/linux/page-isolation.h 2007-05-22 15:12:28.000000000 +0900
@@ -0,0 +1,47 @@
+#ifndef __LINUX_PAGEISOLATION_H
+#define __LINUX_PAGEISOLATION_H
+/*
+ * Define an interface for capturing and isolating some amount of
+ * contiguous pages.
+ * isolated pages are freed but will never be allocated until they are
+ * pushed back.
+ *
+ * This isolation function requires some alignment.
+ */
+
+#define PAGE_ISOLATION_ORDER (MAX_ORDER - 1)
+#define NR_PAGES_ISOLATION_BLOCK (1 << PAGE_ISOLATION_ORDER)
+
+/*
+ * set page isolation range.
+ * If specified range includes migrate types other than MOVABLE,
+ * this will fail with -EBUSY.
+ */
+extern int
+isolate_pages(unsigned long start_pfn, unsigned long end_pfn);
+
+/*
+ * Free all isolated memory and push back them as MIGRATE_RESERVE type.
+ */
+extern int
+free_isolated_pages(unsigned long start_pfn, unsigned long end_pfn);
+
+/*
+ * test all pages are isolated or not.
+ */
+extern int
+test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
+
+/* test routine for check page is isolated or not */
+extern int is_page_isolated(struct page *page);
+
+/*
+ * Internal funcs.
+ * Changes pageblock's migrate type
+ */
+extern int set_migratetype_isolate(struct page *page);
+extern void clear_migratetype_isolate(struct page *page);
+extern int __is_page_isolated(struct page *page);
+
+
+#endif
Index: devel-2.6.22-rc1-mm1/mm/Makefile
===================================================================
--- devel-2.6.22-rc1-mm1.orig/mm/Makefile 2007-05-22 14:30:43.000000000 +0900
+++ devel-2.6.22-rc1-mm1/mm/Makefile 2007-05-22 15:12:28.000000000 +0900
@@ -11,7 +11,7 @@
page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
- $(mmu-y)
+ page_isolation.o $(mmu-y)
ifeq ($(CONFIG_MMU)$(CONFIG_BLOCK),yy)
obj-y += bounce.o
* Re: [Patch] memory unplug v3 [1/4] page isolation
2007-05-22 7:01 ` [Patch] memory unplug v3 [1/4] page isolation KAMEZAWA Hiroyuki
@ 2007-05-22 10:19 ` Mel Gorman
2007-05-22 11:01 ` KAMEZAWA Hiroyuki
2007-05-22 18:38 ` Christoph Lameter
1 sibling, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2007-05-22 10:19 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, y-goto, clameter
On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
> Patch for isolating pages.
> 'Isolate' means making pages free and never allocated.
> This feature helps make a range of pages unused.
>
> This patch is based on Mel's page grouping method.
>
> This patch add MIGRATE_ISOLATE to MIGRATE_TYPES. By this
> - MIGRATE_TYPES increases.
> - bitmap for migratetype is enlarged.
>
Both correct.
> If isolate_pages(start,end) is called,
> - migratetype of the range turns to be MIGRATE_ISOLATE if
> its current type is MIGRATE_MOVABLE or MIGRATE_RESERVE.
Why not MIGRATE_RECLAIMABLE as well?
> - MIGRATE_ISOLATE is not on migratetype fallback list.
>
> Then, pages of this migratetype will not be allocated even if it is free.
>
> Now, isolate_pages() only can treat the range aligned to MAX_ORDER.
> This can be adjusted if necessary...maybe.
>
I have a patch ready that groups pages by an arbitrary order. Right now it
is related to the size of the huge page on the system but it's a single
variable pageblock_order that determines the range. You may find you want
to adjust this value.
> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
> Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Index: devel-2.6.22-rc1-mm1/include/linux/mmzone.h
> ===================================================================
> --- devel-2.6.22-rc1-mm1.orig/include/linux/mmzone.h 2007-05-22 14:30:43.000000000 +0900
> +++ devel-2.6.22-rc1-mm1/include/linux/mmzone.h 2007-05-22 15:12:28.000000000 +0900
> @@ -35,11 +35,12 @@
> */
> #define PAGE_ALLOC_COSTLY_ORDER 3
>
> -#define MIGRATE_UNMOVABLE 0
> -#define MIGRATE_RECLAIMABLE 1
> -#define MIGRATE_MOVABLE 2
> -#define MIGRATE_RESERVE 3
> -#define MIGRATE_TYPES 4
> +#define MIGRATE_UNMOVABLE 0 /* not reclaimable pages */
> +#define MIGRATE_RECLAIMABLE 1 /* shrink_xxx routine can reap this */
> +#define MIGRATE_MOVABLE 2 /* migrate_page can migrate this */
> +#define MIGRATE_RESERVE 3 /* no type yet */
MIGRATE_RESERVE is where the min_free_kbytes pages are kept if possible
and the number of RESERVE blocks depends on the value of it. It is only
allocated from if the alternative is to fail the allocation so this
comment should read
/* min_free_kbytes free pages here */
Later we may find a way of using MIGRATE_RESERVE to isolate ranges but
it's not necessary now because it would obscure how the patch works.
> +#define MIGRATE_ISOLATE 4 /* never allocated from */
> +#define MIGRATE_TYPES 5
>
The documentation changes probably belong in a separate patch but thanks,
it nudges me again into getting around to it.
> #define for_each_migratetype_order(order, type) \
> for (order = 0; order < MAX_ORDER; order++) \
> Index: devel-2.6.22-rc1-mm1/include/linux/pageblock-flags.h
> ===================================================================
> --- devel-2.6.22-rc1-mm1.orig/include/linux/pageblock-flags.h 2007-05-22 14:30:43.000000000 +0900
> +++ devel-2.6.22-rc1-mm1/include/linux/pageblock-flags.h 2007-05-22 15:12:28.000000000 +0900
> @@ -31,7 +31,7 @@
>
> /* Bit indices that affect a whole block of pages */
> enum pageblock_bits {
> - PB_range(PB_migrate, 2), /* 2 bits required for migrate types */
> + PB_range(PB_migrate, 3), /* 3 bits required for migrate types */
Right.
> NR_PAGEBLOCK_BITS
> };
>
> Index: devel-2.6.22-rc1-mm1/mm/page_alloc.c
> ===================================================================
> --- devel-2.6.22-rc1-mm1.orig/mm/page_alloc.c 2007-05-22 14:30:43.000000000 +0900
> +++ devel-2.6.22-rc1-mm1/mm/page_alloc.c 2007-05-22 15:12:28.000000000 +0900
> @@ -41,6 +41,7 @@
> #include <linux/pfn.h>
> #include <linux/backing-dev.h>
> #include <linux/fault-inject.h>
> +#include <linux/page-isolation.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -1056,6 +1057,7 @@
> struct zone *zone = page_zone(page);
> struct per_cpu_pages *pcp;
> unsigned long flags;
> + unsigned long migrate_type;
>
> if (PageAnon(page))
> page->mapping = NULL;
> @@ -1064,6 +1066,12 @@
>
> if (!PageHighMem(page))
> debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
> +
> + migrate_type = get_pageblock_migratetype(page);
> + if (migrate_type == MIGRATE_ISOLATE) {
> + __free_pages_ok(page, 0);
> + return;
> + }
This change to the PCP allocator may be unnecessary. If you let pages
free to the pcp lists, they will never be allocated from there because
allocflags_to_migratetype() will never return MIGRATE_ISOLATE. What you
could do is drain the PCP lists just before you try to hot-remove or call
test_pages_isolated() so that the pcp pages will free back to the
MIGRATE_ISOLATE lists.
The extra drain is undesirable but probably better than checking for
isolate every time a free occurs to the pcp lists.
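
On the caller side that would be roughly (just a sketch, assuming the
free_hot_cold_page() hook above is dropped; the helper name is made up):

/* flush pcp lists, then check whether the whole range is isolated */
static int range_is_fully_isolated(unsigned long start_pfn, unsigned long end_pfn)
{
	/*
	 * Freed pages sitting on pcp lists go back to the buddy lists here;
	 * in an isolated block they land on the MIGRATE_ISOLATE freelist.
	 */
	drain_all_local_pages();
	/* test_pages_isolated() currently returns non-zero if NOT isolated */
	return !test_pages_isolated(start_pfn, end_pfn);
}
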
> arch_free_page(page, 0);
> kernel_map_pages(page, 1, 0);
>
> @@ -1071,7 +1079,7 @@
> local_irq_save(flags);
> __count_vm_event(PGFREE);
> list_add(&page->lru, &pcp->list);
> - set_page_private(page, get_pageblock_migratetype(page));
> + set_page_private(page, migrate_type);
> pcp->count++;
> if (pcp->count >= pcp->high) {
> free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
> @@ -4389,3 +4397,53 @@
> else
> __clear_bit(bitidx + start_bitidx, bitmap);
> }
> +
> +/*
> + * set/clear page block's type to be ISOLATE.
> + * page allocator never allocates memory from an ISOLATE block.
> + */
> +
> +int is_page_isolated(struct page *page)
> +{
> + if ((page_count(page) == 0) &&
> + (get_pageblock_migratetype(page) == MIGRATE_ISOLATE))
(PageBuddy(page) || (page_count(page) == 0 && PagePrivate(page))) &&
(get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
PageBuddy(page) for free pages and page_count(page) with PagePrivate
should indicate pages that are on the pcp lists.
As you currently prevent ISOLATE pages going to the pcp lists, only the
PageBuddy() check is necessary right now. If you drop that and drain before
you check for isolated pages, the PageBuddy() check is still enough; if you
choose to leave pages on the pcp lists until a drain occurs, then you need
the second check as well.
This page_count() check instead of PageBuddy() appears to be related to
how test_pages_isolated() is implemented - more on that later.
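
As a rough, untested sketch, the check being suggested would look like:

int is_page_isolated(struct page *page)
{
	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
		return 0;
	if (PageBuddy(page))		/* free page in the buddy lists */
		return 1;
	if (page_count(page) == 0 && PagePrivate(page))
		return 1;		/* free page still on a pcp list */
	return 0;
}
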
> + return 1;
> + return 0;
> +}
> +
> +int set_migratetype_isolate(struct page *page)
> +{
set_pageblock_isolate() maybe to match set_pageblock_migratetype() naming?
> + struct zone *zone;
> + unsigned long flags;
> + int migrate_type;
> + int ret = -EBUSY;
> +
> + zone = page_zone(page);
> + spin_lock_irqsave(&zone->lock, flags);
It may be more appropriate to have the caller take this lock. More later
in isolate_pages().
> + migrate_type = get_pageblock_migratetype(page);
> + if ((migrate_type != MIGRATE_MOVABLE) &&
> + (migrate_type != MIGRATE_RESERVE))
> + goto out;
and maybe MIGRATE_RECLAIMABLE here particularly in view of Christoph's
work with kmem_cache_vacate().
> + set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> + move_freepages_block(zone, page, MIGRATE_ISOLATE);
> + ret = 0;
> +out:
> + spin_unlock_irqrestore(&zone->lock, flags);
> + if (!ret)
> + drain_all_local_pages();
It's not clear why you drain the pcp lists when you encounter a block of
the wrong migrate_type. Draining the pcp lists is unlikely to help you.
> + return ret;
> +}
> +
> +void clear_migratetype_isolate(struct page *page)
> +{
> + struct zone *zone;
> + unsigned long flags;
> + zone = page_zone(page);
> + spin_lock_irqsave(&zone->lock, flags);
> + if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
> + goto out;
> + set_pageblock_migratetype(page, MIGRATE_RESERVE);
> + move_freepages_block(zone, page, MIGRATE_RESERVE);
MIGRATE_RESERVE is likely not what you want to do here. The number of
MIGRATE_RESERVE blocks in a zone is determined by
setup_zone_migrate_reserve(). If you are setting blocks like this, then
you need to call setup_zone_migrate_reserve() with the zone->lru_lock held
after you have called clear_migratetype_isolate() for all the necessary
blocks.
It may be easier to just set the blocks MIGRATE_MOVABLE.
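
i.e. roughly (an untested sketch of the function above with that change):

void clear_migratetype_isolate(struct page *page)
{
	struct zone *zone = page_zone(page);
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE) {
		/* hand the block back to the movable pool */
		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
		move_freepages_block(zone, page, MIGRATE_MOVABLE);
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}
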
> +out:
> + spin_unlock_irqrestore(&zone->lock, flags);
> +}
> Index: devel-2.6.22-rc1-mm1/mm/page_isolation.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ devel-2.6.22-rc1-mm1/mm/page_isolation.c 2007-05-22 15:12:28.000000000 +0900
> @@ -0,0 +1,67 @@
> +/*
> + * linux/mm/page_isolation.c
> + */
> +
> +#include <stddef.h>
> +#include <linux/mm.h>
> +#include <linux/page-isolation.h>
> +
> +#define ROUND_DOWN(x,y) ((x) & ~((y) - 1))
> +#define ROUND_UP(x,y) (((x) + (y) -1) & ~((y) - 1))
A roundup() macro already exists in kernel.h. You may want to use that and
define a new rounddown() macro there instead.
> +int
> +isolate_pages(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned long pfn, start_pfn_aligned, end_pfn_aligned;
> + unsigned long undo_pfn;
> +
> + start_pfn_aligned = ROUND_DOWN(start_pfn, NR_PAGES_ISOLATION_BLOCK);
> + end_pfn_aligned = ROUND_UP(end_pfn, NR_PAGES_ISOLATION_BLOCK);
> +
> + for (pfn = start_pfn_aligned;
> + pfn < end_pfn_aligned;
> + pfn += NR_PAGES_ISOLATION_BLOCK)
> + if (set_migratetype_isolate(pfn_to_page(pfn))) {
You will need to call pfn_valid() in the non-SPARSEMEM case before calling
pfn_to_page() or this will crash in some circumstances.
You also need to check zone boundaries. Let's say start_pfn is the start of
a non-MAX_ORDER aligned zone. Aligning it could make you start isolating
in the wrong zone - perhaps this is intentional, I don't know.
> + undo_pfn = pfn;
> + goto undo;
> + }
> + return 0;
> +undo:
> + for (pfn = start_pfn_aligned;
> + pfn <= undo_pfn;
> + pfn += NR_PAGES_ISOLATION_BLOCK)
> + clear_migratetype_isolate(pfn_to_page(pfn));
> +
We fail if we encounter any non-MIGRATE_MOVABLE block in the start_pfn to
end_pfn range but at that point we've done a lot of work. We also take and
release an interrupt safe lock for each NR_PAGES_ISOLATION_BLOCK block
because set_migratetype_isolate() is responsible for lock taking.
It might be better if you took the lock here, scanned first to make sure
all the blocks were suitable for isolation and only then, call
set_migratetype_isolate() for each of them before releasing the lock.
That would take the lock once and avoid the need for back-out code that
changes all the MIGRATE types in the range. Even for large ranges of
memory, it should not be too long to be holding a lock particularly in
this path.
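
A rough sketch of that structure (untested; it assumes an already aligned,
pfn_valid range that lies inside a single zone):

int isolate_pages(unsigned long start_pfn, unsigned long end_pfn)
{
	struct zone *zone = page_zone(pfn_to_page(start_pfn));
	unsigned long pfn, flags;
	int type, ret = -EBUSY;

	spin_lock_irqsave(&zone->lock, flags);
	/* pass 1: make sure every block can be isolated before touching any */
	for (pfn = start_pfn; pfn < end_pfn; pfn += NR_PAGES_ISOLATION_BLOCK) {
		type = get_pageblock_migratetype(pfn_to_page(pfn));
		if (type != MIGRATE_MOVABLE && type != MIGRATE_RESERVE)
			goto out;
	}
	/* pass 2: all blocks are suitable, so no back-out path is needed */
	for (pfn = start_pfn; pfn < end_pfn; pfn += NR_PAGES_ISOLATION_BLOCK) {
		set_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_ISOLATE);
		move_freepages_block(zone, pfn_to_page(pfn), MIGRATE_ISOLATE);
	}
	ret = 0;
out:
	spin_unlock_irqrestore(&zone->lock, flags);
	return ret;
}
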
> + return -EBUSY;
> +}
> +
> +
> +int
> +free_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned long pfn, start_pfn_aligned, end_pfn_aligned;
> + start_pfn_aligned = ROUND_DOWN(start_pfn, NR_PAGES_ISOLATION_BLOCK);
> + end_pfn_aligned = ROUND_UP(end_pfn, NR_PAGES_ISOLATION_BLOCK);
spaces instead of tabs there before end_pfn_aligned.
> +
> + for (pfn = start_pfn_aligned;
> + pfn < end_pfn_aligned;
> + pfn += MAX_ORDER_NR_PAGES)
pfn += NR_PAGES_ISOLATION_BLOCK ?
pfn_valid() ?
> + clear_migratetype_isolate(pfn_to_page(pfn));
> + return 0;
> +}
> +
> +int
> +test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned long pfn;
> + int ret = 0;
> +
You didn't align here, intentional?
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + if (!pfn_valid(pfn))
> + continue;
> + if (!is_page_isolated(pfn_to_page(pfn))) {
> + ret = 1;
> + break;
> + }
If the page is isolated, it's free and assuming you've drained the pcp
lists, it will have PageBuddy() set. In that case, you should be checking
what order the page is free at and skipping forward that number of pages.
I am guessing this pfn++ walk here is why you are checking
page_count(page) == 0 in is_page_isolated() instead of PageBuddy()
> + }
> + return ret;
The return value is a little counter-intuitive. It returns 1 if they are
not isolated. I would expect it to return 1 if isolated like test_bit()
returns 1 if it's set.
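
Putting the last two points together, the walk might look roughly like this
(untested sketch; a free buddy page keeps its order in page_private(), which
is what page_alloc.c reads internally):

int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn = start_pfn;
	struct page *page;

	while (pfn < end_pfn) {
		if (!pfn_valid(pfn)) {
			pfn++;
			continue;
		}
		page = pfn_to_page(pfn);
		if (!PageBuddy(page) ||
		    get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
			return 0;	/* something in the range is not isolated */
		/* skip the whole free buddy chunk at once */
		pfn += 1UL << page_private(page);
	}
	return 1;			/* every page in the range is isolated */
}
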
> +}
> Index: devel-2.6.22-rc1-mm1/include/linux/page-isolation.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ devel-2.6.22-rc1-mm1/include/linux/page-isolation.h 2007-05-22 15:12:28.000000000 +0900
> @@ -0,0 +1,47 @@
> +#ifndef __LINUX_PAGEISOLATION_H
> +#define __LINUX_PAGEISOLATION_H
> +/*
> + * Define an interface for capturing and isolating some amount of
> + * contiguous pages.
> + * isolated pages are freed but will never be allocated until they are
> + * pushed back.
> + *
> + * This isolation function requires some alignment.
> + */
> +
> +#define PAGE_ISOLATION_ORDER (MAX_ORDER - 1)
> +#define NR_PAGES_ISOLATION_BLOCK (1 << PAGE_ISOLATION_ORDER)
> +
When grouping-pages-by-arbitrary-order goes in, there will be values
available called pageblock_order and nr_pages_pageblock which will be
identical to these two values.
> +/*
> + * set page isolation range.
> + * If specified range includes migrate types other than MOVABLE,
> + * this will fail with -EBUSY.
> + */
> +extern int
> +isolate_pages(unsigned long start_pfn, unsigned long end_pfn);
> +
> +/*
> + * Free all isolated memory and push back them as MIGRATE_RESERVE type.
> + */
> +extern int
> +free_isolated_pages(unsigned long start_pfn, unsigned long end_pfn);
> +
> +/*
> + * test all pages are isolated or not.
> + */
> +extern int
> +test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
> +
> +/* test routine for check page is isolated or not */
> +extern int is_page_isolated(struct page *page);
> +
> +/*
> + * Internal funcs.
> + * Changes pageblock's migrate type
> + */
> +extern int set_migratetype_isolate(struct page *page);
> +extern void clear_migratetype_isolate(struct page *page);
> +extern int __is_page_isolated(struct page *page);
> +
> +
> +#endif
> Index: devel-2.6.22-rc1-mm1/mm/Makefile
> ===================================================================
> --- devel-2.6.22-rc1-mm1.orig/mm/Makefile 2007-05-22 14:30:43.000000000 +0900
> +++ devel-2.6.22-rc1-mm1/mm/Makefile 2007-05-22 15:12:28.000000000 +0900
> @@ -11,7 +11,7 @@
> page_alloc.o page-writeback.o pdflush.o \
> readahead.o swap.o truncate.o vmscan.o \
> prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
> - $(mmu-y)
> + page_isolation.o $(mmu-y)
>
> ifeq ($(CONFIG_MMU)$(CONFIG_BLOCK),yy)
> obj-y += bounce.o
>
All in all, I like this implementation. I found it nice and relatively
straight-forward to read. Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [Patch] memory unplug v3 [1/4] page isolation
2007-05-22 10:19 ` Mel Gorman
@ 2007-05-22 11:01 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-22 11:01 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, y-goto, clameter
On Tue, 22 May 2007 11:19:27 +0100 (IST)
Mel Gorman <mel@csn.ul.ie> wrote:
> > If isolate_pages(start,end) is called,
> > - migratetype of the range turns to be MIGRATE_ISOLATE if
> > its current type is MIGRATE_MOVABLE or MIGRATE_RESERVE.
>
> Why not MIGRATE_RECLAIMABLE as well?
>
To allow that, I would have to implement page_reclaim_range(start_pfn, end_pfn).
For now, I just use migration.
I'll consider it as my future work.
Maybe Christoph's work will help me.
> > - MIGRATE_ISOLATE is not on migratetype fallback list.
> >
> > Then, pages of this migratetype will not be allocated even if it is free.
> >
> > Now, isolate_pages() only can treat the range aligned to MAX_ORDER.
> > This can be adjusted if necessary...maybe.
> >
>
> I have a patch ready that groups pages by an arbitrary order. Right now it
> is related to the size of the huge page on the system but it's a single
> variable pageblock_order that determines the range. You may find you want
> to adjust this value.
>
I see. I'll support it in patches for next -mm.
> > +#define MIGRATE_UNMOVABLE 0 /* not reclaimable pages */
> > +#define MIGRATE_RECLAIMABLE 1 /* shrink_xxx routine can reap this */
> > +#define MIGRATE_MOVABLE 2 /* migrate_page can migrate this */
> > +#define MIGRATE_RESERVE 3 /* no type yet */
>
> MIGRATE_RESERVE is where the min_free_kbytes pages are kept if possible
> and the number of RESERVE blocks depends on the value of it. It is only
> allocated from if the alternative is to fail the allocation so this
> comment should read
>
> /* min_free_kbytes free pages here */
>
ok.
> Later we may find a way of using MIGRATE_RESERVE to isolate ranges but
> it's not necessary now because it would obscure how the patch works.
>
> > +#define MIGRATE_ISOLATE 4 /* never allocated from */
> > +#define MIGRATE_TYPES 5
> >
>
> The documentation changes probably belong in a separate patch but thanks,
> it nudges me again into getting around to it.
>
Ok, I'll just keep the comment for MIGRATE_ISOLATE.
>
> > +
> > + migrate_type = get_pageblock_migratetype(page);
> > + if (migrate_type == MIGRATE_ISOLATE) {
> > + __free_pages_ok(page, 0);
> > + return;
> > + }
>
> This change to the PCP allocator may be unnecessary. If you let the page
> free to the pcp lists, they will never be allocated from there because
> allocflags_to_migratetype() will never return MIGRATE_ISOLATE. What you
> could do is drain the PCP lists just before you try to hot-remove or call
> test_pages_isolated() so that the pcp pages will free back to the
> MIGRATE_ISOLATE lists.
>
Ah.. thanks. I'll remove this.
> The extra drain is undesirable but probably better than checking for
> isolate every time a free occurs to the pcp lists.
>
yes.
>
> > +/*
> > + * set/clear page block's type to be ISOLATE.
> > + * page allocator never allocates memory from an ISOLATE block.
> > + */
> > +
> > +int is_page_isolated(struct page *page)
> > +{
> > + if ((page_count(page) == 0) &&
> > + (get_pageblock_migratetype(page) == MIGRATE_ISOLATE))
>
> (PageBuddy(page) || (page_count(page) == 0 && PagePrivate(page))) &&
> (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
>
> PageBuddy(page) for free pages and page_count(page) with PagePrivate
> should indicate pages that are on the pcp lists.
>
> As you currently prevent ISOLATE pages going to the pcp lists, only the
> PageBuddy check is necessary right now but If you drain before you check
> for isolated pages, you only need the PageBuddy() check. If you choose to
> let pages on the pcp lists until a drain occurs, then you need the second
> check.
>
> This page_count() check instead of PageBuddy() appears to be related to
> how test_pages_isolated() is implemented - more on that later.
>
PG_buddy is set only if the page is linked to a freelist. IOW, if the page
is not the head of its buddy chunk, PG_buddy is not set.
So, I didn't use PageBuddy().
(*) If I used PG_buddy to check whether a page is free or not, I would have
to search for the head of its buddy chunk and its order.
> > + return 1;
> > + return 0;
> > +}
> > +
> > +int set_migratetype_isolate(struct page *page)
> > +{
>
> set_pageblock_isolate() maybe to match set_pageblock_migratetype() naming?
>
> > + struct zone *zone;
> > + unsigned long flags;
> > + int migrate_type;
> > + int ret = -EBUSY;
> > +
> > + zone = page_zone(page);
> > + spin_lock_irqsave(&zone->lock, flags);
>
> It may be more appropriate to have the caller take this lock. More later
> in isolate_pages().
>
ok.
> > + migrate_type = get_pageblock_migratetype(page);
> > + if ((migrate_type != MIGRATE_MOVABLE) &&
> > + (migrate_type != MIGRATE_RESERVE))
> > + goto out;
>
> and maybe MIGRATE_RECLAIMABLE here particularly in view of Christoph's
> work with kmem_cache_vacate().
>
ok. I'll look into.
> > + set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> > + move_freepages_block(zone, page, MIGRATE_ISOLATE);
> > + ret = 0;
> > +out:
> > + spin_unlock_irqrestore(&zone->lock, flags);
> > + if (!ret)
> > + drain_all_local_pages();
>
> It's not clear why you drain the pcp lists when you encounter a block of
> the wrong migrate_type. Draining the pcp lists is unlikely to help you.
>
Ah, drain_all_local_pages() is called when MIGRATE_ISOLATE is successfully set.
But I'll change this because I'll remove the hook in free_hot_cold_page() and call
drain_all_local_pages() somewhere else.
> > + return ret;
> > +}
> > +
> > +void clear_migratetype_isolate(struct page *page)
> > +{
> > + struct zone *zone;
> > + unsigned long flags;
> > + zone = page_zone(page);
> > + spin_lock_irqsave(&zone->lock, flags);
> > + if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
> > + goto out;
> > + set_pageblock_migratetype(page, MIGRATE_RESERVE);
> > + move_freepages_block(zone, page, MIGRATE_RESERVE);
>
> MIGRATE_RESERVE is likely not what you want to do here. The number of
> MIGRATE_RESERVE blocks in a zone is determined by
> setup_zone_migrate_reserve(). If you are setting blocks like this, then
> you need to call setup_zone_migrate_reserve() with the zone->lru_lock held
> after you have called clear_migratetype_isolate() for all the necessary
> blocks.
>
> It may be easier to just set the blocks MIGRATE_MOVABLE.
>
Ok.
> > +out:
> > + spin_unlock_irqrestore(&zone->lock, flags);
> > +}
> > Index: devel-2.6.22-rc1-mm1/mm/page_isolation.c
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ devel-2.6.22-rc1-mm1/mm/page_isolation.c 2007-05-22 15:12:28.000000000 +0900
> > @@ -0,0 +1,67 @@
> > +/*
> > + * linux/mm/page_isolation.c
> > + */
> > +
> > +#include <stddef.h>
> > +#include <linux/mm.h>
> > +#include <linux/page-isolation.h>
> > +
> > +#define ROUND_DOWN(x,y) ((x) & ~((y) - 1))
> > +#define ROUND_UP(x,y) (((x) + (y) -1) & ~((y) - 1))
>
> A roundup() macro already exists in kernel.h. You may want to use that and
> define a new rounddown() macro there instead.
Oh...I couldn't find it. thank you.
>
> > +int
> > +isolate_pages(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + unsigned long pfn, start_pfn_aligned, end_pfn_aligned;
> > + unsigned long undo_pfn;
> > +
> > + start_pfn_aligned = ROUND_DOWN(start_pfn, NR_PAGES_ISOLATION_BLOCK);
> > + end_pfn_aligned = ROUND_UP(end_pfn, NR_PAGES_ISOLATION_BLOCK);
> > +
> > + for (pfn = start_pfn_aligned;
> > + pfn < end_pfn_aligned;
> > + pfn += NR_PAGES_ISOLATION_BLOCK)
> > + if (set_migratetype_isolate(pfn_to_page(pfn))) {
>
> You will need to call pfn_valid() in the non-SPARSEMEM case before calling
> pfn_to_page() or this will crash in some circumstances.
ok.
>
> You also need to check zone boundaries. Let's say start_pfn is the start of
> a non-MAX_ORDER aligned zone. Aligning it could make you start isolating
> in the wrong zone - perhaps this is intentional, I don't know.
Ah, ok. at least pfn_valid() is necessary.
>
> > + undo_pfn = pfn;
> > + goto undo;
> > + }
> > + return 0;
> > +undo:
> > + for (pfn = start_pfn_aligned;
> > + pfn <= undo_pfn;
> > + pfn += NR_PAGES_ISOLATION_BLOCK)
> > + clear_migratetype_isolate(pfn_to_page(pfn));
> > +
>
> We fail if we encounter any non-MIGRATE_MOVABLE block in the start_pfn to
> end_pfn range but at that point we've done a lot of work. We also take and
> release an interrupt safe lock for each NR_PAGES_ISOLATION_BLOCK block
> because set_migratetype_isolate() is responsible for lock taking.
>
> It might be better if you took the lock here, scanned first to make sure
> all the blocks were suitable for isolation and only then, call
> set_migratetype_isolate() for each of them before releasing the lock.
Hm. ok.
>
> That would take the lock once and avoid the need for back-out code that
> changes all the MIGRATE types in the range. Even for large ranges of
> memory, it should not be too long to be holding a lock particularly in
> this path.
>
> > + return -EBUSY;
> > +}
> > +
> > +
> > +int
> > +free_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + unsigned long pfn, start_pfn_aligned, end_pfn_aligned;
> > + start_pfn_aligned = ROUND_DOWN(start_pfn, NR_PAGES_ISOLATION_BLOCK);
> > + end_pfn_aligned = ROUND_UP(end_pfn, NR_PAGES_ISOLATION_BLOCK);
>
> spaces instead of tabs there before end_pfn_aligned.
>
> > +
> > + for (pfn = start_pfn_aligned;
> > + pfn < end_pfn_aligned;
> > + pfn += MAX_ORDER_NR_PAGES)
>
> pfn += NR_PAGES_ISOLATION_BLOCK ?
>
yes. it should be.
> pfn_valid() ?
>
ok.
> > + clear_migratetype_isolate(pfn_to_page(pfn));
> > + return 0;
> > +}
> > +
> > +int
> > +test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + unsigned long pfn;
> > + int ret = 0;
> > +
>
> You didn't align here, intentional?
>
Ah...no. I'll check alignment in the next version.
> > + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > + if (!pfn_valid(pfn))
> > + continue;
> > + if (!is_page_isolated(pfn_to_page(pfn))) {
> > + ret = 1;
> > + break;
> > + }
>
> If the page is isolated, it's free and assuming you've drained the pcp
> lists, it will have PageBuddy() set. In that case, you should be checking
> what order the page is free at and skipping forward that number of pages.
> I am guessing this pfn++ walk here is why you are checking
> page_count(page) == 0 in is_page_isolated() instead of PageBuddy()
>
Yes. In the next version, I'd like to try to handle the PageBuddy() and page_order() approach.
> > + }
> > + return ret;
>
> The return value is a little counter-intuitive. It returns 1 if they are
> not isolated. I would expect it to return 1 if isolated like test_bit()
> returns 1 if it's set.
>
ok.
> > +#define PAGE_ISOLATION_ORDER (MAX_ORDER - 1)
> > +#define NR_PAGES_ISOLATION_BLOCK (1 << PAGE_ISOLATION_ORDER)
> > +
>
> When grouping-pages-by-arbitrary-order goes in, there will be a value
> available called pageblock_order and nr_pages_pageblock which will be
> identical to these two values.
>
ok.
> All in all, I like this implementation. I found it nice and relatively
> straight-forward to read. Thanks
>
Thank you for the review. I'll reflect your comments in the next version.
-Kame
* Re: [Patch] memory unplug v3 [1/4] page isolation
2007-05-22 7:01 ` [Patch] memory unplug v3 [1/4] page isolation KAMEZAWA Hiroyuki
2007-05-22 10:19 ` Mel Gorman
@ 2007-05-22 18:38 ` Christoph Lameter
2007-05-23 1:41 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2007-05-22 18:38 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto
On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
> Index: devel-2.6.22-rc1-mm1/mm/page_isolation.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ devel-2.6.22-rc1-mm1/mm/page_isolation.c 2007-05-22 15:12:28.000000000 +0900
> @@ -0,0 +1,67 @@
> +/*
> + * linux/mm/page_isolation.c
> + */
> +
> +#include <stddef.h>
> +#include <linux/mm.h>
> +#include <linux/page-isolation.h>
> +
> +#define ROUND_DOWN(x,y) ((x) & ~((y) - 1))
> +#define ROUND_UP(x,y) (((x) + (y) -1) & ~((y) - 1))
Use the common definitions like ALIGN in kernel.h and the rounding
functions in log2.h?
* Re: [Patch] memory unplug v3 [1/4] page isolation
2007-05-22 18:38 ` Christoph Lameter
@ 2007-05-23 1:41 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-23 1:41 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto
On Tue, 22 May 2007 11:38:56 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
>
> > Index: devel-2.6.22-rc1-mm1/mm/page_isolation.c
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ devel-2.6.22-rc1-mm1/mm/page_isolation.c 2007-05-22 15:12:28.000000000 +0900
> > @@ -0,0 +1,67 @@
> > +/*
> > + * linux/mm/page_isolation.c
> > + */
> > +
> > +#include <stddef.h>
> > +#include <linux/mm.h>
> > +#include <linux/page-isolation.h>
> > +
> > +#define ROUND_DOWN(x,y) ((x) & ~((y) - 1))
> > +#define ROUND_UP(x,y) (((x) + (y) -1) & ~((y) - 1))
>
> Use the common definitions like ALIGN in kernel.h and the rounding
> functions in log2.h?
>
Yes. I should do so.
Thanks,
-Kame
* [Patch] memory unplug v3 [2/4] migration by kernel
2007-05-22 6:58 [Patch] memory unplug v3 [0/4] KAMEZAWA Hiroyuki
2007-05-22 7:01 ` [Patch] memory unplug v3 [1/4] page isolation KAMEZAWA Hiroyuki
@ 2007-05-22 7:04 ` KAMEZAWA Hiroyuki
2007-05-22 18:49 ` Christoph Lameter
2007-05-22 7:07 ` [Patch] memory unplug v3 [3/4] page removal KAMEZAWA Hiroyuki
` (2 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-22 7:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter
This patch adds a feature that lets the kernel migrate user pages from its own
context.
Now, sys_migrate(), a system call to migrate pages, works well.
When we want to migrate pages from kernel code, we have 2 approaches:
(a) acquire the mm->sem of a mapper of the target page.
(b) avoid race conditions with additional checks.
This patch implements (b) and adds the following 2 changes:
1. delay freeing an anon_vma while a page which belongs to it is being migrated.
2. check page_mapped() before calling try_to_unmap().
Maybe more checks will be needed. At least, this patch's migrate_pages_nocontext()
works well under heavy memory pressure in my environment.
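
A rough sketch of a kernel-context caller (the allocation callback and all
names below are only illustrative, not part of this patch):

#include <linux/mm.h>
#include <linux/migrate.h>

/* new_page_t callback: where replacement pages come from is up to the caller */
static struct page *unplug_alloc_target(struct page *page, unsigned long private,
					int **result)
{
	return alloc_page(GFP_HIGHUSER);
}

static int migrate_range_away(struct list_head *pagelist)
{
	/* no mapper's mm->sem is held here, so use the nocontext variant */
	return migrate_pages_nocontext(pagelist, unplug_alloc_target, 0);
}
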
Signed-Off-By: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Index: devel-2.6.22-rc1-mm1/mm/Kconfig
===================================================================
--- devel-2.6.22-rc1-mm1.orig/mm/Kconfig 2007-05-22 14:30:39.000000000 +0900
+++ devel-2.6.22-rc1-mm1/mm/Kconfig 2007-05-22 15:12:29.000000000 +0900
@@ -152,6 +152,15 @@
example on NUMA systems to put pages nearer to the processors accessing
the page.
+config MIGRATION_BY_KERNEL
+ bool "Page migration by kernel's page scan"
+ def_bool y
+ depends on MIGRATION
+ help
+ Allows page migration from kernel context. This means page migration
+ can be done by codes other than sys_migrate() system call. Will add
+ some additional check code in page migration.
+
config RESOURCES_64BIT
bool "64 bit Memory and IO resources (EXPERIMENTAL)" if (!64BIT && EXPERIMENTAL)
default 64BIT
Index: devel-2.6.22-rc1-mm1/mm/migrate.c
===================================================================
--- devel-2.6.22-rc1-mm1.orig/mm/migrate.c 2007-05-22 14:30:39.000000000 +0900
+++ devel-2.6.22-rc1-mm1/mm/migrate.c 2007-05-22 15:12:29.000000000 +0900
@@ -607,11 +607,12 @@
* to the newly allocated page in newpage.
*/
static int unmap_and_move(new_page_t get_new_page, unsigned long private,
- struct page *page, int force)
+ struct page *page, int force, int context)
{
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
+ struct anon_vma *anon_vma = NULL;
if (!newpage)
return -ENOMEM;
@@ -632,16 +633,29 @@
goto unlock;
wait_on_page_writeback(page);
}
-
+#ifdef CONFIG_MIGRATION_BY_KERNEL
+ if (PageAnon(page) && context)
+ /* hold this anon_vma until page migration ends */
+ anon_vma = anon_vma_hold(page);
+
+ if (page_mapped(page))
+ try_to_unmap(page, 1);
+#else
/*
* Establish migration ptes or remove ptes
*/
try_to_unmap(page, 1);
+#endif
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
- if (rc)
+ if (rc) {
remove_migration_ptes(page, page);
+ }
+#ifdef CONFIG_MIGRATION_BY_KERNEL
+ if (anon_vma)
+ anon_vma_release(anon_vma);
+#endif
unlock:
unlock_page(page);
@@ -686,8 +700,8 @@
*
* Return: Number of pages not migrated or error code.
*/
-int migrate_pages(struct list_head *from,
- new_page_t get_new_page, unsigned long private)
+int __migrate_pages(struct list_head *from,
+ new_page_t get_new_page, unsigned long private, int context)
{
int retry = 1;
int nr_failed = 0;
@@ -707,7 +721,7 @@
cond_resched();
rc = unmap_and_move(get_new_page, private,
- page, pass > 2);
+ page, pass > 2, context);
switch(rc) {
case -ENOMEM:
@@ -737,6 +751,25 @@
return nr_failed + retry;
}
+int migrate_pages(struct list_head *from,
+ new_page_t get_new_page, unsigned long private)
+{
+ return __migrate_pages(from, get_new_page, private, 0);
+}
+
+#ifdef CONFIG_MIGRATION_BY_KERNEL
+/*
+ * When page migration is issued by the kernel itself without page mapper's
+ * mm->sem, we have to be more careful to do page migration.
+ */
+int migrate_pages_nocontext(struct list_head *from,
+ new_page_t get_new_page, unsigned long private)
+{
+ return __migrate_pages(from, get_new_page, private, 1);
+}
+
+#endif /* CONFIG_MIGRATION_BY_KERNEL */
+
#ifdef CONFIG_NUMA
/*
* Move a list of individual pages
Index: devel-2.6.22-rc1-mm1/include/linux/rmap.h
===================================================================
--- devel-2.6.22-rc1-mm1.orig/include/linux/rmap.h 2007-05-22 14:30:39.000000000 +0900
+++ devel-2.6.22-rc1-mm1/include/linux/rmap.h 2007-05-22 15:12:29.000000000 +0900
@@ -26,12 +26,16 @@
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
+#ifdef CONFIG_MIGRATION_BY_KERNEL
+ atomic_t ref; /* special refcnt for migration */
+#endif
};
#ifdef CONFIG_MMU
extern struct kmem_cache *anon_vma_cachep;
+#ifndef CONFIG_MIGRATION_BY_KERNEL
static inline struct anon_vma *anon_vma_alloc(void)
{
return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
@@ -41,6 +45,26 @@
{
kmem_cache_free(anon_vma_cachep, anon_vma);
}
+#define anon_vma_hold(page) do{}while(0)
+#define anon_vma_release(anon) do{}while(0)
+
+#else /* CONFIG_MIGRATION_BY_KERNEL */
+static inline struct anon_vma *anon_vma_alloc(void)
+{
+ struct anon_vma *ret = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+ if (ret)
+ atomic_set(&ret->ref, 0);
+ return ret;
+}
+static inline void anon_vma_free(struct anon_vma *anon_vma)
+{
+ if (atomic_read(&anon_vma->ref) == 0)
+ kmem_cache_free(anon_vma_cachep, anon_vma);
+}
+extern struct anon_vma *anon_vma_hold(struct page *page);
+extern void anon_vma_release(struct anon_vma *anon_vma);
+
+#endif /* CONFIG_MIGRATION_BY_KERNEL */
static inline void anon_vma_lock(struct vm_area_struct *vma)
{
Index: devel-2.6.22-rc1-mm1/mm/rmap.c
===================================================================
--- devel-2.6.22-rc1-mm1.orig/mm/rmap.c 2007-05-22 14:30:39.000000000 +0900
+++ devel-2.6.22-rc1-mm1/mm/rmap.c 2007-05-22 15:12:29.000000000 +0900
@@ -203,6 +203,28 @@
spin_unlock(&anon_vma->lock);
rcu_read_unlock();
}
+#ifdef CONFIG_MIGRATION_BY_KERNEL
+struct anon_vma *anon_vma_hold(struct page *page) {
+ struct anon_vma *anon_vma;
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ return NULL;
+ atomic_set(&anon_vma->ref, 1);
+ spin_unlock(&anon_vma->lock);
+ return anon_vma;
+}
+
+void anon_vma_release(struct anon_vma *anon_vma)
+{
+ int empty;
+ spin_lock(&anon_vma->lock);
+ atomic_set(&anon_vma->ref, 0);
+ empty = list_empty(&anon_vma->head);
+ spin_unlock(&anon_vma->lock);
+ if (empty)
+ anon_vma_free(anon_vma);
+}
+#endif
/*
* At what user virtual address is page expected in vma?
Index: devel-2.6.22-rc1-mm1/include/linux/migrate.h
===================================================================
--- devel-2.6.22-rc1-mm1.orig/include/linux/migrate.h 2007-05-22 14:30:39.000000000 +0900
+++ devel-2.6.22-rc1-mm1/include/linux/migrate.h 2007-05-22 15:12:29.000000000 +0900
@@ -30,7 +30,10 @@
extern int migrate_page(struct address_space *,
struct page *, struct page *);
extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long);
-
+#ifdef CONFIG_MIGRATION_BY_KERNEL
+extern int migrate_pages_nocontext(struct list_head *l, new_page_t x,
+ unsigned long);
+#endif
extern int fail_migrate_page(struct address_space *,
struct page *, struct page *);
* Re: [Patch] memory unplug v3 [2/4] migration by kernel
2007-05-22 7:04 ` [Patch] memory unplug v3 [2/4] migration by kernel KAMEZAWA Hiroyuki
@ 2007-05-22 18:49 ` Christoph Lameter
2007-05-23 1:45 ` KAMEZAWA Hiroyuki
2007-05-23 19:14 ` Mel Gorman
0 siblings, 2 replies; 20+ messages in thread
From: Christoph Lameter @ 2007-05-22 18:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto
On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
> +config MIGRATION_BY_KERNEL
> + bool "Page migration by kernel's page scan"
> + def_bool y
> + depends on MIGRATION
> + help
> + Allows page migration from kernel context. This means page migration
> + can be done by codes other than sys_migrate() system call. Will add
> + some additional check code in page migration.
I think the scope of this is much bigger than you imagine. This is also
going to be useful when Mel is going to implement defragmentation. So I
think this should not be a separate option but be on by default.
> Index: devel-2.6.22-rc1-mm1/mm/migrate.c
> ===================================================================
> --- devel-2.6.22-rc1-mm1.orig/mm/migrate.c 2007-05-22 14:30:39.000000000 +0900
> +++ devel-2.6.22-rc1-mm1/mm/migrate.c 2007-05-22 15:12:29.000000000 +0900
> @@ -607,11 +607,12 @@
> * to the newly allocated page in newpage.
> */
> static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> - struct page *page, int force)
> + struct page *page, int force, int context)
context is set if there is no context? Call this nocontext instead?
>
> - if (rc)
> + if (rc) {
> remove_migration_ptes(page, page);
> + }
Why are you adding { } here?
> +#ifdef CONFIG_MIGRATION_BY_KERNEL
> + if (anon_vma)
> + anon_vma_release(anon_vma);
> +#endif
The check for anon_vma != NULL could be put into anon_vma_release to avoid
the ifdef.
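
i.e. something like (sketch of the helper below with the check folded in):

void anon_vma_release(struct anon_vma *anon_vma)
{
	int empty;

	/* accept NULL so the caller can call this unconditionally */
	if (!anon_vma)
		return;
	spin_lock(&anon_vma->lock);
	atomic_set(&anon_vma->ref, 0);
	empty = list_empty(&anon_vma->head);
	spin_unlock(&anon_vma->lock);
	if (empty)
		anon_vma_free(anon_vma);
}
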
> Index: devel-2.6.22-rc1-mm1/mm/rmap.c
> ===================================================================
> --- devel-2.6.22-rc1-mm1.orig/mm/rmap.c 2007-05-22 14:30:39.000000000 +0900
> +++ devel-2.6.22-rc1-mm1/mm/rmap.c 2007-05-22 15:12:29.000000000 +0900
> @@ -203,6 +203,28 @@
> spin_unlock(&anon_vma->lock);
> rcu_read_unlock();
> }
> +#ifdef CONFIG_MIGRATION_BY_KERNEL
> +struct anon_vma *anon_vma_hold(struct page *page) {
> + struct anon_vma *anon_vma;
> + anon_vma = page_lock_anon_vma(page);
> + if (!anon_vma)
> + return NULL;
> + atomic_set(&anon_vma->ref, 1);
Why use an atomic value if it is set and cleared within a spinlock?
* Re: [Patch] memory unplug v3 [2/4] migration by kernel
2007-05-22 18:49 ` Christoph Lameter
@ 2007-05-23 1:45 ` KAMEZAWA Hiroyuki
2007-05-23 1:56 ` Christoph Lameter
2007-05-23 19:14 ` Mel Gorman
1 sibling, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-23 1:45 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto
On Tue, 22 May 2007 11:49:04 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
>
> > +config MIGRATION_BY_KERNEL
> > + bool "Page migration by kernel's page scan"
> > + def_bool y
> > + depends on MIGRATION
> > + help
> > + Allows page migration from kernel context. This means page migration
> > + can be done by codes other than sys_migrate() system call. Will add
> > + some additional check code in page migration.
>
> I think the scope of this is much bigger than you imagine. This is also
> going to be useful when Mel is going to implement defragmentation. So I
> think this should not be a separate option but be on by default.
ok. (Then I can remove this config.)
> > static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> > - struct page *page, int force)
> > + struct page *page, int force, int context)
>
> context is set if there is no context? Call this nocontext instead?
>
ok, this should be.
> >
> > - if (rc)
> > + if (rc) {
> > remove_migration_ptes(page, page);
> > + }
>
> Why are you adding { } here?
>
Maybe it's leftover garbage from an older version.
> > +#ifdef CONFIG_MIGRATION_BY_KERNEL
> > +struct anon_vma *anon_vma_hold(struct page *page) {
> > + struct anon_vma *anon_vma;
> > + anon_vma = page_lock_anon_vma(page);
> > + if (!anon_vma)
> > + return NULL;
> > + atomic_set(&anon_vma->ref, 1);
>
> Why use an atomic value if it is set and cleared within a spinlock?
anon_vma_free(), which reads this value, doesn't take any lock, so I made the
field an atomic_t and use atomic ops to handle it.
-Kame
* Re: [Patch] memory unplug v3 [2/4] migration by kernel
2007-05-23 1:45 ` KAMEZAWA Hiroyuki
@ 2007-05-23 1:56 ` Christoph Lameter
2007-05-23 2:09 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2007-05-23 1:56 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto
On Wed, 23 May 2007, KAMEZAWA Hiroyuki wrote:
> > > +#ifdef CONFIG_MIGRATION_BY_KERNEL
> > > +struct anon_vma *anon_vma_hold(struct page *page) {
> > > + struct anon_vma *anon_vma;
> > > + anon_vma = page_lock_anon_vma(page);
> > > + if (!anon_vma)
> > > + return NULL;
> > > + atomic_set(&anon_vma->ref, 1);
> >
> > Why use an atomic value if it is set and cleared within a spinlock?
>
> anon_vma_free(), which see this value, doesn't take any lock and use atomic ops.
> I used atomic ops to handle atomic_t.
anon_vma_free() only reads the value. Thus no race. You do not need an
atomic_t. atomic_t is only necessary if a variable needs to be changed
atomically. Reading a word from memory is atomic regardless.
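
As a sketch of the simpler variant (the field name here is made up):

struct anon_vma {
	spinlock_t lock;	/* Serialize access to vma list */
	struct list_head head;	/* List of private "related" vmas */
	int migrate_ref;	/* written under lock, read as a plain word */
};

static inline void anon_vma_free(struct anon_vma *anon_vma)
{
	if (anon_vma->migrate_ref == 0)
		kmem_cache_free(anon_vma_cachep, anon_vma);
}
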
* Re: [Patch] memory unplug v3 [2/4] migration by kernel
2007-05-23 1:56 ` Christoph Lameter
@ 2007-05-23 2:09 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-23 2:09 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto
On Tue, 22 May 2007 18:56:56 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Wed, 23 May 2007, KAMEZAWA Hiroyuki wrote:
>
> > > > +#ifdef CONFIG_MIGRATION_BY_KERNEL
> > > > +struct anon_vma *anon_vma_hold(struct page *page) {
> > > > + struct anon_vma *anon_vma;
> > > > + anon_vma = page_lock_anon_vma(page);
> > > > + if (!anon_vma)
> > > > + return NULL;
> > > > + atomic_set(&anon_vma->ref, 1);
> > >
> > > Why use an atomic value if it is set and cleared within a spinlock?
> >
> > anon_vma_free(), which see this value, doesn't take any lock and use atomic ops.
> > I used atomic ops to handle atomic_t.
>
> anon_vma_free() only reads the value. Thus no race. You do not need an
> atomic_t. atomic_t is only necessary if a variable needs to be changed
> atomically. Reading a word from memory is atomic regardless.
>
Thank you for pointing that out. I understand.
-Kame
* Re: [Patch] memory unplug v3 [2/4] migration by kernel
2007-05-22 18:49 ` Christoph Lameter
2007-05-23 1:45 ` KAMEZAWA Hiroyuki
@ 2007-05-23 19:14 ` Mel Gorman
2007-05-25 7:43 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2007-05-23 19:14 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-mm, y-goto
On Tue, 22 May 2007, Christoph Lameter wrote:
> On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
>
>> +config MIGRATION_BY_KERNEL
>> + bool "Page migration by kernel's page scan"
>> + def_bool y
>> + depends on MIGRATION
>> + help
>> + Allows page migration from kernel context. This means page migration
>> + can be done by codes other than sys_migrate() system call. Will add
>> + some additional check code in page migration.
>
> I think the scope of this is much bigger than you imagine. This is also
> going to be useful when Mel is going to implement defragmentation. So I
> think this should not be a separate option but be on by default.
>
I'm not 100% sure but chances are I need this.
I put together a memory compaction prototype today[*] to check because
it's been put off long enough. However, memory compaction works whether I
called migrate_pages() or migrate_pages_nocontext() even when regularly
compacting under load. That said, calling migrate_pages() is probably
racing like mad and I am not getting nailed for it as the test machine is
small with one CPU and the stress load is kernel compiles instead of
processes with mapped data. I'm basing compaction on top of a slightly
modified version of this patch and will revisit it later.
Incidentally, the results of the compaction at rest are;
Freelists before compaction
Node 0, zone Normal, type Unmovable 302 55 26 20 12 6 2 0 0 0 0
Node 0, zone Normal, type Reclaimable 3165 734 218 28 3 0 0 0 0 0 0
Node 0, zone Normal, type Movable 4986 2222 1980 1553 752 238 26 2 0 0 0
Node 0, zone Normal, type Reserve 5 3 0 0 1 1 0 0 1 1 0
Freelists after compaction
Node 0, zone Normal, type Unmovable 278 32 14 12 10 5 4 2 0 0 0
Node 0, zone Normal, type Reclaimable 3184 743 226 32 3 0 0 0 0 0 0
Node 0, zone Normal, type Movable 862 676 599 421 238 94 17 6 4 3 31
Node 0, zone Normal, type Reserve 1 1 1 1 1 1 1 1 1 1 0
So it's doing something and the machine hasn't killed itself in the face.
Aside, the page migration framework is ridiculously easy to work with -
kudos to all who worked on it.
[*] Considering a working prototype only took a day to put
together, I'm irritated it took me this long to get around to it.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [Patch] memory unplug v3 [2/4] migration by kernel
2007-05-23 19:14 ` Mel Gorman
@ 2007-05-25 7:43 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-25 7:43 UTC (permalink / raw)
To: Mel Gorman; +Cc: clameter, linux-mm, y-goto
On Wed, 23 May 2007 20:14:39 +0100 (IST)
Mel Gorman <mel@csn.ul.ie> wrote:
> I put together a memory compaction prototype today[*] to check because
> it's been put off long enough. However, memory compaction works whether I
> called migrate_pages() or migrate_pages_nocontext() even when regularly
> compacting under load. That said, calling migrate_pages() is probably
> racing like mad and I am not getting nailed for it as the test machine is
> small with one CPU and the stress load is kernel compiles instead of
> processes with mapped data. I'm basing compaction on top of a slightly
> modified version of this patch and will revisit it later.
>
thank you for testing :)
We (Goto-san and I) saw the !page_mapped(page) case in try_to_unmap() under heavy
memory pressure, i.e. while swapping.
So, at least,
==
+ if (page_mapped(page))
+ try_to_unmap(page, 1);
==
This change is necessary.
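For context, a rough sketch of where this guard sits in the migration path,
based on my reading of the 2.6.22-era mm/migrate.c (the function name below
is made up; try_to_unmap() and move_to_new_page() are the real helpers):
==
static int move_one_page_sketch(struct page *newpage, struct page *page)
{
	int rc = -EAGAIN;

	/* the page is assumed to be locked by the caller in this sketch */
	if (page_mapped(page))
		try_to_unmap(page, 1);	/* may already be unmapped under swap */

	if (!page_mapped(page))
		rc = move_to_new_page(newpage, page);

	return rc;
}
==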
About anon_vma, see comments in page_remove_rmap().
> Incidentally, the results of the compaction at rest are;
>
> Freelists before compaction
> Node 0, zone Normal, type Unmovable 302 55 26 20 12 6 2 0 0 0 0
> Node 0, zone Normal, type Reclaimable 3165 734 218 28 3 0 0 0 0 0 0
> Node 0, zone Normal, type Movable 4986 2222 1980 1553 752 238 26 2 0 0 0
> Node 0, zone Normal, type Reserve 5 3 0 0 1 1 0 0 1 1 0
>
> Freelists after compaction
> Node 0, zone Normal, type Unmovable 278 32 14 12 10 5 4 2 0 0 0
> Node 0, zone Normal, type Reclaimable 3184 743 226 32 3 0 0 0 0 0 0
> Node 0, zone Normal, type Movable 862 676 599 421 238 94 17 6 4 3 31
> Node 0, zone Normal, type Reserve 1 1 1 1 1 1 1 1 1 1 0
>
> So it's doing something and the machine hasn't killed itself in the face.
> Aside, the page migration framework is ridiculously easy to work with -
> kudos to all who worked on it.
>
I'll make this patch independent from memory unplug, as much as possible.
Thanks,
-Kame
* [Patch] memory unplug v3 [3/4] page removal
2007-05-22 6:58 [Patch] memory unplug v3 [0/4] KAMEZAWA Hiroyuki
2007-05-22 7:01 ` [Patch] memory unplug v3 [1/4] page isolation KAMEZAWA Hiroyuki
2007-05-22 7:04 ` [Patch] memory unplug v3 [2/4] migration by kernel KAMEZAWA Hiroyuki
@ 2007-05-22 7:07 ` KAMEZAWA Hiroyuki
2007-05-22 18:52 ` Christoph Lameter
2007-05-22 7:08 ` [Patch] memory unplug v3 [4/4] ia64 interface KAMEZAWA Hiroyuki
2007-05-22 18:34 ` [Patch] memory unplug v3 [0/4] Christoph Lameter
4 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-22 7:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter
This is the core page hot-removal patch.
How this works:
* isolate all pages in the specified range.
* for_all_pfn_in_range
- skip if !pfn_valid()
- skip if page_count(page) == 0 || PageReserved(page)
- skip if a page is already isolated (freed)
- migrate a page if it is in use (uses migrate_pages_nocontext())
- if a page cannot be migrated, return -EBUSY.
* if the timeout expires, return -EAGAIN.
* if a signal is received, return -EINTR.
* Mark all pages in the range as Reserved once they are all freed.
* This patch doesn't implement a user interface. An arch which wants to
support memory unplug should add an offline_pages() call to its remove_memory().
(see the ia64 patch)
* This patch doesn't free the memmap; that will be done by another patch.
If your arch supports it,
echo offline > /sys/devices/system/memory/memoryXXX/state
will offline the memory if it can.
Offlined memory can be onlined again by
echo online > /sys/devices/system/memory/memoryXXX/state
A kind of defrag by hand :).
I wonder whether the logic can be made simpler and more sophisticated...
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Index: devel-2.6.22-rc1-mm1/mm/Kconfig
===================================================================
--- devel-2.6.22-rc1-mm1.orig/mm/Kconfig 2007-05-22 15:12:29.000000000 +0900
+++ devel-2.6.22-rc1-mm1/mm/Kconfig 2007-05-22 15:12:30.000000000 +0900
@@ -126,6 +126,12 @@
def_bool y
depends on SPARSEMEM && MEMORY_HOTPLUG
+config MEMORY_HOTREMOVE
+ bool "Allow for memory hot remove"
+ depends on MEMORY_HOTPLUG
+ select MIGRATION
+ select MIGRATION_BY_KERNEL
+
# Heavily threaded applications may benefit from splitting the mm-wide
# page_table_lock, so that faults on different parts of the user address
# space can be handled with less contention: split it at this NR_CPUS.
Index: devel-2.6.22-rc1-mm1/mm/memory_hotplug.c
===================================================================
--- devel-2.6.22-rc1-mm1.orig/mm/memory_hotplug.c 2007-05-22 14:30:39.000000000 +0900
+++ devel-2.6.22-rc1-mm1/mm/memory_hotplug.c 2007-05-22 15:12:30.000000000 +0900
@@ -23,6 +23,9 @@
#include <linux/vmalloc.h>
#include <linux/ioport.h>
#include <linux/cpuset.h>
+#include <linux/delay.h>
+#include <linux/migrate.h>
+#include <linux/page-isolation.h>
#include <asm/tlbflush.h>
@@ -308,3 +311,196 @@
return ret;
}
EXPORT_SYMBOL_GPL(add_memory);
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+static struct page *
+hotremove_migrate_alloc(struct page *page,
+ unsigned long private,
+ int **x)
+{
+ return alloc_page(GFP_HIGH_MOVABLE);
+}
+
+
+#define NR_OFFLINE_AT_ONCE_PAGES (256)
+static int
+do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct page *page;
+ int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
+ int not_managed = 0;
+ int ret = 0;
+ LIST_HEAD(source);
+
+ for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ /* page is isolated or being freed ? */
+ if ((page_count(page) == 0) || PageReserved(page))
+ continue;
+ ret = isolate_lru_page(page, &source);
+
+ if (ret == 0) {
+ move_pages--;
+ } else {
+ not_managed++;
+ }
+ }
+ ret = -EBUSY;
+ if (not_managed) {
+ if (!list_empty(&source))
+ putback_lru_pages(&source);
+ goto out;
+ }
+ ret = 0;
+ if (list_empty(&source))
+ goto out;
+ /* this function returns # of failed pages */
+ ret = migrate_pages_nocontext(&source, hotremove_migrate_alloc, 0);
+
+out:
+ return ret;
+}
+
+/*
+ * remove from free_area[] and mark all as Reserved.
+ */
+static void
+offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+{
+ struct resource res;
+ unsigned long tmp_start, tmp_end;
+
+ res.start = start_pfn << PAGE_SHIFT;
+ res.end = (end_pfn - 1) << PAGE_SHIFT;
+ res.flags = IORESOURCE_MEM;
+ while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
+ tmp_start = res.start >> PAGE_SHIFT;
+ tmp_end = (res.end >> PAGE_SHIFT) + 1;
+ /* this function touches free_area[]...so please see
+ page_alloc.c */
+ __offline_isolated_pages(tmp_start, tmp_end);
+ res.start = res.end + 1;
+ res.end = end_pfn;
+ }
+}
+
+/*
+ * Check that all pages in the range, recorded as memory resources, are isolated.
+ */
+static long
+check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
+{
+ struct resource res;
+ unsigned long tmp_start, tmp_end;
+ int ret, offlined = 0;
+
+ res.start = start_pfn << PAGE_SHIFT;
+ res.end = (end_pfn - 1) << PAGE_SHIFT;
+ res.flags = IORESOURCE_MEM;
+ while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
+ tmp_start = res.start >> PAGE_SHIFT;
+ tmp_end = (res.end >> PAGE_SHIFT) + 1;
+ ret = test_pages_isolated(tmp_start, tmp_end);
+ if (ret)
+ return -EBUSY;
+ offlined += tmp_end - tmp_start;
+ res.start = res.end + 1;
+ res.end = end_pfn;
+ }
+ return offlined;
+}
+
+
+int offline_pages(unsigned long start_pfn,
+ unsigned long end_pfn, unsigned long timeout)
+{
+ unsigned long pfn, nr_pages, expire;
+ long offlined_pages;
+ int ret;
+ struct page *page;
+ struct zone *zone;
+
+ BUG_ON(start_pfn >= end_pfn);
+ /* at least, alignment against pageblock is necessary */
+ if (start_pfn & (NR_PAGES_ISOLATION_BLOCK - 1))
+ return -EINVAL;
+ if (end_pfn & (NR_PAGES_ISOLATION_BLOCK - 1))
+ return -EINVAL;
+ /* This makes hotplug much easier... and readable.
+ We assume this for now. */
+ if (page_zone(pfn_to_page(start_pfn)) !=
+ page_zone(pfn_to_page(end_pfn - 1)))
+ return -EINVAL;
+ /* set above range as isolated */
+ ret = isolate_pages(start_pfn, end_pfn);
+ if (ret)
+ return ret;
+ nr_pages = end_pfn - start_pfn;
+ pfn = start_pfn;
+ expire = jiffies + timeout;
+repeat:
+ /* start memory hot removal */
+ ret = -EAGAIN;
+ if (time_after(jiffies, expire))
+ goto failed_removal;
+ ret = -EINTR;
+ if (signal_pending(current))
+ goto failed_removal;
+ ret = 0;
+ /* drain all zone's lru pagevec */
+ lru_add_drain_all();
+
+ /* skip isolated pages */
+ for(; pfn < end_pfn; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ if (PageReserved(page))
+ continue;
+ if (!is_page_isolated(page))
+ break;
+ }
+ /* start point is here */
+ if (pfn != end_pfn) {
+ ret = do_migrate_range(pfn, end_pfn);
+ if (!ret) {
+ cond_resched();
+ goto repeat;
+ } else if (ret < 0) {
+ goto failed_removal;
+ } else if (ret > 0) {
+ /* some congestion found. sleep a bit */
+ msleep(10);
+ goto repeat;
+ }
+ }
+ /* check again */
+ ret = check_pages_isolated(start_pfn, end_pfn);
+ if (ret < 0) {
+ goto failed_removal;
+ }
+ offlined_pages = ret;
+ /* OK, all of our target range is isolated.
+ We cannot roll back at this point. */
+ offline_isolated_pages(start_pfn, end_pfn);
+ /* removal success */
+ zone = page_zone(pfn_to_page(start_pfn));
+ zone->present_pages -= offlined_pages;
+ zone->zone_pgdat->node_present_pages -= offlined_pages;
+ totalram_pages -= offlined_pages;
+ num_physpages -= offlined_pages;
+ vm_total_pages = nr_free_pagecache_pages();
+ writeback_set_ratelimit();
+ return 0;
+
+failed_removal:
+ printk("memory offlining %lx to %lx failed\n",start_pfn, end_pfn);
+ /* pushback to free area */
+ free_isolated_pages(start_pfn, end_pfn);
+ return ret;
+}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
Index: devel-2.6.22-rc1-mm1/include/linux/memory_hotplug.h
===================================================================
--- devel-2.6.22-rc1-mm1.orig/include/linux/memory_hotplug.h 2007-05-22 14:30:39.000000000 +0900
+++ devel-2.6.22-rc1-mm1/include/linux/memory_hotplug.h 2007-05-22 15:12:30.000000000 +0900
@@ -59,7 +59,10 @@
extern void online_page(struct page *page);
/* VM interface that may be used by firmware interface */
extern int online_pages(unsigned long, unsigned long);
-
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern int offline_pages(unsigned long, unsigned long, unsigned long);
+extern void __offline_isolated_pages(unsigned long, unsigned long);
+#endif
/* reasonably generic interface to expand the physical pages in a zone */
extern int __add_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
Index: devel-2.6.22-rc1-mm1/mm/page_alloc.c
===================================================================
--- devel-2.6.22-rc1-mm1.orig/mm/page_alloc.c 2007-05-22 15:12:28.000000000 +0900
+++ devel-2.6.22-rc1-mm1/mm/page_alloc.c 2007-05-22 15:12:30.000000000 +0900
@@ -4447,3 +4447,52 @@
out:
spin_unlock_irqrestore(&zone->lock, flags);
}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/*
+ * All pages in the range must be isolated before calling this.
+ */
+void
+__offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+{
+ struct page *page, *tmp;
+ struct zone *zone;
+ struct free_area *area;
+ int order, i;
+ unsigned long pfn;
+ /* find the first valid pfn */
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ if (pfn_valid(pfn))
+ break;
+ if (pfn == end_pfn)
+ return;
+ zone = page_zone(pfn_to_page(pfn));
+ spin_lock(&zone->lock);
+ printk("do isoalte \n");
+ for (order = 0; order < MAX_ORDER; order++) {
+ area = &zone->free_area[order];
+ list_for_each_entry_safe(page, tmp,
+ &area->free_list[MIGRATE_ISOLATE],
+ lru) {
+ pfn = page_to_pfn(page);
+ if (pfn < start_pfn || end_pfn <= pfn)
+ continue;
+ printk("found %lx %lx %lx\n",
+ start_pfn, pfn, end_pfn);
+ list_del(&page->lru);
+ rmv_page_order(page);
+ area->nr_free--;
+ __mod_zone_page_state(zone, NR_FREE_PAGES,
+ - (1UL << order));
+ }
+ }
+ spin_unlock(&zone->lock);
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ BUG_ON(page_count(page));
+ SetPageReserved(page);
+ }
+}
+#endif
* Re: [Patch] memory unplug v3 [3/4] page removal
2007-05-22 7:07 ` [Patch] memory unplug v3 [3/4] page removal KAMEZAWA Hiroyuki
@ 2007-05-22 18:52 ` Christoph Lameter
2007-05-23 1:50 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2007-05-22 18:52 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto
On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
> +static int
> +do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned long pfn;
> + struct page *page;
> + int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
> + int not_managed = 0;
> + int ret = 0;
> + LIST_HEAD(source);
> +
> + for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
> + if (!pfn_valid(pfn))
> + continue;
> + page = pfn_to_page(pfn);
> + /* page is isolated or being freed ? */
> + if ((page_count(page) == 0) || PageReserved(page))
> + continue;
The check above is not necessary. A page with count == 0 is not on the LRU,
and neither is a Reserved page.
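A sketch of why, simplified from isolate_lru_page() as I remember it (the
lru_lock and some refcount details are omitted, so treat this as
illustrative only):
==
int isolate_lru_page_sketch(struct page *page, struct list_head *pagelist)
{
	int ret = -EBUSY;

	/* anything not on the LRU is rejected here anyway, which covers
	 * both count == 0 pages and Reserved pages */
	if (PageLRU(page) && get_page_unless_zero(page)) {
		ClearPageLRU(page);
		list_add_tail(&page->lru, pagelist);
		ret = 0;
	}
	return ret;
}
==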
> + /* this function returns # of failed pages */
> + ret = migrate_pages_nocontext(&source, hotremove_migrate_alloc, 0);
You have no context so the last parameter should be 1?
* Re: [Patch] memory unplug v3 [3/4] page removal
2007-05-22 18:52 ` Christoph Lameter
@ 2007-05-23 1:50 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-23 1:50 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto
On Tue, 22 May 2007 11:52:11 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
>
> > +static int
> > +do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + unsigned long pfn;
> > + struct page *page;
> > + int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
> > + int not_managed = 0;
> > + int ret = 0;
> > + LIST_HEAD(source);
> > +
> > + for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
> > + if (!pfn_valid(pfn))
> > + continue;
> > + page = pfn_to_page(pfn);
> > + /* page is isolated or being freed ? */
> > + if ((page_count(page) == 0) || PageReserved(page))
> > + continue;
>
> The check above is not necessary. A Page count = 0 page is not on the LRU
> neither is a Reserved page.
Ah, ok, but I'm currently treating an error from isolate_lru_page() as fatal.
This check avoids isolate_lru_page() returning an error because of !PageLRU().
I'll reconsider this part.
> > + /* this function returns # of failed pages */
> > + ret = migrate_pages_nocontext(&source, hotremove_migrate_alloc, 0);
>
> You have no context so the last parameter should be 1?
>
migrate_pages_nocontext()'s 3rd param is the same as migrate_pages()'s 3rd param, 'private'.
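In other words, the third argument is only passed through to the allocation
callback; a minimal sketch in the shape of this patch's
hotremove_migrate_alloc() (the callback name below is illustrative):
==
static struct page *alloc_target_sketch(struct page *page,
					unsigned long private, int **x)
{
	/* 'private' is whatever the caller passed as the 3rd argument (0 below) */
	return alloc_page(GFP_HIGH_MOVABLE);
}

	/* caller side */
	ret = migrate_pages_nocontext(&source, alloc_target_sketch, 0);
==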
-Kame
* [Patch] memory unplug v3 [4/4] ia64 interface
2007-05-22 6:58 [Patch] memory unplug v3 [0/4] KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2007-05-22 7:07 ` [Patch] memory unplug v3 [3/4] page removal KAMEZAWA Hiroyuki
@ 2007-05-22 7:08 ` KAMEZAWA Hiroyuki
2007-05-22 18:34 ` [Patch] memory unplug v3 [0/4] Christoph Lameter
4 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-22 7:08 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mel, y-goto, clameter
Add a call to offline_pages() for ia64.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Index: devel-2.6.22-rc1-mm1/arch/ia64/mm/init.c
===================================================================
--- devel-2.6.22-rc1-mm1.orig/arch/ia64/mm/init.c 2007-05-22 14:30:38.000000000 +0900
+++ devel-2.6.22-rc1-mm1/arch/ia64/mm/init.c 2007-05-22 15:12:31.000000000 +0900
@@ -724,7 +724,17 @@
int remove_memory(u64 start, u64 size)
{
- return -EINVAL;
+ unsigned long start_pfn, end_pfn;
+ unsigned long timeout = 120 * HZ;
+ int ret;
+ start_pfn = start >> PAGE_SHIFT;
+ end_pfn = start_pfn + (size >> PAGE_SHIFT);
+ ret = offline_pages(start_pfn, end_pfn, timeout);
+ if (ret)
+ goto out;
+ /* we can free mem_map at this point */
+out:
+ return ret;
}
EXPORT_SYMBOL_GPL(remove_memory);
#endif
* Re: [Patch] memory unplug v3 [0/4]
2007-05-22 6:58 [Patch] memory unplug v3 [0/4] KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2007-05-22 7:08 ` [Patch] memory unplug v3 [4/4] ia64 interface KAMEZAWA Hiroyuki
@ 2007-05-22 18:34 ` Christoph Lameter
2007-05-23 1:59 ` KAMEZAWA Hiroyuki
4 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2007-05-22 18:34 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Linux-MM, mel, y-goto
On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
> > - use the kernelcore=XXX boot option to create ZONE_MOVABLE.
> Memory unplug itself can work without ZONE_MOVABLE but it will be
> better to use kernelcore= if your section size is big.
Hmmm.... Sure wish the ZONE_MOVABLE would go away. Isn't there some way to
have a dynamic boundary within ZONE_NORMAL?
* Re: [Patch] memory unplug v3 [0/4]
2007-05-22 18:34 ` [Patch] memory unplug v3 [0/4] Christoph Lameter
@ 2007-05-23 1:59 ` KAMEZAWA Hiroyuki
2007-05-23 2:09 ` Christoph Lameter
0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-23 1:59 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, mel, y-goto
On Tue, 22 May 2007 11:34:04 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
>
> > - use the kernelcore=XXX boot option to create ZONE_MOVABLE.
> > Memory unplug itself can work without ZONE_MOVABLE but it will be
> > better to use kernelcore= if your section size is big.
>
> Hmmm.... Sure wish the ZONE_MOVABLE would go away. Isn't there some way to
> have a dynamic boundary within ZONE_NORMAL?
>
Hmm.
1. Assume there is only ZONE_NORMAL.
2. Group pages into MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.
Some ranges of pages can be used "only" for MIGRATE_MOVABLE (+ RECLAIMABLE).
3. The page reclaiming algorithm should know what type of pages it should reclaim.
Current page reclaiming is zone-based, so I think adding a zone is not a bad option
as long as we use zone-based reclaiming.
If I think of a simple way to avoid adding a new zone, I'll post it, but not yet.
-Kame