[PATCH 0/7] [RFC] Memory Compaction v1

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/7] [RFC] Memory Compaction v1
@ 2007-05-29 17:36 Mel Gorman
  2007-05-29 17:36 ` [PATCH 1/7] Roll-up patch of what has been sent already Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 17:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman, kamezawa.hiroyu, clameter

This is a prototype for compacting memory to reduce external fragmentation
so that free memory exists as fewer, but larger contiguous blocks. Rather
than being a full defragmentation solution, this focuses exclusively on
pages that are movable via the page migration mechanism.

The compaction mechanism operates within a zone and moves movable pages
towards the end of.  Grouping pages by mobility already biases the location
of unmovable pages is biased towards the lower addresses, so these strategies
work in conjunction.

A single compaction run involves two scanners operating within a zone - a
migration and a free scanner. The migration scanner starts at the beginning
of a zone and finds all movable pages within one pageblock_nr_pages-sized
area and isolates them on a migratepages list. The free scanner begins at
the end of the zone and searches on a per-area basis for enough free pages to
migrate all the pages on the migratepages list. As each area is respecively
migrated or exhaused of free pages, the scanners are advanced one area.
A compaction run completes within a zone when the two scanners meet.

This is what /proc/buddyinfo looks like before and after a compaction run.

mel@arnold:~/results$ cat before-buddyinfo.txt 
Node 0, zone      DMA    150     33      6      4      2      1      1      1      1      0      0 
Node 0, zone   Normal   7901   3005   2205   1511    758    245     34      3      0      1      0 

mel@arnold:~/results$ cat after-buddyinfo.txt 
Node 0, zone      DMA    150     33      6      4      2      1      1      1      1      0      0 
Node 0, zone   Normal   1900   1187    609    325    228    178    110     32      6      4     24 

In this patchset, memory is never compacted automatically and is only triggered
by writing a node number to /proc/sys/vm/compact_node. This version of the
patchset is mainly concerned with getting the compaction mechanism correct.

The first patch is a roll-up patch of changes to grouping pages by mobility
posted to the linux-mm list but not merged into -mm yet. The second patch
is from the memory hot-remove patchset which memory compaction can use.

The two patches after that are changes to page migration. The third patch
allows CONFIG_MIGRATION to be set without CONFIG_NUMA.  The fourth patch
allows LRU pages to be isolated in batch instead of acquiring and releasing
the LRU lock a lot.

The fifth patch exports some metrics on external fragmentation which are
relevant to memory compaction. The sixth patch is what implements memory
compaction for a single zone. The final patch enables a node to be compacted
explicitly by writing to a special file in /proc.

This patchset has been tested based on 2.6.22-rc2-mm1 with the following;

o x86 with one CPU, 512MB RAM, FLATMEM
o x86 with four CPUs, 2GB RAM, FLATMEM
o x86_64 with four CPUs, 1GB of RAM, FLATMEM
o x86_64 with four CPUs, 8GB of RAM, DISCONTIG NUMA with 4 nodes
o ppc64 with two CPUs, 2GB of RAM, SPARSEMEM
o IA64 with four CPUs, 1GB of RAM, FLATMEM + VIRTUAL_MEM_MAP

The x86 with one CPU is the only machine that has been tested under
stress. The others was a minimal boot-test followed by compaction under
no load.

This patchset is incomplete. Here some outstanding items on a TODO list in
no particular order.

o Have pageblock_suitable_migration() check the number of free pages properly
o Do not call lru_add_drain_all() on every update
o Add trigger to directly compact before reclaiming for high orders
o Make the fragmentation statistics independent of CONFIG_MIGRATION
o Obey watermarks in split_pagebuddy_page
o Handle free pages intelligently when they are larger than pageblock_order
o Implement compaction_debug boot-time option like slub_debug
o Implement compaction_disable boot-time option just in case
o Investigate using debugfs as the manual compaction trigger instead of proc
o Deal with MIGRATE_RESERVE during compaction properly
o Build test to verify correctness and behaviour under load

Any comments on this first version are welcome.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/7] Roll-up patch of what has been sent already
  2007-05-29 17:36 [PATCH 0/7] [RFC] Memory Compaction v1 Mel Gorman
@ 2007-05-29 17:36 ` Mel Gorman
  2007-05-29 17:36 ` [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 17:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman, kamezawa.hiroyu, clameter

This contains some bug fixes including one to PAGE_OWNER, grouping by
arbitrary order, statistics work. The patches have been sent piecemeal
to linux-mm already so the are not broken-out here. At time of sending
the patches have not been merged to -mm so I am including them here for
convenience.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 Documentation/page_owner.c      |    3 
 arch/ia64/Kconfig               |    5 
 arch/ia64/mm/hugetlbpage.c      |    4 
 fs/proc/proc_misc.c             |   50 ++++
 include/linux/gfp.h             |   12 +
 include/linux/mmzone.h          |   14 +
 include/linux/pageblock-flags.h |   24 ++
 mm/internal.h                   |   10 
 mm/page_alloc.c                 |  108 ++++------
 mm/vmstat.c                     |  377 +++++++++++++++++++++++++++--------
 10 files changed, 465 insertions(+), 142 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/arch/ia64/Kconfig linux-2.6.22-rc2-mm1-001_lameter-v4r4/arch/ia64/Kconfig
--- linux-2.6.22-rc2-mm1-clean/arch/ia64/Kconfig	2007-05-24 10:13:32.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/arch/ia64/Kconfig	2007-05-28 14:09:40.000000000 +0100
@@ -54,6 +54,11 @@ config ARCH_HAS_ILOG2_U64
 	bool
 	default n
 
+config HUGETLB_PAGE_SIZE_VARIABLE
+	bool
+	depends on HUGETLB_PAGE
+	default y
+
 config GENERIC_FIND_NEXT_BIT
 	bool
 	default y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/arch/ia64/mm/hugetlbpage.c linux-2.6.22-rc2-mm1-001_lameter-v4r4/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.22-rc2-mm1-clean/arch/ia64/mm/hugetlbpage.c	2007-05-19 05:06:17.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/arch/ia64/mm/hugetlbpage.c	2007-05-28 14:09:40.000000000 +0100
@@ -195,6 +195,6 @@ static int __init hugetlb_setup_sz(char 
 	 * override here with new page shift.
 	 */
 	ia64_set_rr(HPAGE_REGION_BASE, hpage_shift << 2);
-	return 1;
+	return 0;
 }
-__setup("hugepagesz=", hugetlb_setup_sz);
+early_param("hugepagesz", hugetlb_setup_sz);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/Documentation/page_owner.c linux-2.6.22-rc2-mm1-001_lameter-v4r4/Documentation/page_owner.c
--- linux-2.6.22-rc2-mm1-clean/Documentation/page_owner.c	2007-05-24 10:13:32.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/Documentation/page_owner.c	2007-05-28 14:09:40.000000000 +0100
@@ -2,7 +2,8 @@
  * User-space helper to sort the output of /proc/page_owner
  *
  * Example use:
- * cat /proc/page_owner > page_owner.txt
+ * cat /proc/page_owner > page_owner_full.txt
+ * grep -v ^PFN page_owner_full.txt > page_owner.txt
  * ./sort page_owner.txt sorted_page_owner.txt
 */
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/fs/proc/proc_misc.c linux-2.6.22-rc2-mm1-001_lameter-v4r4/fs/proc/proc_misc.c
--- linux-2.6.22-rc2-mm1-clean/fs/proc/proc_misc.c	2007-05-24 10:13:33.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/fs/proc/proc_misc.c	2007-05-28 14:09:40.000000000 +0100
@@ -232,6 +232,19 @@ static const struct file_operations frag
 	.release	= seq_release,
 };
 
+extern struct seq_operations pagetypeinfo_op;
+static int pagetypeinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &pagetypeinfo_op);
+}
+
+static const struct file_operations pagetypeinfo_file_ops = {
+	.open		= pagetypeinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 extern struct seq_operations zoneinfo_op;
 static int zoneinfo_open(struct inode *inode, struct file *file)
 {
@@ -748,6 +761,7 @@ read_page_owner(struct file *file, char 
 	unsigned long offset = 0, symsize;
 	int i;
 	ssize_t num_written = 0;
+	int blocktype = 0, pagetype = 0;
 
 	pfn = min_low_pfn + *ppos;
 	page = pfn_to_page(pfn);
@@ -755,8 +769,16 @@ read_page_owner(struct file *file, char 
 		if (!pfn_valid(pfn))
 			continue;
 		page = pfn_to_page(pfn);
+
+		/* Catch situations where free pages have a bad ->order  */
+		if (page->order >= 0 && PageBuddy(page))
+			printk(KERN_WARNING
+				"PageOwner info inaccurate for PFN %lu\n",
+				pfn);
+
 		if (page->order >= 0)
 			break;
+
 		next_idx++;
 	}
 
@@ -776,6 +798,33 @@ read_page_owner(struct file *file, char 
 		goto out;
 	}
 
+	/* Print information relevant to grouping pages by mobility */
+	blocktype = get_pageblock_migratetype(page);
+	pagetype  = allocflags_to_migratetype(page->gfp_mask);
+	ret += snprintf(kbuf+ret, count-ret,
+			"PFN %lu Block %lu type %d %s "
+			"Flags %s%s%s%s%s%s%s%s%s%s%s%s\n",
+			pfn,
+			pfn >> pageblock_order,
+			blocktype,
+			blocktype != pagetype ? "Fallback" : "        ",
+			PageLocked(page)	? "K" : " ",
+			PageError(page)		? "E" : " ",
+			PageReferenced(page)	? "R" : " ",
+			PageUptodate(page)	? "U" : " ",
+			PageDirty(page)		? "D" : " ",
+			PageLRU(page)		? "L" : " ",
+			PageActive(page)	? "A" : " ",
+			PageSlab(page)		? "S" : " ",
+			PageWriteback(page)	? "W" : " ",
+			PageCompound(page)	? "C" : " ",
+			PageSwapCache(page)	? "B" : " ",
+			PageMappedToDisk(page)	? "M" : " ");
+	if (ret >= count) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
 	num_written = ret;
 
 	for (i = 0; i < 8; i++) {
@@ -874,6 +923,7 @@ void __init proc_misc_init(void)
 #endif
 #endif
 	create_seq_entry("buddyinfo",S_IRUGO, &fragmentation_file_operations);
+	create_seq_entry("pagetypeinfo", S_IRUGO, &pagetypeinfo_file_ops);
 	create_seq_entry("vmstat",S_IRUGO, &proc_vmstat_file_operations);
 	create_seq_entry("zoneinfo",S_IRUGO, &proc_zoneinfo_file_operations);
 #ifdef CONFIG_BLOCK
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/include/linux/gfp.h linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/gfp.h
--- linux-2.6.22-rc2-mm1-clean/include/linux/gfp.h	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/gfp.h	2007-05-28 14:09:40.000000000 +0100
@@ -101,6 +101,18 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* Convert GFP flags to their corresponding migrate type */
+static inline int allocflags_to_migratetype(gfp_t gfp_flags)
+{
+	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+
+	if (unlikely(page_group_by_mobility_disabled))
+		return MIGRATE_UNMOVABLE;
+
+	/* Group based on mobility */
+	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
+		((gfp_flags & __GFP_RECLAIMABLE) != 0);
+}
 
 static inline enum zone_type gfp_zone(gfp_t flags)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/include/linux/mmzone.h linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/mmzone.h
--- linux-2.6.22-rc2-mm1-clean/include/linux/mmzone.h	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/mmzone.h	2007-05-28 14:09:40.000000000 +0100
@@ -45,6 +45,16 @@ extern int page_group_by_mobility_disabl
 	for (order = 0; order < MAX_ORDER; order++) \
 		for (type = 0; type < MIGRATE_TYPES; type++)
 
+extern int page_group_by_mobility_disabled;
+
+static inline int get_pageblock_migratetype(struct page *page)
+{
+	if (unlikely(page_group_by_mobility_disabled))
+		return MIGRATE_UNMOVABLE;
+
+	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
+}
+
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
@@ -238,7 +248,7 @@ struct zone {
 
 #ifndef CONFIG_SPARSEMEM
 	/*
-	 * Flags for a MAX_ORDER_NR_PAGES block. See pageblock-flags.h.
+	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
 	 * In SPARSEMEM, this map is stored in struct mem_section
 	 */
 	unsigned long		*pageblock_flags;
@@ -713,7 +723,7 @@ extern struct zone *next_zone(struct zon
 #define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION-1))
 
 #define SECTION_BLOCKFLAGS_BITS \
-		((1 << (PFN_SECTION_SHIFT - (MAX_ORDER-1))) * NR_PAGEBLOCK_BITS)
+	((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
 
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/include/linux/pageblock-flags.h linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/pageblock-flags.h
--- linux-2.6.22-rc2-mm1-clean/include/linux/pageblock-flags.h	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/pageblock-flags.h	2007-05-28 14:09:40.000000000 +0100
@@ -1,6 +1,6 @@
 /*
  * Macros for manipulating and testing flags related to a
- * MAX_ORDER_NR_PAGES block of pages.
+ * pageblock_nr_pages number of pages.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -35,6 +35,28 @@ enum pageblock_bits {
 	NR_PAGEBLOCK_BITS
 };
 
+#ifdef CONFIG_HUGETLB_PAGE
+
+#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
+
+/* Huge page sizes are variable */
+extern int pageblock_order;
+
+#else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+
+/* Huge pages are a constant size */
+#define pageblock_order		HUGETLB_PAGE_ORDER
+
+#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+
+#else /* CONFIG_HUGETLB_PAGE */
+
+/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
+#define pageblock_order		(MAX_ORDER-1)
+#endif /* CONFIG_HUGETLB_PAGE */
+
+#define pageblock_nr_pages	(1UL << pageblock_order)
+
 /* Forward declaration */
 struct page;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/mm/internal.h linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/internal.h
--- linux-2.6.22-rc2-mm1-clean/mm/internal.h	2007-05-19 05:06:17.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/internal.h	2007-05-28 14:09:40.000000000 +0100
@@ -37,4 +37,14 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+/*
+ * function for dealing with page's order in buddy system.
+ * zone->lock is already acquired when we use these.
+ * So, we don't need atomic page->flags operations here.
+ */
+static inline unsigned long page_order(struct page *page)
+{
+	VM_BUG_ON(!PageBuddy(page));
+	return page_private(page);
+}
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/mm/page_alloc.c linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/page_alloc.c
--- linux-2.6.22-rc2-mm1-clean/mm/page_alloc.c	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/page_alloc.c	2007-05-28 14:09:40.000000000 +0100
@@ -59,6 +59,10 @@ unsigned long totalreserve_pages __read_
 long nr_swap_pages;
 int percpu_pagelist_fraction;
 
+#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
+int pageblock_order __read_mostly;
+#endif
+
 static void __free_pages_ok(struct page *page, unsigned int order);
 
 /*
@@ -151,32 +155,12 @@ EXPORT_SYMBOL(nr_node_ids);
 
 int page_group_by_mobility_disabled __read_mostly;
 
-static inline int get_pageblock_migratetype(struct page *page)
-{
-	if (unlikely(page_group_by_mobility_disabled))
-		return MIGRATE_UNMOVABLE;
-
-	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
-}
-
 static void set_pageblock_migratetype(struct page *page, int migratetype)
 {
 	set_pageblock_flags_group(page, (unsigned long)migratetype,
 					PB_migrate, PB_migrate_end);
 }
 
-static inline int allocflags_to_migratetype(gfp_t gfp_flags)
-{
-	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
-
-	if (unlikely(page_group_by_mobility_disabled))
-		return MIGRATE_UNMOVABLE;
-
-	/* Cluster based on mobility */
-	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
-		((gfp_flags & __GFP_RECLAIMABLE) != 0);
-}
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -336,20 +320,13 @@ static inline void prep_zero_page(struct
 		clear_highpage(page + i);
 }
 
-/*
- * function for dealing with page's order in buddy system.
- * zone->lock is already acquired when we use these.
- * So, we don't need atomic page->flags operations here.
- */
-static inline unsigned long page_order(struct page *page)
-{
-	return page_private(page);
-}
-
 static inline void set_page_order(struct page *page, int order)
 {
 	set_page_private(page, order);
 	__SetPageBuddy(page);
+#ifdef CONFIG_PAGE_OWNER
+		page->order = -1;
+#endif
 }
 
 static inline void rmv_page_order(struct page *page)
@@ -719,7 +696,7 @@ static int fallbacks[MIGRATE_TYPES][MIGR
 
 /*
  * Move the free pages in a range to the free lists of the requested type.
- * Note that start_page and end_pages are not aligned in a MAX_ORDER_NR_PAGES
+ * Note that start_page and end_pages are not aligned on a pageblock
  * boundary. If alignment is required, use move_freepages_block()
  */
 int move_freepages(struct zone *zone,
@@ -728,7 +705,7 @@ int move_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long order;
-	int blocks_moved = 0;
+	int pages_moved = 0;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -757,10 +734,10 @@ int move_freepages(struct zone *zone,
 		list_add(&page->lru,
 			&zone->free_area[order].free_list[migratetype]);
 		page += 1 << order;
-		blocks_moved++;
+		pages_moved += 1 << order;
 	}
 
-	return blocks_moved;
+	return pages_moved;
 }
 
 int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
@@ -769,10 +746,10 @@ int move_freepages_block(struct zone *zo
 	struct page *start_page, *end_page;
 
 	start_pfn = page_to_pfn(page);
-	start_pfn = start_pfn & ~(MAX_ORDER_NR_PAGES-1);
+	start_pfn = start_pfn & ~(pageblock_nr_pages-1);
 	start_page = pfn_to_page(start_pfn);
-	end_page = start_page + MAX_ORDER_NR_PAGES - 1;
-	end_pfn = start_pfn + MAX_ORDER_NR_PAGES - 1;
+	end_page = start_page + pageblock_nr_pages - 1;
+	end_pfn = start_pfn + pageblock_nr_pages - 1;
 
 	/* Do not cross zone boundaries */
 	if (start_pfn < zone->zone_start_pfn)
@@ -836,14 +813,14 @@ static struct page *__rmqueue_fallback(s
 			 * back for a reclaimable kernel allocation, be more
 			 * agressive about taking ownership of free pages
 			 */
-			if (unlikely(current_order >= MAX_ORDER / 2) ||
+			if (unlikely(current_order >= (pageblock_order >> 1)) ||
 					start_migratetype == MIGRATE_RECLAIMABLE) {
 				unsigned long pages;
 				pages = move_freepages_block(zone, page,
 								start_migratetype);
 
 				/* Claim the whole block if over half of it is free */
-				if ((pages << current_order) >= (1 << (MAX_ORDER-2)))
+				if (pages >= (1 << (pageblock_order-1)))
 					set_pageblock_migratetype(page,
 								start_migratetype);
 
@@ -856,7 +833,7 @@ static struct page *__rmqueue_fallback(s
 			__mod_zone_page_state(zone, NR_FREE_PAGES,
 							-(1UL << order));
 
-			if (current_order == MAX_ORDER - 1)
+			if (current_order == pageblock_order)
 				set_pageblock_migratetype(page,
 							start_migratetype);
 
@@ -1771,9 +1748,6 @@ fastcall void __free_pages(struct page *
 			free_hot_page(page);
 		else
 			__free_pages_ok(page, order);
-#ifdef CONFIG_PAGE_OWNER
-		page->order = -1;
-#endif
 	}
 }
 
@@ -2426,7 +2400,7 @@ void build_all_zonelists(void)
 	 * made on memory-hotadd so a system can start with mobility
 	 * disabled and enable it later
 	 */
-	if (vm_total_pages < (MAX_ORDER_NR_PAGES * MIGRATE_TYPES))
+	if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
 		page_group_by_mobility_disabled = 1;
 	else
 		page_group_by_mobility_disabled = 0;
@@ -2511,7 +2485,7 @@ static inline unsigned long wait_table_b
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
 /*
- * Mark a number of MAX_ORDER_NR_PAGES blocks as MIGRATE_RESERVE. The number
+ * Mark a number of pageblocks as MIGRATE_RESERVE. The number
  * of blocks reserved is based on zone->pages_min. The memory within the
  * reserve will tend to store contiguous free pages. Setting min_free_kbytes
  * higher will lead to a bigger reserve which will get freed as contiguous
@@ -2526,9 +2500,10 @@ static void setup_zone_migrate_reserve(s
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
-	reserve = roundup(zone->pages_min, MAX_ORDER_NR_PAGES) >> (MAX_ORDER-1);
+	reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
+							pageblock_order;
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		if (!pfn_valid(pfn))
 			continue;
 		page = pfn_to_page(pfn);
@@ -2603,7 +2578,7 @@ void __meminit memmap_init_zone(unsigned
 		 * the start are marked MIGRATE_RESERVE by
 		 * setup_zone_migrate_reserve()
 		 */
-		if ((pfn & (MAX_ORDER_NR_PAGES-1)))
+		if ((pfn & (pageblock_nr_pages-1)))
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
@@ -3307,8 +3282,8 @@ static void __meminit calculate_node_tot
 #ifndef CONFIG_SPARSEMEM
 /*
  * Calculate the size of the zone->blockflags rounded to an unsigned long
- * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
- * Then figure 1 NR_PAGEBLOCK_BITS worth of bits per MAX_ORDER-1, finally
+ * Start by making sure zonesize is a multiple of pageblock_order by rounding
+ * up. Then use 1 NR_PAGEBLOCK_BITS worth of bits per pageblock, finally
  * round what is now in bits to nearest long in bits, then return it in
  * bytes.
  */
@@ -3316,8 +3291,8 @@ static unsigned long __init usemap_size(
 {
 	unsigned long usemapsize;
 
-	usemapsize = roundup(zonesize, MAX_ORDER_NR_PAGES);
-	usemapsize = usemapsize >> (MAX_ORDER-1);
+	usemapsize = roundup(zonesize, pageblock_nr_pages);
+	usemapsize = usemapsize >> pageblock_order;
 	usemapsize *= NR_PAGEBLOCK_BITS;
 	usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
 
@@ -3339,6 +3314,26 @@ static void inline setup_usemap(struct p
 				struct zone *zone, unsigned long zonesize) {}
 #endif /* CONFIG_SPARSEMEM */
 
+#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
+/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
+void __init set_pageblock_order(unsigned int order)
+{
+	/* Check that pageblock_nr_pages has not already been setup */
+	if (pageblock_order)
+		return;
+
+	/*
+	 * Assume the largest contiguous order of interest is a huge page.
+	 * This value may be variable depending on boot parameters on IA64
+	 */
+	pageblock_order = order;
+}
+#else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+void __init set_pageblock_order(unsigned int order)
+{
+}
+#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -3419,6 +3414,7 @@ static void __meminit free_area_init_cor
 		if (!size)
 			continue;
 
+		set_pageblock_order(HUGETLB_PAGE_ORDER);
 		setup_usemap(pgdat, zone, size);
 		ret = init_currently_empty_zone(zone, zone_start_pfn,
 						size, MEMMAP_EARLY);
@@ -4345,15 +4341,15 @@ static inline int pfn_to_bitidx(struct z
 {
 #ifdef CONFIG_SPARSEMEM
 	pfn &= (PAGES_PER_SECTION-1);
-	return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #else
 	pfn = pfn - zone->zone_start_pfn;
-	return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #endif /* CONFIG_SPARSEMEM */
 }
 
 /**
- * get_pageblock_flags_group - Return the requested group of flags for the MAX_ORDER_NR_PAGES block of pages
+ * get_pageblock_flags_group - Return the requested group of flags for the pageblock_nr_pages block of pages
  * @page: The page within the block of interest
  * @start_bitidx: The first bit of interest to retrieve
  * @end_bitidx: The last bit of interest
@@ -4381,7 +4377,7 @@ unsigned long get_pageblock_flags_group(
 }
 
 /**
- * set_pageblock_flags_group - Set the requested group of flags for a MAX_ORDER_NR_PAGES block of pages
+ * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages
  * @page: The page within the block of interest
  * @start_bitidx: The first bit of interest
  * @end_bitidx: The last bit of interest
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-clean/mm/vmstat.c linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/vmstat.c
--- linux-2.6.22-rc2-mm1-clean/mm/vmstat.c	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/vmstat.c	2007-05-28 14:09:40.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include "internal.h"
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -397,6 +398,13 @@ void zone_statistics(struct zonelist *zo
 
 #include <linux/seq_file.h>
 
+static char * const migratetype_names[MIGRATE_TYPES] = {
+	"Unmovable",
+	"Reclaimable",
+	"Movable",
+	"Reserve",
+};
+
 static void *frag_start(struct seq_file *m, loff_t *pos)
 {
 	pg_data_t *pgdat;
@@ -421,28 +429,236 @@ static void frag_stop(struct seq_file *m
 {
 }
 
-/*
- * This walks the free areas for each zone.
- */
-static int frag_show(struct seq_file *m, void *arg)
+/* Walk all the zones in a node and print using a callback */
+static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
+		void (*print)(struct seq_file *m, pg_data_t *, struct zone *))
 {
-	pg_data_t *pgdat = (pg_data_t *)arg;
 	struct zone *zone;
 	struct zone *node_zones = pgdat->node_zones;
 	unsigned long flags;
-	int order;
 
 	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
 		if (!populated_zone(zone))
 			continue;
 
 		spin_lock_irqsave(&zone->lock, flags);
-		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-		for (order = 0; order < MAX_ORDER; ++order)
-			seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+		print(m, pgdat, zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+}
+
+static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
+						struct zone *zone)
+{
+	int order;
+
+	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+	for (order = 0; order < MAX_ORDER; ++order)
+		seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+	seq_putc(m, '\n');
+}
+
+/*
+ * This walks the free areas for each zone.
+ */
+static int frag_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+	walk_zones_in_node(m, pgdat, frag_show_print);
+	return 0;
+}
+
+static void pagetypeinfo_showfree_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	int order, mtype;
+
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+		seq_printf(m, "Node %4d, zone %8s, type %12s ",
+					pgdat->node_id,
+					zone->name,
+					migratetype_names[mtype]);
+		for (order = 0; order < MAX_ORDER; ++order) {
+			unsigned long freecount = 0;
+			struct free_area *area;
+			struct list_head *curr;
+
+			area = &(zone->free_area[order]);
+
+			list_for_each(curr, &area->free_list[mtype])
+				freecount++;
+			seq_printf(m, "%6lu ", freecount);
+		}
 		seq_putc(m, '\n');
 	}
+}
+
+/* Print out the free pages at each order for each migatetype */
+static int pagetypeinfo_showfree(struct seq_file *m, void *arg)
+{
+	int order;
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	/* Print header */
+	seq_printf(m, "%-43s ", "Free pages count per migrate type at order");
+	for (order = 0; order < MAX_ORDER; ++order)
+		seq_printf(m, "%6d ", order);
+	seq_putc(m, '\n');
+
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showfree_print);
+
+	return 0;
+}
+
+static void pagetypeinfo_showblockcount_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	int mtype;
+	unsigned long pfn;
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = start_pfn + zone->spanned_pages;
+	unsigned long count[MIGRATE_TYPES] = { 0, };
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		mtype = get_pageblock_migratetype(page);
+
+		count[mtype]++;
+	}
+
+	/* Print counts */
+	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+		seq_printf(m, "%12lu ", count[mtype]);
+	seq_putc(m, '\n');
+}
+
+/* Print out the free pages at each order for each migratetype */
+static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg)
+{
+	int mtype;
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	seq_printf(m, "\n%-23s", "Number of blocks type ");
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+		seq_printf(m, "%12s ", migratetype_names[mtype]);
+	seq_putc(m, '\n');
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showblockcount_print);
+
+	return 0;
+}
+
+#ifdef CONFIG_PAGE_OWNER
+static void pagetypeinfo_showmixedcount_print(struct seq_file *m,
+							pg_data_t *pgdat,
+							struct zone *zone)
+{
+	int mtype, pagetype;
+	unsigned long pfn;
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = start_pfn + zone->spanned_pages;
+	unsigned long count[MIGRATE_TYPES] = { 0, };
+
+	/* Align PFNs to pageblock_nr_pages boundary */
+	pfn = start_pfn & ~(pageblock_nr_pages-1);
+
+	/*
+	 * Walk the zone in pageblock_nr_pages steps. If a page block spans
+	 * a zone boundary, it will be double counted between zones. This does
+	 * not matter as the mixed block count will still be correct
+	 */
+	for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+		unsigned long offset = 0;
+
+		/* Do not read before the zone start, use a valid page */
+		if (pfn < start_pfn)
+			offset = start_pfn - pfn;
+
+		if (!pfn_valid(pfn + offset))
+			continue;
+
+		page = pfn_to_page(pfn + offset);
+		mtype = get_pageblock_migratetype(page);
+
+		/* Check the block for bad migrate types */
+		for (; offset < pageblock_nr_pages; offset++) {
+			/* Do not past the end of the zone */
+			if (pfn + offset >= end_pfn)
+				break;
+
+			if (!pfn_valid_within(pfn + offset))
+				continue;
+
+			page = pfn_to_page(pfn + offset);
+
+			/* Skip free pages */
+			if (PageBuddy(page)) {
+				offset += (1UL << page_order(page)) - 1UL;
+				continue;
+			}
+			if (page->order < 0)
+				continue;
+
+			pagetype = allocflags_to_migratetype(page->gfp_mask);
+			if (pagetype != mtype) {
+				count[mtype]++;
+				break;
+			}
+
+			/* Move to end of this allocation */
+			offset += (1 << page->order) - 1;
+		}
+	}
+
+	/* Print counts */
+	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+		seq_printf(m, "%12lu ", count[mtype]);
+	seq_putc(m, '\n');
+}
+#endif /* CONFIG_PAGE_OWNER */
+
+/*
+ * Print out the number of pageblocks for each migratetype that contain pages
+ * of other types. This gives an indication of how well fallbacks are being
+ * contained by rmqueue_fallback(). It requires information from PAGE_OWNER
+ * to determine what is going on
+ */
+static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
+{
+#ifdef CONFIG_PAGE_OWNER
+	int mtype;
+
+	seq_printf(m, "\n%-23s", "Number of mixed blocks ");
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+		seq_printf(m, "%12s ", migratetype_names[mtype]);
+	seq_putc(m, '\n');
+
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showmixedcount_print);
+#endif /* CONFIG_PAGE_OWNER */
+}
+
+/*
+ * This prints out statistics in relation to grouping pages by mobility.
+ * It is expensive to collect so do not constantly read the file.
+ */
+static int pagetypeinfo_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	seq_printf(m, "Page block order: %d\n", pageblock_order);
+	seq_printf(m, "Pages per block:  %lu\n", pageblock_nr_pages);
+	seq_putc(m, '\n');
+	pagetypeinfo_showfree(m, pgdat);
+	pagetypeinfo_showblockcount(m, pgdat);
+	pagetypeinfo_showmixedcount(m, pgdat);
+
 	return 0;
 }
 
@@ -453,6 +669,13 @@ const struct seq_operations fragmentatio
 	.show	= frag_show,
 };
 
+const struct seq_operations pagetypeinfo_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= pagetypeinfo_show,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -531,84 +754,78 @@ static const char * const vmstat_text[] 
 #endif
 };
 
-/*
- * Output information about zones in @pgdat.
- */
-static int zoneinfo_show(struct seq_file *m, void *arg)
+static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
+							struct zone *zone)
 {
-	pg_data_t *pgdat = arg;
-	struct zone *zone;
-	struct zone *node_zones = pgdat->node_zones;
-	unsigned long flags;
-
-	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; zone++) {
-		int i;
-
-		if (!populated_zone(zone))
-			continue;
-
-		spin_lock_irqsave(&zone->lock, flags);
-		seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
-		seq_printf(m,
-			   "\n  pages free     %lu"
-			   "\n        min      %lu"
-			   "\n        low      %lu"
-			   "\n        high     %lu"
-			   "\n        scanned  %lu (a: %lu i: %lu)"
-			   "\n        spanned  %lu"
-			   "\n        present  %lu",
-			   zone_page_state(zone, NR_FREE_PAGES),
-			   zone->pages_min,
-			   zone->pages_low,
-			   zone->pages_high,
-			   zone->pages_scanned,
-			   zone->nr_scan_active, zone->nr_scan_inactive,
-			   zone->spanned_pages,
-			   zone->present_pages);
+	int i;
+	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+	seq_printf(m,
+		   "\n  pages free     %lu"
+		   "\n        min      %lu"
+		   "\n        low      %lu"
+		   "\n        high     %lu"
+		   "\n        scanned  %lu (a: %lu i: %lu)"
+		   "\n        spanned  %lu"
+		   "\n        present  %lu",
+		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone->pages_min,
+		   zone->pages_low,
+		   zone->pages_high,
+		   zone->pages_scanned,
+		   zone->nr_scan_active, zone->nr_scan_inactive,
+		   zone->spanned_pages,
+		   zone->present_pages);
 
-		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-			seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
-					zone_page_state(zone, i));
+	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
+				zone_page_state(zone, i));
 
-		seq_printf(m,
-			   "\n        protection: (%lu",
-			   zone->lowmem_reserve[0]);
-		for (i = 1; i < ARRAY_SIZE(zone->lowmem_reserve); i++)
-			seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
-		seq_printf(m,
-			   ")"
-			   "\n  pagesets");
-		for_each_online_cpu(i) {
-			struct per_cpu_pageset *pageset;
-			int j;
-
-			pageset = zone_pcp(zone, i);
-			for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) {
-				seq_printf(m,
-					   "\n    cpu: %i pcp: %i"
-					   "\n              count: %i"
-					   "\n              high:  %i"
-					   "\n              batch: %i",
-					   i, j,
-					   pageset->pcp[j].count,
-					   pageset->pcp[j].high,
-					   pageset->pcp[j].batch);
+	seq_printf(m,
+		   "\n        protection: (%lu",
+		   zone->lowmem_reserve[0]);
+	for (i = 1; i < ARRAY_SIZE(zone->lowmem_reserve); i++)
+		seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
+	seq_printf(m,
+		   ")"
+		   "\n  pagesets");
+	for_each_online_cpu(i) {
+		struct per_cpu_pageset *pageset;
+		int j;
+
+		pageset = zone_pcp(zone, i);
+		for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) {
+			seq_printf(m,
+				   "\n    cpu: %i pcp: %i"
+				   "\n              count: %i"
+				   "\n              high:  %i"
+				   "\n              batch: %i",
+				   i, j,
+				   pageset->pcp[j].count,
+				   pageset->pcp[j].high,
+				   pageset->pcp[j].batch);
 			}
 #ifdef CONFIG_SMP
-			seq_printf(m, "\n  vm stats threshold: %d",
-					pageset->stat_threshold);
+		seq_printf(m, "\n  vm stats threshold: %d",
+				pageset->stat_threshold);
 #endif
-		}
-		seq_printf(m,
-			   "\n  all_unreclaimable: %u"
-			   "\n  prev_priority:     %i"
-			   "\n  start_pfn:         %lu",
-			   zone->all_unreclaimable,
-			   zone->prev_priority,
-			   zone->zone_start_pfn);
-		spin_unlock_irqrestore(&zone->lock, flags);
-		seq_putc(m, '\n');
 	}
+	seq_printf(m,
+		   "\n  all_unreclaimable: %u"
+		   "\n  prev_priority:     %i"
+		   "\n  start_pfn:         %lu",
+		   zone->all_unreclaimable,
+		   zone->prev_priority,
+		   zone->zone_start_pfn);
+	seq_putc(m, '\n');
+}
+
+/*
+ * Output information about zones in @pgdat.
+ */
+static int zoneinfo_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+	walk_zones_in_node(m, pgdat, zoneinfo_show_print);
 	return 0;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel
  2007-05-29 17:36 [PATCH 0/7] [RFC] Memory Compaction v1 Mel Gorman
  2007-05-29 17:36 ` [PATCH 1/7] Roll-up patch of what has been sent already Mel Gorman
@ 2007-05-29 17:36 ` Mel Gorman
  2007-05-30  2:42   ` KAMEZAWA Hiroyuki
  2007-05-29 17:37 ` [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 17:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman, kamezawa.hiroyu, clameter

This is a patch from KAMEZAWA Hiroyuki for using page migration on remote
processes without races. This patch is still undergoing development and
is expected to be a pre-requisite for both memory hot-remove and memory
compaction.

Changelog from KAMEZAWA Hiroyuki version
o Removed the MIGRATION_BY_KERNEL as a compile-time option

=====

This patch adds a feature that the kernel can migrate user pages by its own
context.

Now, sys_migrate(), a system call to migrate pages, works well.
When we want to migrate pages by some kernel codes, we have 2 approachs.

(a) acquire some mm->sem of a mapper of the target page.
(b) avoid race condition by additional check codes.

This patch implements (b) and adds following 2 codes.

1. delay freeing anon_vma while a page which belongs to it is migrated.
2. check page_mapped() before calling try_to_unmap().
 
Maybe more check will be needed. At least, this patch's migration_nocntext()
works well under heavy memory pressure on my environment.

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasonori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 include/linux/migrate.h |    3 ++-
 include/linux/rmap.h    |   24 ++++++++++++++++++++++++
 mm/migrate.c            |   42 +++++++++++++++++++++++++++++++++---------
 mm/rmap.c               |   23 +++++++++++++++++++++++
 4 files changed, 82 insertions(+), 10 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/migrate.h linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/migrate.h
--- linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/migrate.h	2007-05-19 05:06:17.000000000 +0100
+++ linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/migrate.h	2007-05-28 14:11:32.000000000 +0100
@@ -30,7 +30,8 @@ extern int putback_lru_pages(struct list
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long);
-
+extern int migrate_pages_nocontext(struct list_head *l, new_page_t x,
+					unsigned long);
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/rmap.h linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/rmap.h
--- linux-2.6.22-rc2-mm1-001_lameter-v4r4/include/linux/rmap.h	2007-05-19 05:06:17.000000000 +0100
+++ linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/rmap.h	2007-05-28 14:11:32.000000000 +0100
@@ -26,12 +26,16 @@
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+#ifdef CONFIG_MIGRATION
+	atomic_t	ref;	/* special refcnt for migration */
+#endif
 };
 
 #ifdef CONFIG_MMU
 
 extern struct kmem_cache *anon_vma_cachep;
 
+#ifndef CONFIG_MIGRATION
 static inline struct anon_vma *anon_vma_alloc(void)
 {
 	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
@@ -41,6 +45,26 @@ static inline void anon_vma_free(struct 
 {
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
+#define anon_vma_hold(page)	do{}while(0)
+#define anon_vma_release(anon)	do{}while(0)
+
+#else /* CONFIG_MIGRATION */
+static inline struct anon_vma *anon_vma_alloc(void)
+{
+	struct anon_vma *ret = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	if (ret)
+		atomic_set(&ret->ref, 0);
+	return ret;
+}
+static inline void anon_vma_free(struct anon_vma *anon_vma)
+{
+	if (atomic_read(&anon_vma->ref) == 0)
+		kmem_cache_free(anon_vma_cachep, anon_vma);
+}
+extern struct anon_vma *anon_vma_hold(struct page *page);
+extern void anon_vma_release(struct anon_vma *anon_vma);
+
+#endif /* CONFIG_MIGRATION */
 
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/migrate.c linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/migrate.c
--- linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/migrate.c	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/migrate.c	2007-05-28 14:11:32.000000000 +0100
@@ -607,11 +607,12 @@ static int move_to_new_page(struct page 
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force)
+			struct page *page, int force, int context)
 {
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
+	struct anon_vma *anon_vma = NULL;
 
 	if (!newpage)
 		return -ENOMEM;
@@ -633,15 +634,22 @@ static int unmap_and_move(new_page_t get
 		wait_on_page_writeback(page);
 	}
 
-	/*
-	 * Establish migration ptes or remove ptes
-	 */
-	try_to_unmap(page, 1);
+	if (PageAnon(page) && context)
+		/* hold this anon_vma until page migration ends */
+		anon_vma = anon_vma_hold(page);
+
+	if (page_mapped(page))
+		try_to_unmap(page, 1);
+
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);
 
-	if (rc)
+	if (rc) {
 		remove_migration_ptes(page, page);
+	}
+
+	if (anon_vma)
+		anon_vma_release(anon_vma);
 
 unlock:
 	unlock_page(page);
@@ -686,8 +694,8 @@ move_newpage:
  *
  * Return: Number of pages not migrated or error code.
  */
-int migrate_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private)
+int __migrate_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private, int context)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -707,7 +715,7 @@ int migrate_pages(struct list_head *from
 			cond_resched();
 
 			rc = unmap_and_move(get_new_page, private,
-						page, pass > 2);
+						page, pass > 2, context);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -737,6 +745,22 @@ out:
 	return nr_failed + retry;
 }
 
+int migrate_pages(struct list_head *from,
+	new_page_t get_new_page, unsigned long private)
+{
+	return __migrate_pages(from, get_new_page, private, 0);
+}
+
+/*
+ * When page migration is issued by the kernel itself without page mapper's
+ * mm->sem, we have to be more careful to do page migration.
+ */
+int migrate_pages_nocontext(struct list_head *from,
+	new_page_t get_new_page, unsigned long private)
+{
+	return __migrate_pages(from, get_new_page, private, 1);
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Move a list of individual pages
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/rmap.c linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/rmap.c
--- linux-2.6.22-rc2-mm1-001_lameter-v4r4/mm/rmap.c	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/rmap.c	2007-05-28 14:11:32.000000000 +0100
@@ -204,6 +204,29 @@ static void page_unlock_anon_vma(struct 
 	rcu_read_unlock();
 }
 
+#ifdef CONFIG_MIGRATION
+struct anon_vma *anon_vma_hold(struct page *page) {
+	struct anon_vma *anon_vma;
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return NULL;
+	atomic_set(&anon_vma->ref, 1);
+	spin_unlock(&anon_vma->lock);
+	return anon_vma;
+}
+
+void anon_vma_release(struct anon_vma *anon_vma)
+{
+	int empty;
+	spin_lock(&anon_vma->lock);
+	atomic_set(&anon_vma->ref, 0);
+	empty = list_empty(&anon_vma->head);
+	spin_unlock(&anon_vma->lock);
+	if (empty)
+		anon_vma_free(anon_vma);
+}
+#endif /* CONFIG_MIGRATION */
+
 /*
  * At what user virtual address is page expected in vma?
  */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel
  2007-05-29 17:36 ` [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel Mel Gorman
@ 2007-05-30  2:42   ` KAMEZAWA Hiroyuki
  2007-05-30  2:47     ` Christoph Lameter
  2007-05-30 19:57     ` Hugh Dickins
  0 siblings, 2 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-30  2:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, clameter

On Tue, 29 May 2007 18:36:50 +0100 (IST)
Mel Gorman <mel@csn.ul.ie> wrote:

> 
> This is a patch from KAMEZAWA Hiroyuki for using page migration on remote
> processes without races. This patch is still undergoing development and
> is expected to be a pre-requisite for both memory hot-remove and memory
> compaction.
> 
> Changelog from KAMEZAWA Hiroyuki version
> o Removed the MIGRATION_BY_KERNEL as a compile-time option
> 
This is my latest version.
(not tested because caller of this function is being rewritten now..)

I'll move this patch to the top of my series and prepare to post this patch as
a single patch.

==
page migration by kernel v2.

Changelog V1 -> V2
 *removed atomic ops.
 *removes changes in anon_vma_free() and add check before calling it.
 *reflected feedback of review.
 *remove CONFIG_MIGRATION_BY_KERNEL

In usual, migrate_pages(page,,) is called with holoding mm->sem by systemcall.
(mm here is a mm_struct which maps the migration target page.)
This semaphore helps avoiding some race conditions.

But, if we want to migrate a page by some kernel codes, we have to avoid
some races. This patch adds check code for following race condition.

1. A page which is not mapped can be target of migration. Then, we have
   to check page_mapped() before calling try_to_unmap().

2. We can't trust page->mapping if page_mapcount() can goes down to 0.
   But when we map newpage back to original ptes, we have to access
   anon_vma from a page, which page_mapcount() is 0.
   This patch adds a special refcnt to anon_vma, which is synced by
   anon_vma->lock and delays freeing anon_vma.

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


---
 include/linux/migrate.h |    5 ++++-
 include/linux/rmap.h    |   11 +++++++++++
 mm/migrate.c            |   35 +++++++++++++++++++++++++++++------
 mm/rmap.c               |   36 +++++++++++++++++++++++++++++++++++-
 4 files changed, 79 insertions(+), 8 deletions(-)

Index: linux-2.6.22-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.22-rc2-mm1.orig/mm/migrate.c
+++ linux-2.6.22-rc2-mm1/mm/migrate.c
@@ -607,11 +607,12 @@ static int move_to_new_page(struct page 
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force)
+			struct page *page, int force, int nocontext)
 {
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
+	struct anon_vma *anon_vma = NULL;
 
 	if (!newpage)
 		return -ENOMEM;
@@ -632,17 +633,23 @@ static int unmap_and_move(new_page_t get
 			goto unlock;
 		wait_on_page_writeback(page);
 	}
-
+	/* hold this anon_vma until page migration ends */
+	if (nocontext && PageAnon(page) && page_mapped(page))
+		anon_vma = anon_vma_hold(page);
 	/*
 	 * Establish migration ptes or remove ptes
 	 */
-	try_to_unmap(page, 1);
+	if (page_mapped(page))
+		try_to_unmap(page, 1);
+
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);
 
 	if (rc)
 		remove_migration_ptes(page, page);
 
+	anon_vma_release(anon_vma);
+
 unlock:
 	unlock_page(page);
 
@@ -686,8 +693,8 @@ move_newpage:
  *
  * Return: Number of pages not migrated or error code.
  */
-int migrate_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private)
+int __migrate_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private, int nocontext)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -707,7 +714,7 @@ int migrate_pages(struct list_head *from
 			cond_resched();
 
 			rc = unmap_and_move(get_new_page, private,
-						page, pass > 2);
+						page, pass > 2, nocontext);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -737,6 +744,22 @@ out:
 	return nr_failed + retry;
 }
 
+int migrate_pages(struct list_head *from,
+	new_page_t get_new_page, unsigned long private)
+{
+	return __migrate_pages(from, get_new_page, private, 0);
+}
+
+/*
+ * When page migration is issued by the kernel itself without page mapper's
+ * mm->sem, we have to be more careful to do page migration.
+ */
+int migrate_pages_nocontext(struct list_head *from,
+	new_page_t get_new_page, unsigned long private)
+{
+	return __migrate_pages(from, get_new_page, private, 1);
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Move a list of individual pages
Index: linux-2.6.22-rc2-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.22-rc2-mm1.orig/include/linux/rmap.h
+++ linux-2.6.22-rc2-mm1/include/linux/rmap.h
@@ -26,6 +26,9 @@
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+#ifdef CONFIG_MIGRATION
+	int	ref;	/* special refcnt for migration */
+#endif
 };
 
 #ifdef CONFIG_MMU
@@ -42,6 +45,14 @@ static inline void anon_vma_free(struct 
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+#ifdef  CONFIG_MIGRATION
+extern struct anon_vma *anon_vma_hold(struct page *page);
+extern void anon_vma_release(struct anon_vma *anon_vma);
+#else
+#define anon_vma_hold(page)     do{}while(0)
+#define anon_vma_release(anon)  do{}while(0)
+#endif
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
Index: linux-2.6.22-rc2-mm1/mm/rmap.c
===================================================================
--- linux-2.6.22-rc2-mm1.orig/mm/rmap.c
+++ linux-2.6.22-rc2-mm1/mm/rmap.c
@@ -90,6 +90,9 @@ int anon_vma_prepare(struct vm_area_stru
 			anon_vma = anon_vma_alloc();
 			if (unlikely(!anon_vma))
 				return -ENOMEM;
+#ifdef CONFIG_MIGRATION
+			anon_vma->ref = 0;
+#endif
 			allocated = anon_vma;
 			locked = NULL;
 		}
@@ -150,9 +153,13 @@ void anon_vma_unlink(struct vm_area_stru
 	spin_lock(&anon_vma->lock);
 	validate_anon_vma(vma);
 	list_del(&vma->anon_vma_node);
-
 	/* We must garbage collect the anon_vma if it's empty */
 	empty = list_empty(&anon_vma->head);
+#ifdef CONFIG_MIGRATION
+	/* this means migrate_pages() has reference to this */
+	if (anon_vma->ref)
+		empty = 0;
+#endif
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -203,6 +210,33 @@ static void page_unlock_anon_vma(struct 
 	spin_unlock(&anon_vma->lock);
 	rcu_read_unlock();
 }
+#ifdef CONFIG_MIGRATION
+struct anon_vma *anon_vma_hold(struct page *page) {
+	struct anon_vma *anon_vma = NULL;
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return NULL;
+	if (!list_empty(&anon_vma->head))
+		anon_vma->ref++;
+	spin_unlock(&anon_vma->lock);
+	return anon_vma;
+}
+
+void anon_vma_release(struct anon_vma *anon_vma)
+{
+	int empty;
+	if (!anon_vma) /* noting to do */
+		return;
+	spin_lock(&anon_vma->lock);
+	empty = list_empty(&anon_vma->head);
+	anon_vma->ref--;
+	if (!anon_vma->ref)
+		empty = 0;
+	spin_unlock(&anon_vma->lock);
+	if (empty)
+		anon_vma_free(anon_vma);
+}
+#endif
 
 /*
  * At what user virtual address is page expected in vma?
Index: linux-2.6.22-rc2-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.22-rc2-mm1.orig/include/linux/migrate.h
+++ linux-2.6.22-rc2-mm1/include/linux/migrate.h
@@ -30,7 +30,10 @@ extern int putback_lru_pages(struct list
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long);
-
+#ifdef CONFIG_MIGRATION_BY_KERNEL
+extern int migrate_pages_nocontext(struct list_head *l, new_page_t x,
+					unsigned long);
+#endif
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel
  2007-05-30  2:42   ` KAMEZAWA Hiroyuki
@ 2007-05-30  2:47     ` Christoph Lameter
  2007-05-30 19:57     ` Hugh Dickins
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2007-05-30  2:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Mel Gorman, linux-mm, linux-kernel

Looks good. I will ack it when I have a chance to test either your or 
Mel's patchset. Likely after the next iteration.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel
  2007-05-30  2:42   ` KAMEZAWA Hiroyuki
  2007-05-30  2:47     ` Christoph Lameter
@ 2007-05-30 19:57     ` Hugh Dickins
  2007-05-30 20:07       ` Christoph Lameter
  2007-05-31 12:26       ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 21+ messages in thread
From: Hugh Dickins @ 2007-05-30 19:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Mel Gorman, linux-mm, linux-kernel, clameter

On Wed, 30 May 2007, KAMEZAWA Hiroyuki wrote:
> This is my latest version.
> (not tested because caller of this function is being rewritten now..)
> 
> I'll move this patch to the top of my series and prepare to post this patch as
> a single patch.
> 
> ==
> page migration by kernel v2.
> 
> Changelog V1 -> V2
>  *removed atomic ops.
>  *removes changes in anon_vma_free() and add check before calling it.
>  *reflected feedback of review.
>  *remove CONFIG_MIGRATION_BY_KERNEL
> 
> In usual, migrate_pages(page,,) is called with holoding mm->sem by systemcall.
> (mm here is a mm_struct which maps the migration target page.)
> This semaphore helps avoiding some race conditions.
> 
> But, if we want to migrate a page by some kernel codes, we have to avoid
> some races. This patch adds check code for following race condition.
> 
> 1. A page which is not mapped can be target of migration. Then, we have
>    to check page_mapped() before calling try_to_unmap().
> 
> 2. We can't trust page->mapping if page_mapcount() can goes down to 0.
>    But when we map newpage back to original ptes, we have to access
>    anon_vma from a page, which page_mapcount() is 0.
>    This patch adds a special refcnt to anon_vma, which is synced by
>    anon_vma->lock and delays freeing anon_vma.
> 
> Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I've taken a look at last.  It looks like a good fix to a real problem,
but may I suggest a simpler version?  The anon_vma isn't usually held
by a refcount, but by having a vma on its linked list: why not just
put a dummy vma into that linked list?  No need to add a refcount.

The NUMA shmem_alloc_page already uses a dummy vma on its stack,
so you can reasonably declare a vm_area_struct on unmap_and_move's
stack.  No need for a special anon_vma_release, anon_vma_unlink
should do fine.  I've not reworked your whole patch, but show
what I think the mm/rmap.c part would be at the bottom.

A few comments on your version below...

> 
> 
> ---
>  include/linux/migrate.h |    5 ++++-
>  include/linux/rmap.h    |   11 +++++++++++
>  mm/migrate.c            |   35 +++++++++++++++++++++++++++++------
>  mm/rmap.c               |   36 +++++++++++++++++++++++++++++++++++-
>  4 files changed, 79 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6.22-rc2-mm1/mm/migrate.c
> ===================================================================
> --- linux-2.6.22-rc2-mm1.orig/mm/migrate.c
> +++ linux-2.6.22-rc2-mm1/mm/migrate.c
> @@ -607,11 +607,12 @@ static int move_to_new_page(struct page 
>   * to the newly allocated page in newpage.
>   */
>  static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> -			struct page *page, int force)
> +			struct page *page, int force, int nocontext)
>  {

An "int context" would be a lot better than the negative "int nocontext";
even better would be "int holds_mmap_sem".  Or even skip the additional
argument completely, use the anon_vma_hold method always without relying
on whether or not mmap_sem is held.  I don't know how significant it is
to avoid extra locking here: on the one hand we like to avoid unnecessary
locking; on the other hand there's probably a thousand commoner places in
the kernel where we could pass down an arg to say, actually you won't
need to lock in such and such a case.

>  	int rc = 0;
>  	int *result = NULL;
>  	struct page *newpage = get_new_page(page, private, &result);
> +	struct anon_vma *anon_vma = NULL;
>  
>  	if (!newpage)
>  		return -ENOMEM;
> @@ -632,17 +633,23 @@ static int unmap_and_move(new_page_t get
>  			goto unlock;
>  		wait_on_page_writeback(page);
>  	}
> -
> +	/* hold this anon_vma until page migration ends */
> +	if (nocontext && PageAnon(page) && page_mapped(page))
> +		anon_vma = anon_vma_hold(page);
>  	/*
>  	 * Establish migration ptes or remove ptes
>  	 */
> -	try_to_unmap(page, 1);
> +	if (page_mapped(page))
> +		try_to_unmap(page, 1);
> +

All these preliminary tests: yes, I suppose they avoid unnecessary
locking, so I guess they're good; but it should work without them.

>  	if (!page_mapped(page))
>  		rc = move_to_new_page(newpage, page);
>  
>  	if (rc)
>  		remove_migration_ptes(page, page);
>  
> +	anon_vma_release(anon_vma);
> +
>  unlock:
>  	unlock_page(page);
>  
> @@ -686,8 +693,8 @@ move_newpage:
>   *
>   * Return: Number of pages not migrated or error code.
>   */
> -int migrate_pages(struct list_head *from,
> -		new_page_t get_new_page, unsigned long private)
> +int __migrate_pages(struct list_head *from,
> +		new_page_t get_new_page, unsigned long private, int nocontext)
>  {

Remarks on nocontext as above: mmm, I think keep the patch small
and don't add that extra argument at all.

>  	int retry = 1;
>  	int nr_failed = 0;
> @@ -707,7 +714,7 @@ int migrate_pages(struct list_head *from
>  			cond_resched();
>  
>  			rc = unmap_and_move(get_new_page, private,
> -						page, pass > 2);
> +						page, pass > 2, nocontext);
>  
>  			switch(rc) {
>  			case -ENOMEM:
> @@ -737,6 +744,22 @@ out:
>  	return nr_failed + retry;
>  }
>  
> +int migrate_pages(struct list_head *from,
> +	new_page_t get_new_page, unsigned long private)
> +{
> +	return __migrate_pages(from, get_new_page, private, 0);
> +}
> +
> +/*
> + * When page migration is issued by the kernel itself without page mapper's
> + * mm->sem, we have to be more careful to do page migration.
> + */
> +int migrate_pages_nocontext(struct list_head *from,
> +	new_page_t get_new_page, unsigned long private)
> +{
> +	return __migrate_pages(from, get_new_page, private, 1);
> +}
> +
>  #ifdef CONFIG_NUMA
>  /*
>   * Move a list of individual pages
> Index: linux-2.6.22-rc2-mm1/include/linux/rmap.h
> ===================================================================
> --- linux-2.6.22-rc2-mm1.orig/include/linux/rmap.h
> +++ linux-2.6.22-rc2-mm1/include/linux/rmap.h
> @@ -26,6 +26,9 @@
>  struct anon_vma {
>  	spinlock_t lock;	/* Serialize access to vma list */
>  	struct list_head head;	/* List of private "related" vmas */
> +#ifdef CONFIG_MIGRATION
> +	int	ref;	/* special refcnt for migration */
> +#endif
>  };
>  
>  #ifdef CONFIG_MMU
> @@ -42,6 +45,14 @@ static inline void anon_vma_free(struct 
>  	kmem_cache_free(anon_vma_cachep, anon_vma);
>  }
>  
> +#ifdef  CONFIG_MIGRATION
> +extern struct anon_vma *anon_vma_hold(struct page *page);
> +extern void anon_vma_release(struct anon_vma *anon_vma);
> +#else
> +#define anon_vma_hold(page)     do{}while(0)
> +#define anon_vma_release(anon)  do{}while(0)

Rather than change those to "do {} while (0)", to which others
will ask for static inlines, just delete them, can't you -
they're simply not needed in the !CONFIG_MIGRATION case, right?

> +#endif
> +
>  static inline void anon_vma_lock(struct vm_area_struct *vma)
>  {
>  	struct anon_vma *anon_vma = vma->anon_vma;
> Index: linux-2.6.22-rc2-mm1/mm/rmap.c
> ===================================================================
> --- linux-2.6.22-rc2-mm1.orig/mm/rmap.c
> +++ linux-2.6.22-rc2-mm1/mm/rmap.c
> @@ -90,6 +90,9 @@ int anon_vma_prepare(struct vm_area_stru
>  			anon_vma = anon_vma_alloc();
>  			if (unlikely(!anon_vma))
>  				return -ENOMEM;
> +#ifdef CONFIG_MIGRATION
> +			anon_vma->ref = 0;
> +#endif
>  			allocated = anon_vma;
>  			locked = NULL;
>  		}
> @@ -150,9 +153,13 @@ void anon_vma_unlink(struct vm_area_stru
>  	spin_lock(&anon_vma->lock);
>  	validate_anon_vma(vma);
>  	list_del(&vma->anon_vma_node);
> -
>  	/* We must garbage collect the anon_vma if it's empty */
>  	empty = list_empty(&anon_vma->head);
> +#ifdef CONFIG_MIGRATION
> +	/* this means migrate_pages() has reference to this */
> +	if (anon_vma->ref)
> +		empty = 0;
> +#endif
>  	spin_unlock(&anon_vma->lock);
>  
>  	if (empty)
> @@ -203,6 +210,33 @@ static void page_unlock_anon_vma(struct 
>  	spin_unlock(&anon_vma->lock);
>  	rcu_read_unlock();
>  }
> +#ifdef CONFIG_MIGRATION
> +struct anon_vma *anon_vma_hold(struct page *page) {
> +	struct anon_vma *anon_vma = NULL;
> +	anon_vma = page_lock_anon_vma(page);
> +	if (!anon_vma)
> +		return NULL;
> +	if (!list_empty(&anon_vma->head))
> +		anon_vma->ref++;
> +	spin_unlock(&anon_vma->lock);
> +	return anon_vma;
> +}
> +
> +void anon_vma_release(struct anon_vma *anon_vma)
> +{
> +	int empty;
> +	if (!anon_vma) /* noting to do */
> +		return;
> +	spin_lock(&anon_vma->lock);
> +	empty = list_empty(&anon_vma->head);
> +	anon_vma->ref--;
> +	if (!anon_vma->ref)
> +		empty = 0;
> +	spin_unlock(&anon_vma->lock);
> +	if (empty)
> +		anon_vma_free(anon_vma);
> +}
> +#endif
>  
>  /*
>   * At what user virtual address is page expected in vma?
> Index: linux-2.6.22-rc2-mm1/include/linux/migrate.h
> ===================================================================
> --- linux-2.6.22-rc2-mm1.orig/include/linux/migrate.h
> +++ linux-2.6.22-rc2-mm1/include/linux/migrate.h
> @@ -30,7 +30,10 @@ extern int putback_lru_pages(struct list
>  extern int migrate_page(struct address_space *,
>  			struct page *, struct page *);
>  extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long);
> -
> +#ifdef CONFIG_MIGRATION_BY_KERNEL
> +extern int migrate_pages_nocontext(struct list_head *l, new_page_t x,
> +					unsigned long);
> +#endif
>  extern int fail_migrate_page(struct address_space *,
>  			struct page *, struct page *);
>  

--- 2.6.22-rc3/mm/rmap.c	2007-05-19 07:36:34.000000000 +0100
+++ linux/mm/rmap.c	2007-05-30 20:11:21.000000000 +0100
@@ -204,6 +204,23 @@ static void page_unlock_anon_vma(struct 
 	rcu_read_unlock();
 }
 
+#ifdef CONFIG_MIGRATION
+/*
+ * Insert a dummy vm_struct_struct into the page's anon_vma list of vmas,
+ * to hold it from being freed during page migration (lacking mmap_sem).
+ */
+void anon_vma_hold(struct page *page, struct vm_area_struct *holder)
+{
+	holder->anon_vma = page_lock_anon_vma(page);
+	if (holder->anon_vma) {
+		/* Make any call to vma_address() fail */
+		holder->vm_start = holder->vm_end = 0;
+		list_add_tail(&holder->anon_vma_node, &holder->anon_vma->head);
+		page_unlock_anon_vma(holder->anon_vma);
+	}
+}
+#endif
+
 /*
  * At what user virtual address is page expected in vma?
  */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel
  2007-05-30 19:57     ` Hugh Dickins
@ 2007-05-30 20:07       ` Christoph Lameter
  2007-05-30 20:10         ` Christoph Lameter
  2007-05-31 12:26       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2007-05-30 20:07 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel

On Wed, 30 May 2007, Hugh Dickins wrote:

> I've taken a look at last.  It looks like a good fix to a real problem,
> but may I suggest a simpler version?  The anon_vma isn't usually held
> by a refcount, but by having a vma on its linked list: why not just
> put a dummy vma into that linked list?  No need to add a refcount.
> 
> The NUMA shmem_alloc_page already uses a dummy vma on its stack,
> so you can reasonably declare a vm_area_struct on unmap_and_move's
> stack.  No need for a special anon_vma_release, anon_vma_unlink
> should do fine.  I've not reworked your whole patch, but show
> what I think the mm/rmap.c part would be at the bottom.

Hummm.. shmem_alloc_pages version only uses the vma as a placeholder 
for memory policies. So we would put the page on a vma that is on the 
stack? That would mean changing the mapping of the page? Is that safe?

And then later we would be changing the mapping back to the old vma?
What guarantees that the old vma is not gone by then?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel
  2007-05-30 20:07       ` Christoph Lameter
@ 2007-05-30 20:10         ` Christoph Lameter
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2007-05-30 20:10 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel

On Wed, 30 May 2007, Christoph Lameter wrote:

> What guarantees that the old vma is not gone by then?

We would need to add the vma on the stack to the anon vma list .... Right.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel
  2007-05-30 19:57     ` Hugh Dickins
  2007-05-30 20:07       ` Christoph Lameter
@ 2007-05-31 12:26       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-31 12:26 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: mel, linux-mm, linux-kernel, clameter

On Wed, 30 May 2007 20:57:38 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> I've taken a look at last.  It looks like a good fix to a real problem,
> but may I suggest a simpler version?  The anon_vma isn't usually held
> by a refcount, but by having a vma on its linked list: why not just
> put a dummy vma into that linked list?  No need to add a refcount.
> 
> The NUMA shmem_alloc_page already uses a dummy vma on its stack,
Oh, I didn't notice that. If dummy-vma works well now, I'll use it.
thank you.

>
> >  static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> > -			struct page *page, int force)
> > +			struct page *page, int force, int nocontext)
> >  {
> 
> An "int context" would be a lot better than the negative "int nocontext";
> even better would be "int holds_mmap_sem".  Or even skip the additional
> argument completely, use the anon_vma_hold method always without relying
> on whether or not mmap_sem is held.  I don't know how significant it is
> to avoid extra locking here: on the one hand we like to avoid unnecessary
> locking; on the other hand there's probably a thousand commoner places in
> the kernel where we could pass down an arg to say, actually you won't
> need to lock in such and such a case.
Hmm, ok. I'd like to try make things simpler.

> 
> >  	int rc = 0;
> >  	int *result = NULL;
> >  	struct page *newpage = get_new_page(page, private, &result);
> > +	struct anon_vma *anon_vma = NULL;
> >  
> >  	if (!newpage)
> >  		return -ENOMEM;
> > @@ -632,17 +633,23 @@ static int unmap_and_move(new_page_t get
> >  			goto unlock;
> >  		wait_on_page_writeback(page);
> >  	}
> > -
> > +	/* hold this anon_vma until page migration ends */
> > +	if (nocontext && PageAnon(page) && page_mapped(page))
> > +		anon_vma = anon_vma_hold(page);
> >  	/*
> >  	 * Establish migration ptes or remove ptes
> >  	 */
> > -	try_to_unmap(page, 1);
> > +	if (page_mapped(page))
> > +		try_to_unmap(page, 1);
> > +
> 
> All these preliminary tests: yes, I suppose they avoid unnecessary
> locking, so I guess they're good; but it should work without them.
> 
> >  	if (!page_mapped(page))
> >  		rc = move_to_new_page(newpage, page);
> >  
> >  	if (rc)
> >  		remove_migration_ptes(page, page);
> >  
> > +	anon_vma_release(anon_vma);
> > +
> >  unlock:
> >  	unlock_page(page);
> >  
> > @@ -686,8 +693,8 @@ move_newpage:
> >   *
> >   * Return: Number of pages not migrated or error code.
> >   */
> > -int migrate_pages(struct list_head *from,
> > -		new_page_t get_new_page, unsigned long private)
> > +int __migrate_pages(struct list_head *from,
> > +		new_page_t get_new_page, unsigned long private, int nocontext)
> >  {
> 
> Remarks on nocontext as above: mmm, I think keep the patch small
> and don't add that extra argument at all.
> 
> >  	int retry = 1;
> >  	int nr_failed = 0;
> > @@ -707,7 +714,7 @@ int migrate_pages(struct list_head *from
> >  			cond_resched();
> >  
> >  			rc = unmap_and_move(get_new_page, private,
> > -						page, pass > 2);
> > +						page, pass > 2, nocontext);
> >  
> >  			switch(rc) {
> >  			case -ENOMEM:
> > @@ -737,6 +744,22 @@ out:
> >  	return nr_failed + retry;
> >  }
> >  
> > +int migrate_pages(struct list_head *from,
> > +	new_page_t get_new_page, unsigned long private)
> > +{
> > +	return __migrate_pages(from, get_new_page, private, 0);
> > +}
> > +
> > +/*
> > + * When page migration is issued by the kernel itself without page mapper's
> > + * mm->sem, we have to be more careful to do page migration.
> > + */
> > +int migrate_pages_nocontext(struct list_head *from,
> > +	new_page_t get_new_page, unsigned long private)
> > +{
> > +	return __migrate_pages(from, get_new_page, private, 1);
> > +}
> > +
> >  #ifdef CONFIG_NUMA
> >  /*
> >   * Move a list of individual pages
> > Index: linux-2.6.22-rc2-mm1/include/linux/rmap.h
> > ===================================================================
> > --- linux-2.6.22-rc2-mm1.orig/include/linux/rmap.h
> > +++ linux-2.6.22-rc2-mm1/include/linux/rmap.h
> > @@ -26,6 +26,9 @@
> >  struct anon_vma {
> >  	spinlock_t lock;	/* Serialize access to vma list */
> >  	struct list_head head;	/* List of private "related" vmas */
> > +#ifdef CONFIG_MIGRATION
> > +	int	ref;	/* special refcnt for migration */
> > +#endif
> >  };
> >  
> >  #ifdef CONFIG_MMU
> > @@ -42,6 +45,14 @@ static inline void anon_vma_free(struct 
> >  	kmem_cache_free(anon_vma_cachep, anon_vma);
> >  }
> >  
> > +#ifdef  CONFIG_MIGRATION
> > +extern struct anon_vma *anon_vma_hold(struct page *page);
> > +extern void anon_vma_release(struct anon_vma *anon_vma);
> > +#else
> > +#define anon_vma_hold(page)     do{}while(0)
> > +#define anon_vma_release(anon)  do{}while(0)
> 
> Rather than change those to "do {} while (0)", to which others
> will ask for static inlines, just delete them, can't you -
> they're simply not needed in the !CONFIG_MIGRATION case, right?
> 
Ok. they are not necessary if !CONFIG_MIGRATION. I'll delete.

Maybe I was confused at deleting CONFIG_MIGRATON_BY_KERNEL...which needed ifdef.
 
Thank you!.
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA
  2007-05-29 17:36 [PATCH 0/7] [RFC] Memory Compaction v1 Mel Gorman
  2007-05-29 17:36 ` [PATCH 1/7] Roll-up patch of what has been sent already Mel Gorman
  2007-05-29 17:36 ` [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel Mel Gorman
@ 2007-05-29 17:37 ` Mel Gorman
  2007-05-29 18:01   ` Christoph Lameter
  2007-05-29 17:37 ` [PATCH 4/7] Introduce isolate_lru_page_nolock() as a lockless version of isolate_lru_page() Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 17:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman, kamezawa.hiroyu, clameter

CONFIG_MIGRATION currently depends on CONFIG_NUMA. move_pages() is the only
user of migration today and as this system call is only meaningful on NUMA,
it makes sense. However, memory compaction will operate within a zone and is
useful on both NUMA and non-NUMA systems. This patch allows CONFIG_MIGRATION
to be used in all memory models. To preserve existing behaviour, move_pages()
is only available when CONFIG_NUMA is set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---

 include/linux/migrate.h |   11 ++++++++---
 include/linux/mm.h      |    2 ++
 mm/Kconfig              |    5 ++++-
 mm/migrate.c            |    4 ++--
 4 files changed, 16 insertions(+), 6 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/migrate.h linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/migrate.h
--- linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/migrate.h	2007-05-28 14:11:32.000000000 +0100
+++ linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/migrate.h	2007-05-29 10:00:09.000000000 +0100
@@ -7,7 +7,7 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
-#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_SYSCALL_MOVE_PAGES
 /* Check if a vma is migratable */
 static inline int vma_migratable(struct vm_area_struct *vma)
 {
@@ -24,7 +24,14 @@ static inline int vma_migratable(struct 
 			return 0;
 	return 1;
 }
+#else
+static inline int vma_migratable(struct vm_area_struct *vma)
+{
+	return 0;
+}
+#endif
 
+#ifdef CONFIG_MIGRATION
 extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
@@ -40,8 +47,6 @@ extern int migrate_vmas(struct mm_struct
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
 #else
-static inline int vma_migratable(struct vm_area_struct *vma)
-					{ return 0; }
 
 static inline int isolate_lru_page(struct page *p, struct list_head *list)
 					{ return -ENOSYS; }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/mm.h linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/mm.h
--- linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/mm.h	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/mm.h	2007-05-28 14:13:44.000000000 +0100
@@ -242,6 +242,8 @@ struct vm_operations_struct {
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
 					unsigned long addr);
+#endif /* CONFIG_NUMA */
+#ifdef CONFIG_MIGRATION
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/Kconfig linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/Kconfig
--- linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/Kconfig	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/Kconfig	2007-05-29 09:57:23.000000000 +0100
@@ -145,13 +145,16 @@ config SPLIT_PTLOCK_CPUS
 config MIGRATION
 	bool "Page migration"
 	def_bool y
-	depends on NUMA
 	help
 	  Allows the migration of the physical location of pages of processes
 	  while the virtual addresses are not changed. This is useful for
 	  example on NUMA systems to put pages nearer to the processors accessing
 	  the page.
 
+config SYSCALL_MOVE_PAGES
+	def_bool y
+	depends on MIGRATION && NUMA
+
 config RESOURCES_64BIT
 	bool "64 bit Memory and IO resources (EXPERIMENTAL)" if (!64BIT && EXPERIMENTAL)
 	default 64BIT
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/migrate.c linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/migrate.c
--- linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/migrate.c	2007-05-28 14:11:32.000000000 +0100
+++ linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/migrate.c	2007-05-29 10:01:40.000000000 +0100
@@ -761,7 +761,7 @@ int migrate_pages_nocontext(struct list_
 	return __migrate_pages(from, get_new_page, private, 1);
 }
 
-#ifdef CONFIG_NUMA
+#ifdef CONFIG_SYSCALL_MOVE_PAGES
 /*
  * Move a list of individual pages
  */
@@ -1018,7 +1018,7 @@ out2:
 	mmput(mm);
 	return err;
 }
-#endif
+#endif /* CONFIG_SYSCALL_MOVE_PAGES */
 
 /*
  * Call migration functions in the vma_ops that may prepare

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA
  2007-05-29 17:37 ` [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA Mel Gorman
@ 2007-05-29 18:01   ` Christoph Lameter
  2007-05-29 18:21     ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2007-05-29 18:01 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, kamezawa.hiroyu

On Tue, 29 May 2007, Mel Gorman wrote:

> CONFIG_MIGRATION currently depends on CONFIG_NUMA. move_pages() is the only
> user of migration today and as this system call is only meaningful on NUMA,
> it makes sense. However, memory compaction will operate within a zone and is
> useful on both NUMA and non-NUMA systems. This patch allows CONFIG_MIGRATION
> to be used in all memory models. To preserve existing behaviour, move_pages()
> is only available when CONFIG_NUMA is set.

Hmmm... I thought I had this already set up so that it would be easy to 
switch page migration to not depend on CONFIG_NUMA. Not so it seems.

> --- linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/migrate.h	2007-05-28 14:11:32.000000000 +0100
> +++ linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/migrate.h	2007-05-29 10:00:09.000000000 +0100
> @@ -7,7 +7,7 @@
>  
>  typedef struct page *new_page_t(struct page *, unsigned long private, int **);
>  
> -#ifdef CONFIG_MIGRATION
> +#ifdef CONFIG_SYSCALL_MOVE_PAGES
>  /* Check if a vma is migratable */
>  static inline int vma_migratable(struct vm_area_struct *vma)
>  {
> @@ -24,7 +24,14 @@ static inline int vma_migratable(struct 
>  			return 0;
>  	return 1;
>  }
> +#else
> +static inline int vma_migratable(struct vm_area_struct *vma)
> +{
> +	return 0;
> +}
> +#endif

I guess we get compilation failures because of the reference to 
policy_zone here for the !NUMA case? I think vma migratable is not used at 
all if !NUMA.


> +#ifdef CONFIG_MIGRATION
>  extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
>  extern int putback_lru_pages(struct list_head *l);
>  extern int migrate_page(struct address_space *,
> @@ -40,8 +47,6 @@ extern int migrate_vmas(struct mm_struct
>  		const nodemask_t *from, const nodemask_t *to,
>  		unsigned long flags);
>  #else
> -static inline int vma_migratable(struct vm_area_struct *vma)
> -					{ return 0; }

Maybe this block is not necessary?

> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/mm.h linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/mm.h
> --- linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/mm.h	2007-05-24 10:13:34.000000000 +0100
> +++ linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/mm.h	2007-05-28 14:13:44.000000000 +0100
> @@ -242,6 +242,8 @@ struct vm_operations_struct {
>  	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
>  	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
>  					unsigned long addr);
> +#endif /* CONFIG_NUMA */
> +#ifdef CONFIG_MIGRATION
>  	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
>  		const nodemask_t *to, unsigned long flags);
>  #endif

Correct.

> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/Kconfig linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/Kconfig
> --- linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/Kconfig	2007-05-24 10:13:34.000000000 +0100
> +++ linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/Kconfig	2007-05-29 09:57:23.000000000 +0100
> @@ -145,13 +145,16 @@ config SPLIT_PTLOCK_CPUS
>  config MIGRATION
>  	bool "Page migration"
>  	def_bool y
> -	depends on NUMA
>  	help
>  	  Allows the migration of the physical location of pages of processes
>  	  while the virtual addresses are not changed. This is useful for
>  	  example on NUMA systems to put pages nearer to the processors accessing
>  	  the page.
>  
> +config SYSCALL_MOVE_PAGES
> +	def_bool y
> +	depends on MIGRATION && NUMA
> +

Do we really need the CONFIG_SYSCALL_MOVE_PAGES? I think you will directly 
access the lower levels. So why have it? CONFIG_SYSCALL_MOVE_PAGES == 
CONFIG_NUMA.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA
  2007-05-29 18:01   ` Christoph Lameter
@ 2007-05-29 18:21     ` Mel Gorman
  2007-05-29 18:36       ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 18:21 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel, kamezawa.hiroyu

On Tue, 29 May 2007, Christoph Lameter wrote:

> On Tue, 29 May 2007, Mel Gorman wrote:
>
>> CONFIG_MIGRATION currently depends on CONFIG_NUMA. move_pages() is the only
>> user of migration today and as this system call is only meaningful on NUMA,
>> it makes sense. However, memory compaction will operate within a zone and is
>> useful on both NUMA and non-NUMA systems. This patch allows CONFIG_MIGRATION
>> to be used in all memory models. To preserve existing behaviour, move_pages()
>> is only available when CONFIG_NUMA is set.
>
> Hmmm... I thought I had this already set up so that it would be easy to
> switch page migration to not depend on CONFIG_NUMA. Not so it seems.
>

It's only policy_zone that it got hung-up on. In an earlir version, I just 
defined policy_zone outside of mempolicy but it was messy looking.

>> --- linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/migrate.h	2007-05-28 14:11:32.000000000 +0100
>> +++ linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/migrate.h	2007-05-29 10:00:09.000000000 +0100
>> @@ -7,7 +7,7 @@
>>
>>  typedef struct page *new_page_t(struct page *, unsigned long private, int **);
>>
>> -#ifdef CONFIG_MIGRATION
>> +#ifdef CONFIG_SYSCALL_MOVE_PAGES
>>  /* Check if a vma is migratable */
>>  static inline int vma_migratable(struct vm_area_struct *vma)
>>  {
>> @@ -24,7 +24,14 @@ static inline int vma_migratable(struct
>>  			return 0;
>>  	return 1;
>>  }
>> +#else
>> +static inline int vma_migratable(struct vm_area_struct *vma)
>> +{
>> +	return 0;
>> +}
>> +#endif
>
> I guess we get compilation failures because of the reference to
> policy_zone here for the !NUMA case? I think vma migratable is not used at
> all if !NUMA.
>

It isn't that I could tell.

>
>> +#ifdef CONFIG_MIGRATION
>>  extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
>>  extern int putback_lru_pages(struct list_head *l);
>>  extern int migrate_page(struct address_space *,
>> @@ -40,8 +47,6 @@ extern int migrate_vmas(struct mm_struct
>>  		const nodemask_t *from, const nodemask_t *to,
>>  		unsigned long flags);
>>  #else
>> -static inline int vma_migratable(struct vm_area_struct *vma)
>> -					{ return 0; }
>
> Maybe this block is not necessary?
>

Agreed. For a long time I didn't have it included at all but put it back 
in to preserve existing behaviour. I'll remove it.

>> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/mm.h linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/mm.h
>> --- linux-2.6.22-rc2-mm1-005_migrate_nocontext/include/linux/mm.h	2007-05-24 10:13:34.000000000 +0100
>> +++ linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/mm.h	2007-05-28 14:13:44.000000000 +0100
>> @@ -242,6 +242,8 @@ struct vm_operations_struct {
>>  	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
>>  	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
>>  					unsigned long addr);
>> +#endif /* CONFIG_NUMA */
>> +#ifdef CONFIG_MIGRATION
>>  	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
>>  		const nodemask_t *to, unsigned long flags);
>>  #endif
>
> Correct.
>
>> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/Kconfig linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/Kconfig
>> --- linux-2.6.22-rc2-mm1-005_migrate_nocontext/mm/Kconfig	2007-05-24 10:13:34.000000000 +0100
>> +++ linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/Kconfig	2007-05-29 09:57:23.000000000 +0100
>> @@ -145,13 +145,16 @@ config SPLIT_PTLOCK_CPUS
>>  config MIGRATION
>>  	bool "Page migration"
>>  	def_bool y
>> -	depends on NUMA
>>  	help
>>  	  Allows the migration of the physical location of pages of processes
>>  	  while the virtual addresses are not changed. This is useful for
>>  	  example on NUMA systems to put pages nearer to the processors accessing
>>  	  the page.
>>
>> +config SYSCALL_MOVE_PAGES
>> +	def_bool y
>> +	depends on MIGRATION && NUMA
>> +
>
> Do we really need the CONFIG_SYSCALL_MOVE_PAGES? I think you will directly
> access the lower levels. So why have it? CONFIG_SYSCALL_MOVE_PAGES ==
> CONFIG_NUMA.

Without SYSCALL_MOVE_PAGES, the check in migrate.h becomes

#if defined(CONFIG_NUMA) && defined(CONFIG_MIGRATION)
/* Check if a vma is migratable */
static inline int vma_migratable(struct vm_area_struct *vma)
#endif

That in itself is fine but in mm/migrate.c I didn't want to define 
sys_move_pages() in the non-NUMA case. Whatever about the header file 
where SYSCALL_MOVE_PAGES obscures understanding, I think it makes sense to 
have SYSCALL_MOVE_PAGES for mm/migrate.c . What do you think?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA
  2007-05-29 18:21     ` Mel Gorman
@ 2007-05-29 18:36       ` Christoph Lameter
  2007-05-29 18:49         ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2007-05-29 18:36 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, kamezawa.hiroyu

On Tue, 29 May 2007, Mel Gorman wrote:

> > > +config SYSCALL_MOVE_PAGES
> > > +	def_bool y
> > > +	depends on MIGRATION && NUMA
> > > +
> > 
> > Do we really need the CONFIG_SYSCALL_MOVE_PAGES? I think you will directly
> > access the lower levels. So why have it? CONFIG_SYSCALL_MOVE_PAGES ==
> > CONFIG_NUMA.
> 
> Without SYSCALL_MOVE_PAGES, the check in migrate.h becomes
> 
> #if defined(CONFIG_NUMA) && defined(CONFIG_MIGRATION)
> /* Check if a vma is migratable */
> static inline int vma_migratable(struct vm_area_struct *vma)
> #endif

Why do you need vma_migratable for the CONFIG_MIGRATION case? The use of 
vma_migratable in a !NUMA sitation would not be working right as far as I 
can tell.

#ifdef CONFIG_NUMA

is fine.

> That in itself is fine but in mm/migrate.c I didn't want to define
> sys_move_pages() in the non-NUMA case. Whatever about the header file where
> SYSCALL_MOVE_PAGES obscures understanding, I think it makes sense to have
> SYSCALL_MOVE_PAGES for mm/migrate.c . What do you think?

Why do you need sys_move_pages for the non-NUMA case?

The low level function that I intended to be used by defrag is 
migrate_pages and that one is outside of #ifdef CONFIG_NUMA.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA
  2007-05-29 18:36       ` Christoph Lameter
@ 2007-05-29 18:49         ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 18:49 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel, kamezawa.hiroyu

On Tue, 29 May 2007, Christoph Lameter wrote:

> On Tue, 29 May 2007, Mel Gorman wrote:
>
>>>> +config SYSCALL_MOVE_PAGES
>>>> +	def_bool y
>>>> +	depends on MIGRATION && NUMA
>>>> +
>>>
>>> Do we really need the CONFIG_SYSCALL_MOVE_PAGES? I think you will directly
>>> access the lower levels. So why have it? CONFIG_SYSCALL_MOVE_PAGES ==
>>> CONFIG_NUMA.
>>
>> Without SYSCALL_MOVE_PAGES, the check in migrate.h becomes
>>
>> #if defined(CONFIG_NUMA) && defined(CONFIG_MIGRATION)
>> /* Check if a vma is migratable */
>> static inline int vma_migratable(struct vm_area_struct *vma)
>> #endif
>
> Why do you need vma_migratable for the CONFIG_MIGRATION case? The use of
> vma_migratable in a !NUMA sitation would not be working right as far as I
> can tell.
>
> #ifdef CONFIG_NUMA
>
> is fine.
>

Makes sense.

>> That in itself is fine but in mm/migrate.c I didn't want to define
>> sys_move_pages() in the non-NUMA case. Whatever about the header file where
>> SYSCALL_MOVE_PAGES obscures understanding, I think it makes sense to have
>> SYSCALL_MOVE_PAGES for mm/migrate.c . What do you think?
>
> Why do you need sys_move_pages for the non-NUMA case?
>
> The low level function that I intended to be used by defrag is
> migrate_pages and that one is outside of #ifdef CONFIG_NUMA.
>

Also make sense. It'll be fixed up in the next verion minus the 
SYSCALL_MOVE_PAGES dirt. It'll even simplify the patch.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 4/7] Introduce isolate_lru_page_nolock() as a lockless version of isolate_lru_page()
  2007-05-29 17:36 [PATCH 0/7] [RFC] Memory Compaction v1 Mel Gorman
                   ` (2 preceding siblings ...)
  2007-05-29 17:37 ` [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA Mel Gorman
@ 2007-05-29 17:37 ` Mel Gorman
  2007-05-29 17:37 ` [PATCH 5/7] Provide metrics on the extent of fragmentation in zones Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 17:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman, kamezawa.hiroyu, clameter

Migration uses isolate_lru_page() to isolate an LRU page. This acquires
the zone->lru_lock to safely remove the page and place it on a private
list. However, this prevents the caller from batching up isolation of
multiple pages.  This patch introduces a nolock version of isolate_lru_page()
for callers that are aware of the locking requirements.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---

 include/linux/migrate.h |    8 +++++++-
 mm/migrate.c            |   37 +++++++++++++++++++++++++++----------
 2 files changed, 34 insertions(+), 11 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/migrate.h linux-2.6.22-rc2-mm1-020_isolate_nolock/include/linux/migrate.h
--- linux-2.6.22-rc2-mm1-015_migration_flatmem/include/linux/migrate.h	2007-05-29 10:00:09.000000000 +0100
+++ linux-2.6.22-rc2-mm1-020_isolate_nolock/include/linux/migrate.h	2007-05-29 10:18:51.000000000 +0100
@@ -32,6 +32,8 @@ static inline int vma_migratable(struct 
 #endif
 
 #ifdef CONFIG_MIGRATION
+extern int isolate_lru_page_nolock(struct zone *zone, struct page *p,
+						struct list_head *pagelist);
 extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
@@ -47,7 +49,11 @@ extern int migrate_vmas(struct mm_struct
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
 #else
-
+static inline int isolate_lru_page_nolock(struct zone *zone, struct page *p,
+						struct list_head *list)
+{
+	return -ENOSYS;
+}
 static inline int isolate_lru_page(struct page *p, struct list_head *list)
 					{ return -ENOSYS; }
 static inline int putback_lru_pages(struct list_head *l) { return 0; }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/migrate.c linux-2.6.22-rc2-mm1-020_isolate_nolock/mm/migrate.c
--- linux-2.6.22-rc2-mm1-015_migration_flatmem/mm/migrate.c	2007-05-29 10:01:40.000000000 +0100
+++ linux-2.6.22-rc2-mm1-020_isolate_nolock/mm/migrate.c	2007-05-29 10:18:51.000000000 +0100
@@ -41,6 +41,32 @@
  *  -EBUSY: page not on LRU list
  *  0: page removed from LRU list and added to the specified list.
  */
+int isolate_lru_page_nolock(struct zone *zone, struct page *page,
+						struct list_head *pagelist)
+{
+	int ret = -EBUSY;
+
+	if (PageLRU(page)) {
+		ret = 0;
+		get_page(page);
+		ClearPageLRU(page);
+		if (PageActive(page))
+			del_page_from_active_list(zone, page);
+		else
+			del_page_from_inactive_list(zone, page);
+		list_add_tail(&page->lru, pagelist);
+	}
+	return ret;
+}
+
+/*
+ * Acquire the zone->lru_lock and isolate one page from the LRU lists. If
+ * successful put it onto the indicated list with elevated page count.
+ *
+ * Result:
+ *  -EBUSY: page not on LRU list
+ *  0: page removed from LRU list and added to the specified list.
+ */
 int isolate_lru_page(struct page *page, struct list_head *pagelist)
 {
 	int ret = -EBUSY;
@@ -49,16 +75,7 @@ int isolate_lru_page(struct page *page, 
 		struct zone *zone = page_zone(page);
 
 		spin_lock_irq(&zone->lru_lock);
-		if (PageLRU(page)) {
-			ret = 0;
-			get_page(page);
-			ClearPageLRU(page);
-			if (PageActive(page))
-				del_page_from_active_list(zone, page);
-			else
-				del_page_from_inactive_list(zone, page);
-			list_add_tail(&page->lru, pagelist);
-		}
+		ret = isolate_lru_page_nolock(zone, page, pagelist);
 		spin_unlock_irq(&zone->lru_lock);
 	}
 	return ret;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 5/7] Provide metrics on the extent of fragmentation in zones
  2007-05-29 17:36 [PATCH 0/7] [RFC] Memory Compaction v1 Mel Gorman
                   ` (3 preceding siblings ...)
  2007-05-29 17:37 ` [PATCH 4/7] Introduce isolate_lru_page_nolock() as a lockless version of isolate_lru_page() Mel Gorman
@ 2007-05-29 17:37 ` Mel Gorman
  2007-05-29 17:38 ` [PATCH 6/7] Introduce a means of compacting memory within a zone Mel Gorman
  2007-05-29 17:38 ` [PATCH 7/7] Add /proc/sys/vm/compact_node for the explicit compaction of a node Mel Gorman
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 17:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman, kamezawa.hiroyu, clameter

It is useful to know the state of external fragmentation in the system
and whether allocation failures are due to low memory or external
fragmentation. This patch introduces two metrics for evaluation the state of
fragmentation and exports the information to /proc/pagetypeinfo. The metrics
will be used in the future to determine if it is better to compact memory
or directly reclaim for a high-order allocation to succeed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---

 include/linux/compaction.h |   18 ++++++++
 mm/Makefile                |    2 
 mm/compaction.c            |   86 ++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c                |   53 ++++++++++++++++++++++++
 4 files changed, 158 insertions(+), 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-020_isolate_nolock/include/linux/compaction.h linux-2.6.22-rc2-mm1-105_measure_fragmentation/include/linux/compaction.h
--- linux-2.6.22-rc2-mm1-020_isolate_nolock/include/linux/compaction.h	2007-05-28 10:16:07.000000000 +0100
+++ linux-2.6.22-rc2-mm1-105_measure_fragmentation/include/linux/compaction.h	2007-05-29 10:20:32.000000000 +0100
@@ -0,0 +1,18 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+#ifdef CONFIG_MIGRATION
+extern int unusable_free_index(struct zone *zone, unsigned int target_order);
+extern int fragmentation_index(struct zone *zone, unsigned int target_order);
+#else
+static inline int unusable_free_index(struct zone *z, unsigned int o)
+{
+	return -1;
+}
+
+static inline int fragmentation_index(struct zone *z, unsigned int o)
+{
+	return -1;
+}
+#endif /* CONFIG_MIGRATION */
+#endif /* _LINUX_COMPACTION_H */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-020_isolate_nolock/mm/compaction.c linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/compaction.c
--- linux-2.6.22-rc2-mm1-020_isolate_nolock/mm/compaction.c	2007-05-25 10:35:12.000000000 +0100
+++ linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/compaction.c	2007-05-29 10:20:32.000000000 +0100
@@ -0,0 +1,86 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation
+ * Copyright IBM Corp. 2007 Mel Gorman <mel@csn.ul.ie>
+ */
+#include <linux/mmzone.h>
+
+/*
+ * Calculate the number of free pages in a zone and how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size
+ */
+void calculate_freepages(struct zone *zone, unsigned int target_order,
+				unsigned long *ret_freepages,
+				unsigned long *ret_areas_free,
+				unsigned long *ret_suitable_areas_free)
+{
+	unsigned int order;
+	unsigned long freepages;
+	unsigned long areas_free;
+	unsigned long suitable_areas_free;
+
+	freepages = areas_free = suitable_areas_free = 0;
+	for (order = 0; order < MAX_ORDER; order++) {
+		unsigned long order_areas_free;
+
+		/* Count number of free blocks */
+		order_areas_free = zone->free_area[order].nr_free;
+		areas_free += order_areas_free;
+
+		/* Count free base pages */
+		freepages += order_areas_free << order;
+
+		/* Count number of suitably large free blocks */
+		if (order >= target_order)
+			suitable_areas_free += order_areas_free <<
+							(order - target_order);
+	}
+
+	*ret_freepages = freepages;
+	*ret_areas_free = areas_free;
+	*ret_suitable_areas_free = suitable_areas_free;
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size. A value towards 100
+ * implies that the majority of free memory is unusable and compaction
+ * may be required.
+ */
+int unusable_free_index(struct zone *zone, unsigned int target_order)
+{
+	unsigned long freepages, areas_free, suitable_areas_free;
+
+	calculate_freepages(zone, target_order,
+				&freepages, &areas_free, &suitable_areas_free);
+
+	/* No free memory is interpreted as all free memory is unusable */
+	if (freepages == 0)
+		return 100;
+
+	return ((freepages - (suitable_areas_free << target_order)) * 100) /
+								freepages;
+}
+
+/*
+ * Return the external fragmentation index for a zone. Values towards 100
+ * imply the allocation failure was due to external fragmentation. Values
+ * towards 0 imply the failure was due to lack of memory. The value is only
+ * useful when an allocation of the requested order would fail and it does
+ * not take into account pages free on the pcp list.
+ */
+int fragmentation_index(struct zone *zone, unsigned int target_order)
+{
+	unsigned long freepages, areas_free, suitable_areas_free;
+
+	calculate_freepages(zone, target_order,
+				&freepages, &areas_free, &suitable_areas_free);
+
+	/* An allocation succeeding implies this index has no meaning */
+	if (suitable_areas_free)
+		return -1;
+
+	return 100 - ((freepages / (1 << target_order)) * 100) / areas_free;
+}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-020_isolate_nolock/mm/Makefile linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/Makefile
--- linux-2.6.22-rc2-mm1-020_isolate_nolock/mm/Makefile	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/Makefile	2007-05-29 10:20:32.000000000 +0100
@@ -27,7 +27,7 @@ obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
-obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_MIGRATION) += migrate.o compaction.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-020_isolate_nolock/mm/vmstat.c linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/vmstat.c
--- linux-2.6.22-rc2-mm1-020_isolate_nolock/mm/vmstat.c	2007-05-28 14:09:40.000000000 +0100
+++ linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/vmstat.c	2007-05-29 10:20:32.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/compaction.h>
 #include "internal.h"
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
@@ -624,6 +625,56 @@ static void pagetypeinfo_showmixedcount_
 }
 #endif /* CONFIG_PAGE_OWNER */
 
+static void pagetypeinfo_showunusable_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+
+	seq_printf(m, "Node %4d, zone %8s %19s",
+				pgdat->node_id,
+				zone->name, " ");
+	for (order = 0; order < MAX_ORDER; ++order)
+		seq_printf(m, "%6d ", unusable_free_index(zone, order));
+
+	seq_putc(m, '\n');
+}
+
+/* Print out percentage of unusable free memory at each order */
+static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	seq_printf(m, "\nPercentage unusable free memory at order\n");
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
+
+	return 0;
+}
+
+static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+
+	seq_printf(m, "Node %4d, zone %8s %19s",
+				pgdat->node_id,
+				zone->name, " ");
+	for (order = 0; order < MAX_ORDER; ++order)
+		seq_printf(m, "%6d ", fragmentation_index(zone, order));
+
+	seq_putc(m, '\n');
+}
+
+/* Print the fragmentation index at each order */
+static int pagetypeinfo_showfragmentation(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	seq_printf(m, "\nFragmentation index\n");
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showfragmentation_print);
+
+	return 0;
+}
+
 /*
  * Print out the number of pageblocks for each migratetype that contain pages
  * of other types. This gives an indication of how well fallbacks are being
@@ -656,6 +707,8 @@ static int pagetypeinfo_show(struct seq_
 	seq_printf(m, "Pages per block:  %lu\n", pageblock_nr_pages);
 	seq_putc(m, '\n');
 	pagetypeinfo_showfree(m, pgdat);
+	pagetypeinfo_showunusable(m, pgdat);
+	pagetypeinfo_showfragmentation(m, pgdat);
 	pagetypeinfo_showblockcount(m, pgdat);
 	pagetypeinfo_showmixedcount(m, pgdat);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 6/7] Introduce a means of compacting memory within a zone
  2007-05-29 17:36 [PATCH 0/7] [RFC] Memory Compaction v1 Mel Gorman
                   ` (4 preceding siblings ...)
  2007-05-29 17:37 ` [PATCH 5/7] Provide metrics on the extent of fragmentation in zones Mel Gorman
@ 2007-05-29 17:38 ` Mel Gorman
  2007-05-29 17:38 ` [PATCH 7/7] Add /proc/sys/vm/compact_node for the explicit compaction of a node Mel Gorman
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 17:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman, kamezawa.hiroyu, clameter

This patch is the core of the memory compaction mechanism. It compacts memory
in a zone such that movable pages are relocated towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas and consumes the free pages within making them available for the
migration scanner.

Note that after this patch is applied there is still no means of triggering
a compaction run. Later patches will introduce the triggers, initially a
manual trigger.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---

 include/linux/mm.h |    1 
 mm/compaction.c    |  288 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c    |   40 ++++++
 3 files changed, 329 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-105_measure_fragmentation/include/linux/mm.h linux-2.6.22-rc2-mm1-110_compact_zone/include/linux/mm.h
--- linux-2.6.22-rc2-mm1-105_measure_fragmentation/include/linux/mm.h	2007-05-28 14:13:44.000000000 +0100
+++ linux-2.6.22-rc2-mm1-110_compact_zone/include/linux/mm.h	2007-05-29 10:22:15.000000000 +0100
@@ -337,6 +337,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
+int split_pagebuddy_page(struct page *page);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/compaction.c linux-2.6.22-rc2-mm1-110_compact_zone/mm/compaction.c
--- linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/compaction.c	2007-05-29 10:20:32.000000000 +0100
+++ linux-2.6.22-rc2-mm1-110_compact_zone/mm/compaction.c	2007-05-29 10:22:15.000000000 +0100
@@ -5,6 +5,29 @@
  * Copyright IBM Corp. 2007 Mel Gorman <mel@csn.ul.ie>
  */
 #include <linux/mmzone.h>
+#include <linux/gfp.h>
+#include <linux/list.h>
+#include <linux/vmstat.h>
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/swap-prefetch.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+	struct list_head freepages;	/* List of free pages to migrate to */
+	struct list_head migratepages;	/* List of pages being migated */
+	unsigned long nr_freepages;	/* Number of free pages */
+	unsigned long nr_migratepages;	/* Number of migrate pages */
+	unsigned long free_pfn;		/* isolate_freepages search area */
+	unsigned long migrate_pfn;	/* isolate_migratepages search area */
+};
 
 /*
  * Calculate the number of free pages in a zone and how many contiguous
@@ -84,3 +107,268 @@ int fragmentation_index(struct zone *zon
 
 	return 100 - ((freepages / (1 << target_order)) * 100) / areas_free;
 }
+
+static int release_freepages(struct zone *zone, struct list_head *freelist)
+{
+	struct page *page, *next;
+	int count = 0;
+
+	list_for_each_entry_safe(page, next, freelist, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+		count++;
+	}
+
+	return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+				unsigned long blockpfn,
+				struct list_head *freelist)
+{
+	unsigned long zone_end_pfn, end_pfn;
+	int total_isolated = 0;
+
+	/* Get the last PFN we should scan for free pages at */
+	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	end_pfn = blockpfn + pageblock_nr_pages;
+	if (end_pfn > zone_end_pfn)
+		end_pfn = zone_end_pfn;
+
+	/* Isolate free pages */
+	for (; blockpfn < end_pfn; blockpfn++) {
+		struct page *page;
+		int isolated, i;
+
+		if (!pfn_valid_within(blockpfn))
+			continue;
+
+		page = pfn_to_page(blockpfn);
+		if (!PageBuddy(page))
+			continue;
+
+		/* Found a free page, break it into order-0 pages */
+		isolated = split_pagebuddy_page(page);
+		total_isolated += isolated;
+		for (i = 0; i < isolated; i++) {
+			list_add(&page->lru, freelist);
+			page++;
+		}
+		blockpfn += isolated - 1;
+	}
+
+	return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int pageblock_suitable_migration(struct page *page)
+{
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return 1;
+
+	/* If the block is MIGRATE_MOVABLE, allow migration */
+	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
+		return 1;
+
+	/* Otherwise skip the block */
+	return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages within
+ */
+static void isolate_freepages(struct zone *zone,
+				struct compact_control *cc)
+{
+	struct page *page;
+	unsigned long highpfn, lowpfn, pfn;
+	int nr_freepages = cc->nr_freepages;
+	struct list_head *freelist = &cc->freepages;
+	unsigned long flags;
+
+	pfn = cc->free_pfn;
+	lowpfn = cc->migrate_pfn + pageblock_nr_pages;
+	highpfn = lowpfn;
+
+	/*
+	 * Isolate free pages until enough are available to migrate the
+	 * pages on cc->migratepages. We stop searching if the migrate
+	 * and free page scanners meet or enough free pages are isolated.
+	 */
+	spin_lock_irqsave(&zone->lock, flags);
+	for (; pfn > lowpfn && cc->nr_migratepages > nr_freepages;
+					pfn -= pageblock_nr_pages) {
+		int isolated;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		/* Check for overlapping nodes/zones */
+		page = pfn_to_page(pfn);
+		if (page_zone(page) != zone)
+			continue;
+
+		/* Check the block is suitable for migration */
+		if (!pageblock_suitable_migration(page))
+			continue;
+
+		/* Found a block suitable for isolating free pages from */
+		isolated = isolate_freepages_block(zone, pfn, freelist);
+		nr_freepages += isolated;
+
+		/*
+		 * Record the highest PFN we isolated pages from. When next
+		 * looking for free pages, the search will start here in case
+		 * migration did not use all free pages.
+		 */
+		if (isolated)
+			highpfn = max(highpfn, pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	cc->free_pfn = highpfn;
+	cc->nr_freepages = nr_freepages;
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control. We migrate pages from
+ * all block-types as the intention is to have all movable pages towards
+ * the end of the zone.
+ */
+static int isolate_migratepages(struct zone *zone,
+					struct compact_control *cc)
+{
+	unsigned long highpfn, lowpfn, end_pfn, start_pfn;
+	struct page *page;
+	int isolated = 0;
+	struct list_head *migratelist;
+
+	highpfn = cc->free_pfn;
+	lowpfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
+	migratelist = &cc->migratepages;
+
+	/* Do not scan outside zone boundaries */
+	if (lowpfn < zone->zone_start_pfn)
+		lowpfn = zone->zone_start_pfn;
+
+	/* Setup to scan one block but not past where we are migrating to */
+	end_pfn = ALIGN(lowpfn + pageblock_nr_pages, pageblock_nr_pages);
+	if (end_pfn > highpfn)
+		end_pfn = highpfn;
+	start_pfn = lowpfn;
+
+	/* Time to isolate some pages for migration */
+	spin_lock_irq(&zone->lru_lock);
+	for (; lowpfn < end_pfn; lowpfn++) {
+		if (!pfn_valid_within(lowpfn))
+			continue;
+
+		/* Get the page and skip if free */
+		page = pfn_to_page(lowpfn);
+		if (PageBuddy(page)) {
+			lowpfn += (1 << page_order(page)) - 1;
+			continue;
+		}
+
+		/* Try isolate the page */
+		if (isolate_lru_page_nolock(zone, page, migratelist) == 0)
+			isolated++;
+	}
+	spin_unlock_irq(&zone->lru_lock);
+
+	cc->migrate_pfn = end_pfn;
+	cc->nr_migratepages += isolated;
+	return isolated;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+					unsigned long data,
+					int **result)
+{
+	struct compact_control *cc = (struct compact_control *)data;
+	struct page *freepage;
+
+	VM_BUG_ON(cc == NULL);
+	if (list_empty(&cc->freepages))
+		return NULL;
+
+	freepage = list_entry(cc->freepages.next, struct page, lru);
+	list_del(&freepage->lru);
+	cc->nr_freepages--;
+
+#ifdef CONFIG_PAGE_OWNER
+	freepage->order = migratepage->order;
+	freepage->gfp_mask = migratepage->gfp_mask;
+	memcpy(freepage->trace, migratepage->trace, sizeof(freepage->trace));
+#endif
+
+	return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migation is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+	int nr_migratepages = 0;
+	int nr_freepages = 0;
+	struct page *page;
+	list_for_each_entry(page, &cc->migratepages, lru)
+		nr_migratepages++;
+	list_for_each_entry(page, &cc->freepages, lru)
+		nr_freepages++;
+
+	cc->nr_migratepages = nr_migratepages;
+	cc->nr_freepages = nr_freepages;
+}
+
+static unsigned long compact_zone(struct zone *zone, struct compact_control *cc)
+{
+	/* Setup to move all movable pages to the end of the zone */
+	cc->migrate_pfn = zone->zone_start_pfn;
+	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+	cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+	/* Flush pening updates to the LRU lists */
+	lru_add_drain_all();
+
+	/* Compact until the two PFN pointers cross */
+	while (cc->free_pfn > cc->migrate_pfn) {
+		isolate_migratepages(zone, cc);
+
+		if (!cc->nr_migratepages)
+			continue;
+
+		/* Isolate free pages if necessary */
+		if (cc->nr_freepages < cc->nr_migratepages)
+			isolate_freepages(zone, cc);
+
+		/* Stop compacting if we cannot get enough free pages */
+		if (cc->nr_freepages < cc->nr_migratepages)
+			break;
+
+		migrate_pages_nocontext(&cc->migratepages, compaction_alloc,
+							(unsigned long)cc);
+		update_nr_listpages(cc);
+	}
+
+	/* Release free pages and check accounting */
+	cc->nr_freepages -= release_freepages(zone, &cc->freepages);
+	WARN_ON(cc->nr_freepages != 0);
+
+	/* Release LRU pages not migrated */
+	if (!list_empty(&cc->migratepages))
+		putback_lru_pages(&cc->migratepages);
+
+	return 0;
+}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/page_alloc.c linux-2.6.22-rc2-mm1-110_compact_zone/mm/page_alloc.c
--- linux-2.6.22-rc2-mm1-105_measure_fragmentation/mm/page_alloc.c	2007-05-28 14:09:40.000000000 +0100
+++ linux-2.6.22-rc2-mm1-110_compact_zone/mm/page_alloc.c	2007-05-29 10:22:15.000000000 +0100
@@ -1065,6 +1065,46 @@ void split_page(struct page *page, unsig
 }
 
 /*
+ * Similar to split_page except the page is already free.
+ *
+ * TODO: This potentially goes below watermarks and knowing we are going
+ *       to free the pages soon is no good because we may need to make small
+ *       allocations for migration to succeed. Obey watermarks
+ */
+int split_pagebuddy_page(struct page *page)
+{
+	int order;
+	struct zone *zone;
+
+	/* Should never happen but lets handle it anyway */
+	if (!page || !PageBuddy(page))
+		return 0;
+
+	zone = page_zone(page);
+	order = page_order(page);
+
+	/* Remove page from free list */
+	list_del(&page->lru);
+	zone->free_area[order].nr_free--;
+	rmv_page_order(page);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+
+	/* Set the migratetype of the block if necessary */
+	if (order >= pageblock_order - 1 &&
+			get_pageblock_migratetype(page) != MIGRATE_MOVABLE) {
+		struct page *endpage = page + (1 << order) - 1;
+		for (; page < endpage; page += pageblock_nr_pages)
+			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+	}
+
+	return 1 << order;
+}
+
+/*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 7/7] Add /proc/sys/vm/compact_node for the explicit compaction of a node
  2007-05-29 17:36 [PATCH 0/7] [RFC] Memory Compaction v1 Mel Gorman
                   ` (5 preceding siblings ...)
  2007-05-29 17:38 ` [PATCH 6/7] Introduce a means of compacting memory within a zone Mel Gorman
@ 2007-05-29 17:38 ` Mel Gorman
  2007-05-30  4:14   ` Christoph Lameter
  6 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2007-05-29 17:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman, kamezawa.hiroyu, clameter

This patch adds a special file /proc/sys/vm/compact_node. When a number is
written to this file, each zone in that node will be compacted. sysfs did
not look appropriate for exporting this trigger. While the current use of
this trigger is for debugging, it is not clear if it should be only exported
via debugfs. Hence, it is exposed via /proc to start with.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---

 include/linux/compaction.h |    3 ++
 include/linux/sysctl.h     |    1 
 kernel/sysctl.c            |   13 +++++++++
 mm/compaction.c            |   53 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-110_compact_zone/include/linux/compaction.h linux-2.6.22-rc2-mm1-115_compact_viaproc/include/linux/compaction.h
--- linux-2.6.22-rc2-mm1-110_compact_zone/include/linux/compaction.h	2007-05-29 10:20:32.000000000 +0100
+++ linux-2.6.22-rc2-mm1-115_compact_viaproc/include/linux/compaction.h	2007-05-29 10:23:51.000000000 +0100
@@ -2,6 +2,9 @@
 #define _LINUX_COMPACTION_H
 
 #ifdef CONFIG_MIGRATION
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+				struct file *file, void __user *buffer,
+				size_t *length, loff_t *ppos);
 extern int unusable_free_index(struct zone *zone, unsigned int target_order);
 extern int fragmentation_index(struct zone *zone, unsigned int target_order);
 #else
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-110_compact_zone/include/linux/sysctl.h linux-2.6.22-rc2-mm1-115_compact_viaproc/include/linux/sysctl.h
--- linux-2.6.22-rc2-mm1-110_compact_zone/include/linux/sysctl.h	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-115_compact_viaproc/include/linux/sysctl.h	2007-05-29 10:23:51.000000000 +0100
@@ -209,6 +209,7 @@ enum
 	VM_VDSO_ENABLED=34,	/* map VDSO into new processes? */
 	VM_MIN_SLAB=35,		 /* Percent pages ignored by zone reclaim */
 	VM_HUGETLB_TREAT_MOVABLE=36, /* Allocate hugepages from ZONE_MOVABLE */
+	VM_COMPACT_NODE = 37,	/* Compact memory within a node */
 
 	/* s390 vm cmm sysctls */
 	VM_CMM_PAGES=1111,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-110_compact_zone/kernel/sysctl.c linux-2.6.22-rc2-mm1-115_compact_viaproc/kernel/sysctl.c
--- linux-2.6.22-rc2-mm1-110_compact_zone/kernel/sysctl.c	2007-05-24 10:13:34.000000000 +0100
+++ linux-2.6.22-rc2-mm1-115_compact_viaproc/kernel/sysctl.c	2007-05-29 10:23:51.000000000 +0100
@@ -47,6 +47,7 @@
 #include <linux/nfs_fs.h>
 #include <linux/acpi.h>
 #include <linux/reboot.h>
+#include <linux/compaction.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -77,6 +78,7 @@ extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
+extern int sysctl_compact_node;
 extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int maps_protect;
@@ -829,6 +831,17 @@ static ctl_table vm_table[] = {
 		.proc_handler	= drop_caches_sysctl_handler,
 		.strategy	= &sysctl_intvec,
 	},
+#ifdef CONFIG_MIGRATION
+	{
+		.ctl_name	= VM_COMPACT_NODE,
+		.procname	= "compact_node",
+		.data		= &sysctl_compact_node,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_compaction_handler,
+		.strategy	= &sysctl_intvec,
+	},
+#endif /* CONFIG_MIGRATION */
 	{
 		.ctl_name	= VM_MIN_FREE_KBYTES,
 		.procname	= "min_free_kbytes",
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.22-rc2-mm1-110_compact_zone/mm/compaction.c linux-2.6.22-rc2-mm1-115_compact_viaproc/mm/compaction.c
--- linux-2.6.22-rc2-mm1-110_compact_zone/mm/compaction.c	2007-05-29 10:22:15.000000000 +0100
+++ linux-2.6.22-rc2-mm1-115_compact_viaproc/mm/compaction.c	2007-05-29 10:23:51.000000000 +0100
@@ -10,6 +10,7 @@
 #include <linux/vmstat.h>
 #include <linux/swap.h>
 #include <linux/migrate.h>
+#include <linux/sysctl.h>
 #include <linux/swap-prefetch.h>
 #include "internal.h"
 
@@ -372,3 +373,55 @@ static unsigned long compact_zone(struct
 
 	return 0;
 }
+
+/* Compact all zones within a node */
+int compact_node(int nodeid)
+{
+	int zoneid;
+	pg_data_t *pgdat;
+	struct zone *zone;
+	struct compact_control cc;
+
+	if (nodeid < 0)
+		return -EINVAL;
+
+	pgdat = NODE_DATA(nodeid);
+	if (!pgdat || pgdat->node_id != nodeid)
+		return -EINVAL;
+
+	printk(KERN_INFO "Compacting memory in node %d\n", nodeid);
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		cc.nr_freepages = 0,
+		cc.nr_migratepages = 0,
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
+
+		compact_zone(zone, &cc);
+
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
+	}
+	printk(KERN_INFO "Compaction of node %d complete\n", nodeid);
+
+	return 0;
+}
+
+/* This is global and fierce ugly but it's straight-forward */
+int sysctl_compact_node;
+
+/* This is the entry point for compacting nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+			struct file *file, void __user *buffer,
+			size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (write)
+		return compact_node(sysctl_compact_node);
+
+	return 0;
+}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/7] Add /proc/sys/vm/compact_node for the explicit compaction of a node
  2007-05-29 17:38 ` [PATCH 7/7] Add /proc/sys/vm/compact_node for the explicit compaction of a node Mel Gorman
@ 2007-05-30  4:14   ` Christoph Lameter
  2007-05-30  8:26     ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2007-05-30  4:14 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, kamezawa.hiroyu

On Tue, 29 May 2007, Mel Gorman wrote:

> +	if (nodeid < 0)
> +		return -EINVAL;
> +
> +	pgdat = NODE_DATA(nodeid);
> +	if (!pgdat || pgdat->node_id != nodeid)
> +		return -EINVAL;

You cannot pass an arbitrary number to node data since NODE_DATA may do a 
simple array lookup.

Check for node < nr_node_ids first.

pgdat->node_id != nodeid? Sounds like something you should BUG() on.

IA64's NODE_DATA is

struct ia64_node_data {
        short                   active_cpu_count;
        short                   node;
        struct pglist_data      *pg_data_ptrs[MAX_NUMNODES];
};

/*
 * Given a node id, return a pointer to the pg_data_t for the node.
 *
 * NODE_DATA    - should be used in all code not related to system
 *                initialization. It uses pernode data structures to minimize
 *                offnode memory references. However, these structure are not
 *                present during boot. This macro can be used once cpu_init
 *                completes.
 */
#define NODE_DATA(nid)          (local_node_data->pg_data_ptrs[nid])

x86_64 also does

#define NODE_DATA(nid)          (node_data[nid])

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/7] Add /proc/sys/vm/compact_node for the explicit compaction of a node
  2007-05-30  4:14   ` Christoph Lameter
@ 2007-05-30  8:26     ` Mel Gorman
  2007-05-30 17:33       ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2007-05-30  8:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel, kamezawa.hiroyu

On Tue, 29 May 2007, Christoph Lameter wrote:

> On Tue, 29 May 2007, Mel Gorman wrote:
>
>> +	if (nodeid < 0)
>> +		return -EINVAL;
>> +
>> +	pgdat = NODE_DATA(nodeid);
>> +	if (!pgdat || pgdat->node_id != nodeid)
>> +		return -EINVAL;
>
> You cannot pass an arbitrary number to node data since NODE_DATA may do a
> simple array lookup.
>
> Check for node < nr_node_ids first.
>

Very good point. Will fix

> pgdat->node_id != nodeid? Sounds like something you should BUG() on.
>

On non-NUMA, NODE_DATA(anything) returns contig_page_data. I was catching 
where the node ID's didn't match up because node 0 was always returned. 
Checking nr_node_ids is the correct way of doing this.

It's not a BUG() if bad ID is passed in here because we're checking user 
input. By returning -EINVAL the proc writer knows something bad happened 
without making a big deal about it.

> IA64's NODE_DATA is
>
> struct ia64_node_data {
>        short                   active_cpu_count;
>        short                   node;
>        struct pglist_data      *pg_data_ptrs[MAX_NUMNODES];
> };
>
> /*
> * Given a node id, return a pointer to the pg_data_t for the node.
> *
> * NODE_DATA    - should be used in all code not related to system
> *                initialization. It uses pernode data structures to minimize
> *                offnode memory references. However, these structure are not
> *                present during boot. This macro can be used once cpu_init
> *                completes.
> */
> #define NODE_DATA(nid)          (local_node_data->pg_data_ptrs[nid])
>
> x86_64 also does
>
> #define NODE_DATA(nid)          (node_data[nid])
>

All spot on. Will fix.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/7] Add /proc/sys/vm/compact_node for the explicit compaction of a node
  2007-05-30  8:26     ` Mel Gorman
@ 2007-05-30 17:33       ` Christoph Lameter
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2007-05-30 17:33 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, kamezawa.hiroyu

On Wed, 30 May 2007, Mel Gorman wrote:

> > Check for node < nr_node_ids first.
> Very good point. Will fix

And check if the node is online first? F.e.

node_online(node) ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2007-05-31 12:26 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-29 17:36 [PATCH 0/7] [RFC] Memory Compaction v1 Mel Gorman
2007-05-29 17:36 ` [PATCH 1/7] Roll-up patch of what has been sent already Mel Gorman
2007-05-29 17:36 ` [PATCH 2/7] KAMEZAWA Hiroyuki - migration by kernel Mel Gorman
2007-05-30  2:42   ` KAMEZAWA Hiroyuki
2007-05-30  2:47     ` Christoph Lameter
2007-05-30 19:57     ` Hugh Dickins
2007-05-30 20:07       ` Christoph Lameter
2007-05-30 20:10         ` Christoph Lameter
2007-05-31 12:26       ` KAMEZAWA Hiroyuki
2007-05-29 17:37 ` [PATCH 3/7] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA Mel Gorman
2007-05-29 18:01   ` Christoph Lameter
2007-05-29 18:21     ` Mel Gorman
2007-05-29 18:36       ` Christoph Lameter
2007-05-29 18:49         ` Mel Gorman
2007-05-29 17:37 ` [PATCH 4/7] Introduce isolate_lru_page_nolock() as a lockless version of isolate_lru_page() Mel Gorman
2007-05-29 17:37 ` [PATCH 5/7] Provide metrics on the extent of fragmentation in zones Mel Gorman
2007-05-29 17:38 ` [PATCH 6/7] Introduce a means of compacting memory within a zone Mel Gorman
2007-05-29 17:38 ` [PATCH 7/7] Add /proc/sys/vm/compact_node for the explicit compaction of a node Mel Gorman
2007-05-30  4:14   ` Christoph Lameter
2007-05-30  8:26     ` Mel Gorman
2007-05-30 17:33       ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox