linux-mm.kvack.org archive mirror
* [PATCH 0/11] Avoiding fragmentation with subzone groupings v26
@ 2006-11-01 11:16 Mel Gorman
  2006-11-01 11:16 ` [PATCH 1/11] Add __GFP_EASYRCLM flag and update callers Mel Gorman
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:16 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

This is the latest version of anti-fragmentation based on sub-zones
(previously called list-based anti-fragmentation), rebased on top of
2.6.19-rc4-mm1. At its last release, it was decided that the scheme should
be implemented with zones to avoid affecting the page allocator hot paths.
However, at the VM Summit it was made clear that zones may not be the right
answer either because zones have their own issues. Hence, this is a
reintroduction of the first approach.

Changelog Since V25
o Fix loop order of for_each_rclmtype_order so that order of loop matches args
o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
o Rename get_pageblock_type() to get_page_rclmtype()
o Fix alignment problem in move_freepages()
o Add mechanism for assigning flags to blocks of pages instead of page->flags
o On fallback, do not examine the preferred list of free pages a second time

The purpose of these patches is to reduce external fragmentation by grouping
pages of related types together. The objective is that when page reclaim
occurs, there is a greater chance that large contiguous blocks of pages will
be free. Note that this is not defragmentation, which would obtain contiguous
pages by moving pages around.

These patches work by categorising allocations by their reclaimability:

EasyReclaimable - These are userspace pages that are easily reclaimable. This
	flag is set when it is known that the pages will be trivially reclaimed
	by writing the page out to swap or syncing with backing storage

KernelReclaimable - These are allocations for some kernel caches that are
	reclaimable or allocations that are known to be very short-lived.

KernelNonReclaimable - These are pages that are allocated by the kernel that
	are not trivially reclaimed. For example, the memory allocated for a
	loaded module would be in this category. By default, allocations are
	considered to be of this type
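
As an illustration of how callers tag allocations, patch 1 below introduces
a set_rclmflags() helper and applies __GFP_EASYRCLM at user-page allocation
sites; a typical call site from that patch ends up looking like:

	page = alloc_page_vma(set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
				vma, address);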

Instead of having one MAX_ORDER-sized array of free lists in struct free_area,
there is one for each type of reclaimability. Once a 2^MAX_ORDER block of
pages is split for a type of allocation, it is added to the free-lists for
that type, in effect reserving it. Hence, over time, pages of the different
types can be clustered together. When a page is allocated, the page flags
are updated with a value indicating its type of reclaimability so that it
is placed on the correct list when it is freed.
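
In outline, the data-structure change (introduced in patch 2, with a third
type added later in patch 7) is:

	#define RCLM_NORCLM 0	/* kernel allocations, not easily reclaimed */
	#define RCLM_EASY   1	/* easily reclaimed userspace pages */
	#define RCLM_TYPES  2

	struct free_area {
		struct list_head	free_list[RCLM_TYPES];
		unsigned long		nr_free;
	};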

When the preferred freelists are exhausted, the largest possible block is
taken from an alternative list. Buddies that are split off that large block
are placed on the preferred allocation-type freelists to mitigate
fragmentation.
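
Condensed for illustration, the fallback path added in patch 2
(__rmqueue_fallback()) searches the other type's lists from the largest
order down and flips the type used for the split-off buddies; the list
manipulation is trimmed here:

	for (current_order = MAX_ORDER-1; current_order >= order;
						--current_order) {
		area = &zone->free_area[current_order];
		if (list_empty(&area->free_list[rclmtype]))
			continue;

		/*
		 * If breaking up a large block, place the buddies on
		 * the preferred allocation type's free lists
		 */
		if (unlikely(current_order >= MAX_ORDER / 2))
			rclmtype = !rclmtype;
		...
	}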

This implementation gives a best effort at low fragmentation in all zones.
To be effective, min_free_kbytes needs to be set to about 10% of physical
memory (10% was found by experimentation and may be workload dependent). To
get that value lower, anti-fragmentation would need to be more invasive, so
it is best to find out what sorts of workloads still cause fragmentation
before taking further steps.

Our tests show that about 60-70% of physical memory can be allocated on a
desktop after a few days' uptime. In benchmarks and stress tests, we are
finding that 80% of memory is available as contiguous blocks at the end of
the test. By comparison, a standard kernel was getting < 1% of memory as
large pages on a desktop and about 8-12% of memory as large pages at the
end of stress tests.

Performance tests are within 0.1% for kbuild on a number of test machines.
aim9 is usually within 1%, except on x86_64 where aim9 results are
unreliable. I have never been able to show it, but it is possible the main
allocator path is adversely affected by anti-fragmentation (cache footprint
might be a problem) and it may be exposed by using different compilers or
benchmarks. If any regressions are detected due to anti-fragmentation, it
can simply be disabled via the kernel configuration, and I'd appreciate a
report detailing the regression and how to trigger it.

Following this email are 8 patches that implement anti-fragmentation, with
an additional 3 patches that provide an alternative to using page->flags.
The early patches introduce the split between user and kernel allocations.
Later we introduce a further split for kernel allocations, into KernRclm
and KernNoRclm. Note that although the early patches consume an additional
page flag, later patches reuse the suspend bits, releasing this bit again.
The last three patches remove the restriction on suspend by introducing an
alternative solution for tracking page blocks which removes the need for
any page bits.

Comments?
-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/11] Add __GFP_EASYRCLM flag and update callers
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
@ 2006-11-01 11:16 ` Mel Gorman
  2006-11-01 11:17 ` [PATCH 2/11] Split the free lists into kernel and user parts Mel Gorman
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:16 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

This patch adds a flag __GFP_EASYRCLM.  Allocations using the __GFP_EASYRCLM
flag are expected to be easily reclaimed by syncing with backing storage (be
it a file or swap) or cleaning the buffers and discarding.


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 fs/block_dev.c          |    3 ++-
 fs/buffer.c             |    3 ++-
 fs/compat.c             |    3 ++-
 fs/exec.c               |    3 ++-
 fs/inode.c              |    3 ++-
 include/asm-i386/page.h |    4 +++-
 include/linux/gfp.h     |   12 +++++++++++-
 include/linux/highmem.h |    4 +++-
 mm/memory.c             |    8 ++++++--
 mm/swap_state.c         |    4 +++-
 10 files changed, 36 insertions(+), 11 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/fs/block_dev.c linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/block_dev.c
--- linux-2.6.19-rc4-mm1-clean/fs/block_dev.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/block_dev.c	2006-10-31 13:29:03.000000000 +0000
@@ -380,7 +380,8 @@ struct block_device *bdget(dev_t dev)
 		inode->i_rdev = dev;
 		inode->i_bdev = bdev;
 		inode->i_data.a_ops = &def_blk_aops;
-		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+		mapping_set_gfp_mask(&inode->i_data,
+				set_rclmflags(GFP_USER, __GFP_EASYRCLM));
 		inode->i_data.backing_dev_info = &default_backing_dev_info;
 		spin_lock(&bdev_lock);
 		list_add(&bdev->bd_list, &all_bdevs);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/fs/buffer.c linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/buffer.c
--- linux-2.6.19-rc4-mm1-clean/fs/buffer.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/buffer.c	2006-10-31 13:29:03.000000000 +0000
@@ -995,7 +995,8 @@ grow_dev_page(struct block_device *bdev,
 	struct page *page;
 	struct buffer_head *bh;
 
-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, index,
+				   set_rclmflags(GFP_NOFS, __GFP_EASYRCLM));
 	if (!page)
 		return NULL;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/fs/compat.c linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/compat.c
--- linux-2.6.19-rc4-mm1-clean/fs/compat.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/compat.c	2006-10-31 13:29:03.000000000 +0000
@@ -1419,7 +1419,8 @@ static int compat_copy_strings(int argc,
 			page = bprm->page[i];
 			new = 0;
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(set_rclmflags(GFP_HIGHUSER,
+							__GFP_EASYRCLM));
 				bprm->page[i] = page;
 				if (!page) {
 					ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/fs/exec.c linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/exec.c
--- linux-2.6.19-rc4-mm1-clean/fs/exec.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/exec.c	2006-10-31 13:29:03.000000000 +0000
@@ -239,7 +239,8 @@ static int copy_strings(int argc, char _
 			page = bprm->page[i];
 			new = 0;
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(set_rclmflags(GFP_HIGHUSER,
+							__GFP_EASYRCLM));
 				bprm->page[i] = page;
 				if (!page) {
 					ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/fs/inode.c linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/inode.c
--- linux-2.6.19-rc4-mm1-clean/fs/inode.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/fs/inode.c	2006-10-31 13:29:03.000000000 +0000
@@ -146,7 +146,8 @@ static struct inode *alloc_inode(struct 
 		mapping->a_ops = &empty_aops;
  		mapping->host = inode;
 		mapping->flags = 0;
-		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+		mapping_set_gfp_mask(mapping,
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM));
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/include/asm-i386/page.h linux-2.6.19-rc4-mm1-001_antifrag_flags/include/asm-i386/page.h
--- linux-2.6.19-rc4-mm1-clean/include/asm-i386/page.h	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/include/asm-i386/page.h	2006-10-31 13:29:03.000000000 +0000
@@ -35,7 +35,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+	alloc_page_vma(set_rclmflags(GFP_HIGHUSER|__GFP_ZERO, __GFP_EASYRCLM),\
+								vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/include/linux/gfp.h linux-2.6.19-rc4-mm1-001_antifrag_flags/include/linux/gfp.h
--- linux-2.6.19-rc4-mm1-clean/include/linux/gfp.h	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/include/linux/gfp.h	2006-10-31 13:29:03.000000000 +0000
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_EASYRCLM	((__force gfp_t)0x80000u) /* Easily reclaimed page */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,11 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
+			__GFP_EASYRCLM)
+
+/* This mask makes up all the RCLM-related flags */
+#define GFP_RECLAIM_MASK (__GFP_EASYRCLM)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
@@ -100,6 +105,11 @@ static inline enum zone_type gfp_zone(gf
 	return ZONE_NORMAL;
 }
 
+static inline gfp_t set_rclmflags(gfp_t gfp, gfp_t reclaim_flags)
+{
+	return (gfp & ~(GFP_RECLAIM_MASK)) | reclaim_flags;
+}
+
 /*
  * There is only one page-allocator function, and two main namespaces to
  * it. The alloc_page*() variants return 'struct page *' and as such
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/include/linux/highmem.h linux-2.6.19-rc4-mm1-001_antifrag_flags/include/linux/highmem.h
--- linux-2.6.19-rc4-mm1-clean/include/linux/highmem.h	2006-10-31 03:37:36.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/include/linux/highmem.h	2006-10-31 13:29:03.000000000 +0000
@@ -63,7 +63,9 @@ static inline void clear_user_highpage(s
 static inline struct page *
 alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
 {
-	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+	struct page *page = alloc_page_vma(
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
+				vma, vaddr);
 
 	if (page)
 		clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/mm/memory.c linux-2.6.19-rc4-mm1-001_antifrag_flags/mm/memory.c
--- linux-2.6.19-rc4-mm1-clean/mm/memory.c	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/mm/memory.c	2006-10-31 13:29:03.000000000 +0000
@@ -1564,7 +1564,9 @@ gotten:
 		if (!new_page)
 			goto oom;
 	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		new_page = alloc_page_vma(
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
+				vma, address);
 		if (!new_page)
 			goto oom;
 		cow_user_page(new_page, old_page, address);
@@ -2188,7 +2190,9 @@ retry:
 
 			if (unlikely(anon_vma_prepare(vma)))
 				goto oom;
-			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			page = alloc_page_vma(
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
+				vma, address);
 			if (!page)
 				goto oom;
 			copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-clean/mm/swap_state.c linux-2.6.19-rc4-mm1-001_antifrag_flags/mm/swap_state.c
--- linux-2.6.19-rc4-mm1-clean/mm/swap_state.c	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-001_antifrag_flags/mm/swap_state.c	2006-10-31 13:29:03.000000000 +0000
@@ -343,7 +343,9 @@ struct page *read_swap_cache_async(swp_e
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
-			new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+			new_page = alloc_page_vma(
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
+				vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
 		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/11] Split the free lists into kernel and user parts
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
  2006-11-01 11:16 ` [PATCH 1/11] Add __GFP_EASYRCLM flag and update callers Mel Gorman
@ 2006-11-01 11:17 ` Mel Gorman
  2006-11-01 11:17 ` [PATCH 3/11] Split the per-cpu lists into RCLM_TYPES lists Mel Gorman
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:17 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

This patch adds the core of the anti-fragmentation strategy. It works by
grouping related allocation types together. The idea is that large groups of
pages that may be reclaimed are placed near each other. Each free list in
zone->free_area is split into RCLM_TYPES lists.


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---

 include/linux/mmzone.h     |   10 +++
 include/linux/page-flags.h |    7 ++
 mm/page_alloc.c            |  109 +++++++++++++++++++++++++++++++---------
 3 files changed, 102 insertions(+), 24 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-001_antifrag_flags/include/linux/mmzone.h linux-2.6.19-rc4-mm1-002_fragcore/include/linux/mmzone.h
--- linux-2.6.19-rc4-mm1-001_antifrag_flags/include/linux/mmzone.h	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-002_fragcore/include/linux/mmzone.h	2006-10-31 13:31:10.000000000 +0000
@@ -24,8 +24,16 @@
 #endif
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
+#define RCLM_NORCLM 0
+#define RCLM_EASY   1
+#define RCLM_TYPES  2
+
+#define for_each_rclmtype_order(order, type) \
+	for (order = 0; order < MAX_ORDER; order++) \
+		for (type = 0; type < RCLM_TYPES; type++)
+
 struct free_area {
-	struct list_head	free_list;
+	struct list_head	free_list[RCLM_TYPES];
 	unsigned long		nr_free;
 };
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-001_antifrag_flags/include/linux/page-flags.h linux-2.6.19-rc4-mm1-002_fragcore/include/linux/page-flags.h
--- linux-2.6.19-rc4-mm1-001_antifrag_flags/include/linux/page-flags.h	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-002_fragcore/include/linux/page-flags.h	2006-10-31 13:31:10.000000000 +0000
@@ -93,6 +93,7 @@
 
 #define PG_readahead		20	/* Reminder to do readahead */
 
+#define PG_easyrclm		21	/* Page is an easy reclaim page */
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -253,6 +254,12 @@ static inline void SetPageUptodate(struc
 #define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
 #define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
 
+#define PageEasyRclm(page)	test_bit(PG_easyrclm, &(page)->flags)
+#define SetPageEasyRclm(page)	set_bit(PG_easyrclm, &(page)->flags)
+#define ClearPageEasyRclm(page)	clear_bit(PG_easyrclm, &(page)->flags)
+#define __SetPageEasyRclm(page)	__set_bit(PG_easyrclm, &(page)->flags)
+#define __ClearPageEasyRclm(page) __clear_bit(PG_easyrclm, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-001_antifrag_flags/mm/page_alloc.c linux-2.6.19-rc4-mm1-002_fragcore/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-001_antifrag_flags/mm/page_alloc.c	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-002_fragcore/mm/page_alloc.c	2006-10-31 13:33:27.000000000 +0000
@@ -135,6 +135,16 @@ static unsigned long __initdata dma_rese
 #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
+static inline int get_page_rclmtype(struct page *page)
+{
+	return (PageEasyRclm(page) != 0);
+}
+
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+	return ((gfp_flags & __GFP_EASYRCLM) != 0);
+}
+
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -404,11 +414,13 @@ static inline void __free_one_page(struc
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
+	int rclmtype = get_page_rclmtype(page);
 
 	if (unlikely(PageCompound(page)))
 		destroy_compound_page(page, order);
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+	__SetPageEasyRclm(page);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
 	VM_BUG_ON(bad_range(zone, page));
@@ -416,7 +428,6 @@ static inline void __free_one_page(struc
 	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		unsigned long combined_idx;
-		struct free_area *area;
 		struct page *buddy;
 
 		buddy = __page_find_buddy(page, page_idx, order);
@@ -424,8 +435,7 @@ static inline void __free_one_page(struc
 			break;		/* Move the buddy up one level. */
 
 		list_del(&buddy->lru);
-		area = zone->free_area + order;
-		area->nr_free--;
+		zone->free_area[order].nr_free--;
 		rmv_page_order(buddy);
 		combined_idx = __find_combined_index(page_idx, order);
 		page = page + (combined_idx - page_idx);
@@ -433,7 +443,7 @@ static inline void __free_one_page(struc
 		order++;
 	}
 	set_page_order(page, order);
-	list_add(&page->lru, &zone->free_area[order].free_list);
+	list_add(&page->lru, &zone->free_area[order].free_list[rclmtype]);
 	zone->free_area[order].nr_free++;
 }
 
@@ -568,7 +578,8 @@ void fastcall __init __free_pages_bootme
  * -- wli
  */
 static inline void expand(struct zone *zone, struct page *page,
- 	int low, int high, struct free_area *area)
+ 	int low, int high, struct free_area *area,
+	int rclmtype)
 {
 	unsigned long size = 1 << high;
 
@@ -577,7 +588,7 @@ static inline void expand(struct zone *z
 		high--;
 		size >>= 1;
 		VM_BUG_ON(bad_range(zone, &page[size]));
-		list_add(&page[size].lru, &area->free_list);
+		list_add(&page[size].lru, &area->free_list[rclmtype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -630,31 +641,80 @@ static int prep_new_page(struct page *pa
 	return 0;
 }
 
+/* Remove an element from the buddy allocator from the fallback list */
+static struct page *__rmqueue_fallback(struct zone *zone, int order,
+							gfp_t gfp_flags)
+{
+	struct free_area * area;
+	int current_order;
+	struct page *page;
+	int rclmtype = gfpflags_to_rclmtype(gfp_flags);
+
+	/* Find the largest possible block of pages in the other list */
+	rclmtype = !rclmtype;
+	for (current_order = MAX_ORDER-1; current_order >= order;
+						--current_order) {
+		area = &(zone->free_area[current_order]);
+ 		if (list_empty(&area->free_list[rclmtype]))
+ 			continue;
+
+		page = list_entry(area->free_list[rclmtype].next,
+					struct page, lru);
+		area->nr_free--;
+
+		/*
+		 * If breaking a large block of pages, place the buddies
+		 * on the preferred allocation list
+		 */
+		if (unlikely(current_order >= MAX_ORDER / 2))
+			rclmtype = !rclmtype;
+
+		/* Remove the page from the freelists */
+		list_del(&page->lru);
+		rmv_page_order(page);
+		zone->free_pages -= 1UL << order;
+		expand(zone, page, order, current_order, area, rclmtype);
+		return page;
+	}
+
+	return NULL;
+}
+
 /* 
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+						gfp_t gfp_flags)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
+	int rclmtype = gfpflags_to_rclmtype(gfp_flags);
 
+	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
-		if (list_empty(&area->free_list))
+		area = &(zone->free_area[current_order]);
+		if (list_empty(&area->free_list[rclmtype]))
 			continue;
 
-		page = list_entry(area->free_list.next, struct page, lru);
+		page = list_entry(area->free_list[rclmtype].next,
+					struct page, lru);
 		list_del(&page->lru);
 		rmv_page_order(page);
 		area->nr_free--;
 		zone->free_pages -= 1UL << order;
-		expand(zone, page, order, current_order, area);
-		return page;
+		expand(zone, page, order, current_order, area, rclmtype);
+		goto got_page;
 	}
 
-	return NULL;
+	page = __rmqueue_fallback(zone, order, gfp_flags);
+
+got_page:
+	if (unlikely(rclmtype == RCLM_NORCLM) && page)
+		__ClearPageEasyRclm(page);
+
+	return page;
 }
 
 /* 
@@ -663,13 +723,14 @@ static struct page *__rmqueue(struct zon
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order, 
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list,
+			gfp_t gfp_flags)
 {
 	int i;
 	
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order);
+		struct page *page = __rmqueue(zone, order, gfp_flags);
 		if (unlikely(page == NULL))
 			break;
 		list_add_tail(&page->lru, list);
@@ -744,7 +805,7 @@ void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn;
 	unsigned long flags;
-	int order;
+	int order, t;
 	struct list_head *curr;
 
 	if (!zone->spanned_pages)
@@ -761,14 +822,15 @@ void mark_free_pages(struct zone *zone)
 				ClearPageNosaveFree(page);
 		}
 
-	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
+	for_each_rclmtype_order(order, t) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
 			unsigned long i;
 
 			pfn = page_to_pfn(list_entry(curr, struct page, lru));
 			for (i = 0; i < (1UL << order); i++)
 				SetPageNosaveFree(pfn_to_page(pfn + i));
 		}
+	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -868,7 +930,7 @@ again:
 		local_irq_save(flags);
 		if (!pcp->count) {
 			pcp->count = rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+					pcp->batch, &pcp->list, gfp_flags);
 			if (unlikely(!pcp->count))
 				goto failed;
 		}
@@ -877,7 +939,7 @@ again:
 		pcp->count--;
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, gfp_flags);
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
@@ -1959,6 +2021,7 @@ void __meminit memmap_init_zone(unsigned
 		init_page_count(page);
 		reset_page_mapcount(page);
 		SetPageReserved(page);
+		SetPageEasyRclm(page);
 		INIT_LIST_HEAD(&page->lru);
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
@@ -1974,9 +2037,9 @@ void __meminit memmap_init_zone(unsigned
 void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
 				unsigned long size)
 {
-	int order;
-	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+	int order, rclmtype;
+	for_each_rclmtype_order(order, rclmtype) {
+		INIT_LIST_HEAD(&zone->free_area[order].free_list[rclmtype]);
 		zone->free_area[order].nr_free = 0;
 	}
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/11] Split the per-cpu lists into RCLM_TYPES lists
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
  2006-11-01 11:16 ` [PATCH 1/11] Add __GFP_EASYRCLM flag and update callers Mel Gorman
  2006-11-01 11:17 ` [PATCH 2/11] Split the free lists into kernel and user parts Mel Gorman
@ 2006-11-01 11:17 ` Mel Gorman
  2006-11-01 11:17 ` [PATCH 4/11] Add a configure option for anti-fragmentation Mel Gorman
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:17 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

The freelists for each allocation type can slowly become fragmented due to
the per-cpu lists. Consider the following sequence of events:

1. A 2^(MAX_ORDER-1) list is reserved for __GFP_EASYRCLM pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation

This results in a kernel page sitting in the middle of an RCLM_EASY region.
Over long periods of time, the anti-fragmentation scheme slowly degrades to
the standard allocator.

This patch divides the per-cpu lists into RCLM_TYPES lists.
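
In outline, struct per_cpu_pages becomes (see the mmzone.h hunk below):

	struct per_cpu_pages {
		int counts[RCLM_TYPES];	/* number of pages in each list */
		int high;		/* high watermark, emptying needed */
		int batch;		/* chunk size for buddy add/remove */
		struct list_head list[RCLM_TYPES];	/* the lists of pages */
	};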


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---

 include/linux/mmzone.h |   16 +++++++++-
 mm/page_alloc.c        |   66 ++++++++++++++++++++++++++++----------------
 mm/vmstat.c            |    4 +-
 3 files changed, 58 insertions(+), 28 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-002_fragcore/include/linux/mmzone.h linux-2.6.19-rc4-mm1-003_percpu/include/linux/mmzone.h
--- linux-2.6.19-rc4-mm1-002_fragcore/include/linux/mmzone.h	2006-10-31 13:31:10.000000000 +0000
+++ linux-2.6.19-rc4-mm1-003_percpu/include/linux/mmzone.h	2006-10-31 13:35:47.000000000 +0000
@@ -28,6 +28,8 @@
 #define RCLM_EASY   1
 #define RCLM_TYPES  2
 
+#define for_each_rclmtype(type) \
+	for (type = 0; type < RCLM_TYPES; type++)
 #define for_each_rclmtype_order(order, type) \
 	for (order = 0; order < MAX_ORDER; order++) \
 		for (type = 0; type < RCLM_TYPES; type++)
@@ -78,10 +80,10 @@ enum zone_stat_item {
 	NR_VM_ZONE_STAT_ITEMS };
 
 struct per_cpu_pages {
-	int count;		/* number of pages in the list */
+	int counts[RCLM_TYPES];	/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	struct list_head list;	/* the list of pages */
+	struct list_head list[RCLM_TYPES];	/* the list of pages */
 };
 
 struct per_cpu_pageset {
@@ -92,6 +94,16 @@ struct per_cpu_pageset {
 #endif
 } ____cacheline_aligned_in_smp;
 
+static inline int pcp_count(struct per_cpu_pages *pcp)
+{
+	int rclmtype, count = 0;
+
+	for_each_rclmtype(rclmtype)
+		count += pcp->counts[rclmtype];
+
+	return count;
+}
+
 #ifdef CONFIG_NUMA
 #define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
 #else
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-002_fragcore/mm/page_alloc.c linux-2.6.19-rc4-mm1-003_percpu/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-002_fragcore/mm/page_alloc.c	2006-10-31 13:33:27.000000000 +0000
+++ linux-2.6.19-rc4-mm1-003_percpu/mm/page_alloc.c	2006-10-31 13:40:01.000000000 +0000
@@ -748,7 +748,7 @@ static int rmqueue_bulk(struct zone *zon
  */
 void drain_node_pages(int nodeid)
 {
-	int i;
+	int i, pindex;
 	enum zone_type z;
 	unsigned long flags;
 
@@ -764,10 +764,14 @@ void drain_node_pages(int nodeid)
 			struct per_cpu_pages *pcp;
 
 			pcp = &pset->pcp[i];
-			if (pcp->count) {
+			if (pcp_count(pcp)) {
 				local_irq_save(flags);
-				free_pages_bulk(zone, pcp->count, &pcp->list, 0);
-				pcp->count = 0;
+				for_each_rclmtype(pindex) {
+					free_pages_bulk(zone,
+							pcp->counts[pindex],
+							&pcp->list[pindex], 0);
+					pcp->counts[pindex] = 0;
+				}
 				local_irq_restore(flags);
 			}
 		}
@@ -780,7 +784,7 @@ static void __drain_pages(unsigned int c
 {
 	unsigned long flags;
 	struct zone *zone;
-	int i;
+	int i, pindex;
 
 	for_each_zone(zone) {
 		struct per_cpu_pageset *pset;
@@ -791,8 +795,13 @@ static void __drain_pages(unsigned int c
 
 			pcp = &pset->pcp[i];
 			local_irq_save(flags);
-			free_pages_bulk(zone, pcp->count, &pcp->list, 0);
-			pcp->count = 0;
+			for_each_rclmtype(pindex) {
+				free_pages_bulk(zone,
+						pcp->counts[pindex],
+						&pcp->list[pindex], 0);
+
+				pcp->counts[pindex] = 0;
+			}
 			local_irq_restore(flags);
 		}
 	}
@@ -854,6 +863,7 @@ void drain_local_pages(void)
 static void fastcall free_hot_cold_page(struct page *page, int cold)
 {
 	struct zone *zone = page_zone(page);
+	int pindex = get_page_rclmtype(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
@@ -870,11 +880,11 @@ static void fastcall free_hot_cold_page(
 	pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
-	list_add(&page->lru, &pcp->list);
-	pcp->count++;
-	if (pcp->count >= pcp->high) {
-		free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
-		pcp->count -= pcp->batch;
+	list_add(&page->lru, &pcp->list[pindex]);
+	pcp->counts[pindex]++;
+	if (pcp->counts[pindex] >= pcp->high) {
+		free_pages_bulk(zone, pcp->batch, &pcp->list[pindex], 0);
+		pcp->counts[pindex] -= pcp->batch;
 	}
 	local_irq_restore(flags);
 	put_cpu();
@@ -920,6 +930,7 @@ static struct page *buffered_rmqueue(str
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
 	int cpu;
+	int rclmtype = gfpflags_to_rclmtype(gfp_flags);
 
 again:
 	cpu  = get_cpu();
@@ -928,15 +939,15 @@ again:
 
 		pcp = &zone_pcp(zone, cpu)->pcp[cold];
 		local_irq_save(flags);
-		if (!pcp->count) {
-			pcp->count = rmqueue_bulk(zone, 0,
-					pcp->batch, &pcp->list, gfp_flags);
-			if (unlikely(!pcp->count))
+		if (!pcp->counts[rclmtype]) {
+			pcp->counts[rclmtype] += rmqueue_bulk(zone, 0,
+				pcp->batch, &pcp->list[rclmtype], gfp_flags);
+			if (unlikely(!pcp->counts[rclmtype]))
 				goto failed;
 		}
-		page = list_entry(pcp->list.next, struct page, lru);
+		page = list_entry(pcp->list[rclmtype].next, struct page, lru);
 		list_del(&page->lru);
-		pcp->count--;
+		pcp->counts[rclmtype]--;
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
 		page = __rmqueue(zone, order, gfp_flags);
@@ -1621,9 +1632,10 @@ void show_free_areas(void)
 			printk("CPU %4d: Hot: hi:%5d, btch:%4d usd:%4d   "
 			       "Cold: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp[0].high,
-			       pageset->pcp[0].batch, pageset->pcp[0].count,
+			       pageset->pcp[0].batch,
+			       pcp_count(&pageset->pcp[0]),
 			       pageset->pcp[1].high, pageset->pcp[1].batch,
-			       pageset->pcp[1].count);
+			       pcp_count(&pageset->pcp[1]));
 		}
 	}
 
@@ -2084,20 +2096,26 @@ static int __cpuinit zone_batchsize(stru
 inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
 {
 	struct per_cpu_pages *pcp;
+	int rclmtype;
 
 	memset(p, 0, sizeof(*p));
 
 	pcp = &p->pcp[0];		/* hot */
-	pcp->count = 0;
+	for_each_rclmtype(rclmtype) {
+		pcp->counts[rclmtype] = 0;
+		INIT_LIST_HEAD(&pcp->list[rclmtype]);
+	}
 	pcp->high = 6 * batch;
 	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
+	INIT_LIST_HEAD(&pcp->list[RCLM_EASY]);
 
 	pcp = &p->pcp[1];		/* cold*/
-	pcp->count = 0;
+	for_each_rclmtype(rclmtype) {
+		pcp->counts[rclmtype] = 0;
+		INIT_LIST_HEAD(&pcp->list[rclmtype]);
+	}
 	pcp->high = 2 * batch;
 	pcp->batch = max(1UL, batch/2);
-	INIT_LIST_HEAD(&pcp->list);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-002_fragcore/mm/vmstat.c linux-2.6.19-rc4-mm1-003_percpu/mm/vmstat.c
--- linux-2.6.19-rc4-mm1-002_fragcore/mm/vmstat.c	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-003_percpu/mm/vmstat.c	2006-10-31 13:35:47.000000000 +0000
@@ -569,7 +569,7 @@ static int zoneinfo_show(struct seq_file
 
 			pageset = zone_pcp(zone, i);
 			for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) {
-				if (pageset->pcp[j].count)
+				if (pcp_count(&pageset->pcp[j]))
 					break;
 			}
 			if (j == ARRAY_SIZE(pageset->pcp))
@@ -581,7 +581,7 @@ static int zoneinfo_show(struct seq_file
 					   "\n              high:  %i"
 					   "\n              batch: %i",
 					   i, j,
-					   pageset->pcp[j].count,
+					   pcp_count(&pageset->pcp[j]),
 					   pageset->pcp[j].high,
 					   pageset->pcp[j].batch);
 			}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 4/11] Add a configure option for anti-fragmentation
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
                   ` (2 preceding siblings ...)
  2006-11-01 11:17 ` [PATCH 3/11] Split the per-cpu lists into RCLM_TYPES lists Mel Gorman
@ 2006-11-01 11:17 ` Mel Gorman
  2006-11-01 11:18 ` [PATCH 5/11] Drain per-cpu lists when high-order allocations fail Mel Gorman
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:17 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

The anti-fragmentation strategy has memory overhead. This patch allows
the strategy to be disabled on small-memory systems or when a workload is
known to suffer because of it. It also serves to show where the
anti-fragmentation strategy interacts with the standard buddy allocator.


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---

 init/Kconfig    |   14 ++++++++++++++
 mm/page_alloc.c |   20 ++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-003_percpu/init/Kconfig linux-2.6.19-rc4-mm1-004_configurable/init/Kconfig
--- linux-2.6.19-rc4-mm1-003_percpu/init/Kconfig	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-004_configurable/init/Kconfig	2006-10-31 13:42:06.000000000 +0000
@@ -481,6 +481,20 @@ config SLOB
 	default !SLAB
 	bool
 
+config PAGEALLOC_ANTIFRAG
+ 	bool "Avoid fragmentation in the page allocator"
+ 	def_bool n
+ 	help
+ 	  The standard allocator will fragment memory over time, which means
+ 	  that high-order allocations will fail even if kswapd is running. If
+ 	  this option is set, the allocator will try to group page types into
+ 	  two groups, kernel and easily reclaimable. The gain is a best-effort
+ 	  attempt at lowering fragmentation, which a few workloads care about.
+ 	  The loss is a more complex allocator that may perform slower. If
+	  you are interested in working with large pages, say Y and set
+	  /proc/sys/vm/min_free_kbytes to be 10% of physical memory. Otherwise
+ 	  say N
+
 menu "Loadable module support"
 
 config MODULES
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-003_percpu/mm/page_alloc.c linux-2.6.19-rc4-mm1-004_configurable/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-003_percpu/mm/page_alloc.c	2006-10-31 13:40:01.000000000 +0000
+++ linux-2.6.19-rc4-mm1-004_configurable/mm/page_alloc.c	2006-10-31 13:42:06.000000000 +0000
@@ -135,6 +135,7 @@ static unsigned long __initdata dma_rese
 #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
 static inline int get_page_rclmtype(struct page *page)
 {
 	return (PageEasyRclm(page) != 0);
@@ -144,6 +145,17 @@ static inline int gfpflags_to_rclmtype(g
 {
 	return ((gfp_flags & __GFP_EASYRCLM) != 0);
 }
+#else
+static inline int get_page_rclmtype(struct page *page)
+{
+	return RCLM_NORCLM;
+}
+
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+	return RCLM_NORCLM;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
 
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
@@ -641,6 +653,7 @@ static int prep_new_page(struct page *pa
 	return 0;
 }
 
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 							gfp_t gfp_flags)
@@ -679,6 +692,13 @@ static struct page *__rmqueue_fallback(s
 
 	return NULL;
 }
+#else
+static struct page *__rmqueue_fallback(struct zone *zone, unsigned int order,
+							int rclmtype)
+{
+	return NULL;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
 
 /* 
  * Do the hard work of removing an element from the buddy allocator.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 5/11] Drain per-cpu lists when high-order allocations fail
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
                   ` (3 preceding siblings ...)
  2006-11-01 11:17 ` [PATCH 4/11] Add a configure option for anti-fragmentation Mel Gorman
@ 2006-11-01 11:18 ` Mel Gorman
  2006-11-01 11:18 ` [PATCH 6/11] Move free pages between lists on steal Mel Gorman
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

Per-cpu pages can accidentally cause fragmentation because they are free
but pinned pages sitting in an otherwise contiguous block. When this patch
is applied, the per-cpu caches are drained after direct reclaim is entered
if the requested order is greater than 0. The drain simply reuses the code
used by suspend and CPU hotplug.
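
The hook itself is small; after direct reclaim in the allocator slow path
(see the page_alloc.c hunk below), it amounts to:

	/*
	 * Return free-but-pinned per-cpu pages to the buddy lists so
	 * they can coalesce into higher-order blocks
	 */
	if (order != 0)
		drain_all_local_pages();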

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 Kconfig      |    4 ++++
 page_alloc.c |   32 +++++++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 3 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-004_configurable/mm/Kconfig linux-2.6.19-rc4-mm1-005_drainpercpu/mm/Kconfig
--- linux-2.6.19-rc4-mm1-004_configurable/mm/Kconfig	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-005_drainpercpu/mm/Kconfig	2006-10-31 13:44:09.000000000 +0000
@@ -247,3 +247,7 @@ config READAHEAD_SMOOTH_AGING
 		- have the danger of readahead thrashing(i.e. memory tight)
 
 	  This feature is only available on non-NUMA systems.
+
+config NEED_DRAIN_PERCPU_PAGES
+	def_bool y
+	depends on PM || HOTPLUG_CPU || PAGEALLOC_ANTIFRAG
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-004_configurable/mm/page_alloc.c linux-2.6.19-rc4-mm1-005_drainpercpu/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-004_configurable/mm/page_alloc.c	2006-10-31 13:42:06.000000000 +0000
+++ linux-2.6.19-rc4-mm1-005_drainpercpu/mm/page_alloc.c	2006-10-31 13:44:09.000000000 +0000
@@ -799,7 +799,7 @@ void drain_node_pages(int nodeid)
 }
 #endif
 
-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
+#ifdef CONFIG_NEED_DRAIN_PERCPU_PAGES
 static void __drain_pages(unsigned int cpu)
 {
 	unsigned long flags;
@@ -826,7 +826,7 @@ static void __drain_pages(unsigned int c
 		}
 	}
 }
-#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_NEED_DRAIN_PERCPU_PAGES */
 
 #ifdef CONFIG_PM
 
@@ -863,7 +863,9 @@ void mark_free_pages(struct zone *zone)
 
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
+#endif /* CONFIG_PM */
 
+#if defined(CONFIG_PM) || defined(CONFIG_PAGEALLOC_ANTIFRAG)
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
  */
@@ -875,7 +877,28 @@ void drain_local_pages(void)
 	__drain_pages(smp_processor_id());
 	local_irq_restore(flags);	
 }
-#endif /* CONFIG_PM */
+
+void smp_drain_local_pages(void *arg)
+{
+	drain_local_pages();
+}
+
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator
+ */
+void drain_all_local_pages(void)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__drain_pages(smp_processor_id());
+	local_irq_restore(flags);
+
+	smp_call_function(smp_drain_local_pages, NULL, 0, 1);
+}
+#else
+void drain_all_local_pages(void) {}
+#endif /* CONFIG_PM || CONFIG_PAGEALLOC_ANTIFRAG */
 
 /*
  * Free a 0-order page
@@ -1381,6 +1404,9 @@ rebalance:
 
 	cond_resched();
 
+	if (order != 0)
+		drain_all_local_pages();
+
 	if (likely(did_some_progress)) {
 		page = get_page_from_freelist(gfp_mask, order,
 						zonelist, alloc_flags);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 6/11] Move free pages between lists on steal
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
                   ` (4 preceding siblings ...)
  2006-11-01 11:18 ` [PATCH 5/11] Drain per-cpu lists when high-order allocations fail Mel Gorman
@ 2006-11-01 11:18 ` Mel Gorman
  2006-11-01 11:18 ` [PATCH 7/11] Introduce the RCLM_KERN allocation type Mel Gorman
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

When a fallback occurs, free pages of one allocation type end up stored on
the lists for another. When a large block is stolen, this patch moves all
the free pages within that block to the lists of the stealing allocation
type.
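
Concretely, when a block larger than MAX_ORDER/2 is stolen,
__rmqueue_fallback() now makes the following call (see the hunk below):

	/* Move free pages between lists if stealing a large block */
	if (current_order > MAX_ORDER / 2)
		move_freepages_block(zone, page, start_rclmtype);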

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 page_alloc.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 75 insertions(+), 9 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-005_drainpercpu/mm/page_alloc.c linux-2.6.19-rc4-mm1-006_movefree/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-005_drainpercpu/mm/page_alloc.c	2006-10-31 13:44:09.000000000 +0000
+++ linux-2.6.19-rc4-mm1-006_movefree/mm/page_alloc.c	2006-10-31 13:50:10.000000000 +0000
@@ -654,6 +654,62 @@ static int prep_new_page(struct page *pa
 }
 
 #ifdef CONFIG_PAGEALLOC_ANTIFRAG
+/*
+ * Move the free pages in a range to the free lists of the requested type.
+ * Note that start_page and end_page are not aligned on a MAX_ORDER_NR_PAGES
+ * boundary. If alignment is required, use move_freepages_block()
+ */
+int move_freepages(struct zone *zone,
+			struct page *start_page, struct page *end_page,
+			int rclmtype)
+{
+	struct page *page;
+	unsigned long order;
+	int blocks_moved = 0;
+
+	BUG_ON(page_zone(start_page) != page_zone(end_page));
+
+	for (page = start_page; page < end_page;) {
+		if (!PageBuddy(page)) {
+			page++;
+			continue;
+		}
+#ifdef CONFIG_HOLES_IN_ZONE
+		if (!pfn_valid(page_to_pfn(page))) {
+			page++;
+			continue;
+		}
+#endif
+
+		order = page_order(page);
+		list_del(&page->lru);
+		list_add(&page->lru,
+			&zone->free_area[order].free_list[rclmtype]);
+		page += 1 << order;
+		blocks_moved++;
+	}
+
+	return blocks_moved;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page, int rclmtype)
+{
+	unsigned long start_pfn;
+	struct page *start_page, *end_page;
+
+	start_pfn = page_to_pfn(page);
+	start_pfn = start_pfn & ~(MAX_ORDER_NR_PAGES-1);
+	start_page = pfn_to_page(start_pfn);
+	end_page = start_page + MAX_ORDER_NR_PAGES;
+
+	if (page_zone(page) != page_zone(start_page))
+		start_page = page;
+	if (page_zone(page) != page_zone(end_page))
+		return 0;
+
+	return move_freepages(zone, start_page, end_page, rclmtype);
+}
+
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 							gfp_t gfp_flags)
@@ -661,10 +717,10 @@ static struct page *__rmqueue_fallback(s
 	struct free_area * area;
 	int current_order;
 	struct page *page;
-	int rclmtype = gfpflags_to_rclmtype(gfp_flags);
+	int start_rclmtype = gfpflags_to_rclmtype(gfp_flags);
+	int rclmtype = !start_rclmtype;
 
 	/* Find the largest possible block of pages in the other list */
-	rclmtype = !rclmtype;
 	for (current_order = MAX_ORDER-1; current_order >= order;
 						--current_order) {
 		area = &(zone->free_area[current_order]);
@@ -675,24 +731,34 @@ static struct page *__rmqueue_fallback(s
 					struct page, lru);
 		area->nr_free--;
 
-		/*
-		 * If breaking a large block of pages, place the buddies
-		 * on the preferred allocation list
-		 */
-		if (unlikely(current_order >= MAX_ORDER / 2))
-			rclmtype = !rclmtype;
-
 		/* Remove the page from the freelists */
 		list_del(&page->lru);
 		rmv_page_order(page);
 		zone->free_pages -= 1UL << order;
 		expand(zone, page, order, current_order, area, rclmtype);
+
+		/* Move free pages between lists if stealing a large block */
+		if (current_order > MAX_ORDER / 2)
+			move_freepages_block(zone, page, start_rclmtype);
+
 		return page;
 	}
 
 	return NULL;
 }
 #else
+int move_freepages(struct zone *zone,
+			struct page *start_page, struct page *end_page,
+			int rclmtype)
+{
+	return 0;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page, int rclmtype)
+{
+	return 0;
+}
+
 static struct page *__rmqueue_fallback(struct zone *zone, unsigned int order,
 							int rclmtype)
 {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 7/11] Introduce the RCLM_KERN allocation type
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
                   ` (5 preceding siblings ...)
  2006-11-01 11:18 ` [PATCH 6/11] Move free pages between lists on steal Mel Gorman
@ 2006-11-01 11:18 ` Mel Gorman
  2006-11-01 11:19 ` [PATCH 8/11] [DEBUG] Add statistics Mel Gorman
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

Some kernel allocations, such as inode caches, are easily reclaimable, and
these reclaimable kernel allocations are by far the most common type of
kernel allocation. This patch marks those types of allocation explicitly
and tries to group them together.

As another page bit would normally be required, it was decided to reuse the
suspend-related page bits and make anti-fragmentation and software suspend
mutually exclusive. A later set of patches introduces a mechanism for setting
flags for a whole block of pages instead.
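
For example, the dentry cache allocation in fs/dcache.c is tagged as
kernel-reclaimable in the hunk below:

	dentry = kmem_cache_alloc(dentry_cache,
				set_rclmflags(GFP_KERNEL, __GFP_KERNRCLM));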

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 arch/x86_64/kernel/e820.c  |    8 +++++
 fs/buffer.c                |    5 +--
 fs/dcache.c                |    3 +
 fs/ext2/super.c            |    3 +
 fs/ext3/super.c            |    3 +
 fs/jbd/journal.c           |    6 ++-
 fs/jbd/revoke.c            |    6 ++-
 fs/ntfs/inode.c            |    6 ++-
 fs/reiserfs/super.c        |    3 +
 include/linux/gfp.h        |   10 +++---
 include/linux/mmzone.h     |    5 +--
 include/linux/page-flags.h |   51 ++++++++++++++++++++++++++-----
 init/Kconfig               |    1 
 lib/radix-tree.c           |    6 ++-
 mm/page_alloc.c            |   64 ++++++++++++++++++++++++++--------------
 mm/shmem.c                 |    8 +++--
 net/core/skbuff.c          |    1 
 17 files changed, 135 insertions(+), 54 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/arch/x86_64/kernel/e820.c linux-2.6.19-rc4-mm1-007_kernrclm/arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc4-mm1-006_movefree/arch/x86_64/kernel/e820.c	2006-10-31 03:37:36.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/arch/x86_64/kernel/e820.c	2006-10-31 13:52:17.000000000 +0000
@@ -217,6 +217,13 @@ void __init e820_reserve_resources(void)
 	}
 }
 
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+static void __init
+e820_mark_nosave_range(unsigned long start, unsigned long end)
+{
+	printk("Nosave not set when anti-frag is enabled");
+}
+#else
 /* Mark pages corresponding to given address range as nosave */
 static void __init
 e820_mark_nosave_range(unsigned long start, unsigned long end)
@@ -232,6 +239,7 @@ e820_mark_nosave_range(unsigned long sta
 		if (pfn_valid(pfn))
 			SetPageNosave(pfn_to_page(pfn));
 }
+#endif
 
 /*
  * Find the ranges of physical addresses that do not correspond to
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/fs/buffer.c linux-2.6.19-rc4-mm1-007_kernrclm/fs/buffer.c
--- linux-2.6.19-rc4-mm1-006_movefree/fs/buffer.c	2006-10-31 13:29:03.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/fs/buffer.c	2006-10-31 13:52:17.000000000 +0000
@@ -2656,7 +2656,7 @@ int submit_bh(int rw, struct buffer_head
 	 * from here on down, it's all bio -- do the initial mapping,
 	 * submit_bio -> generic_make_request may further map this bio around
 	 */
-	bio = bio_alloc(GFP_NOIO, 1);
+	bio = bio_alloc(set_rclmflags(GFP_NOIO, __GFP_EASYRCLM), 1);
 
 	bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
@@ -2936,7 +2936,8 @@ static void recalc_bh_state(void)
 	
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
-	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+	struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+				set_rclmflags(gfp_flags, __GFP_KERNRCLM));
 	if (ret) {
 		get_cpu_var(bh_accounting).nr++;
 		recalc_bh_state();
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/fs/dcache.c linux-2.6.19-rc4-mm1-007_kernrclm/fs/dcache.c
--- linux-2.6.19-rc4-mm1-006_movefree/fs/dcache.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/fs/dcache.c	2006-10-31 13:52:17.000000000 +0000
@@ -861,7 +861,8 @@ struct dentry *d_alloc(struct dentry * p
 	struct dentry *dentry;
 	char *dname;
 
-	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); 
+	dentry = kmem_cache_alloc(dentry_cache,
+				set_rclmflags(GFP_KERNEL, __GFP_KERNRCLM));
 	if (!dentry)
 		return NULL;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/fs/ext2/super.c linux-2.6.19-rc4-mm1-007_kernrclm/fs/ext2/super.c
--- linux-2.6.19-rc4-mm1-006_movefree/fs/ext2/super.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/fs/ext2/super.c	2006-10-31 13:52:17.000000000 +0000
@@ -140,7 +140,8 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
 	struct ext2_inode_info *ei;
-	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+				set_rclmflags(SLAB_KERNEL, __GFP_KERNRCLM));
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/fs/ext3/super.c linux-2.6.19-rc4-mm1-007_kernrclm/fs/ext3/super.c
--- linux-2.6.19-rc4-mm1-006_movefree/fs/ext3/super.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/fs/ext3/super.c	2006-10-31 13:52:17.000000000 +0000
@@ -445,7 +445,8 @@ static struct inode *ext3_alloc_inode(st
 {
 	struct ext3_inode_info *ei;
 
-	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+	ei = kmem_cache_alloc(ext3_inode_cachep,
+				set_rclmflags(SLAB_NOFS, __GFP_KERNRCLM));
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/fs/jbd/journal.c linux-2.6.19-rc4-mm1-007_kernrclm/fs/jbd/journal.c
--- linux-2.6.19-rc4-mm1-006_movefree/fs/jbd/journal.c	2006-10-31 13:27:12.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/fs/jbd/journal.c	2006-10-31 13:52:17.000000000 +0000
@@ -1735,7 +1735,8 @@ static struct journal_head *journal_allo
 #ifdef CONFIG_JBD_DEBUG
 	atomic_inc(&nr_journal_heads);
 #endif
-	ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
+	ret = kmem_cache_alloc(journal_head_cache,
+				set_rclmflags(GFP_NOFS, __GFP_KERNRCLM));
 	if (ret == 0) {
 		jbd_debug(1, "out of memory for journal_head\n");
 		if (time_after(jiffies, last_warning + 5*HZ)) {
@@ -1745,7 +1746,8 @@ static struct journal_head *journal_allo
 		}
 		while (ret == 0) {
 			yield();
-			ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
+			ret = kmem_cache_alloc(journal_head_cache,
+				set_rclmflags(GFP_NOFS, __GFP_KERNRCLM));
 		}
 	}
 	return ret;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/fs/jbd/revoke.c linux-2.6.19-rc4-mm1-007_kernrclm/fs/jbd/revoke.c
--- linux-2.6.19-rc4-mm1-006_movefree/fs/jbd/revoke.c	2006-10-31 03:37:36.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/fs/jbd/revoke.c	2006-10-31 13:52:17.000000000 +0000
@@ -206,7 +206,8 @@ int journal_init_revoke(journal_t *journ
 	while((tmp >>= 1UL) != 0UL)
 		shift++;
 
-	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache,
+				set_rclmflags(GFP_KERNEL, __GFP_KERNRCLM));
 	if (!journal->j_revoke_table[0])
 		return -ENOMEM;
 	journal->j_revoke = journal->j_revoke_table[0];
@@ -229,7 +230,8 @@ int journal_init_revoke(journal_t *journ
 	for (tmp = 0; tmp < hash_size; tmp++)
 		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
 
-	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache,
+				set_rclmflags(GFP_KERNEL, __GFP_KERNRCLM));
 	if (!journal->j_revoke_table[1]) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/fs/ntfs/inode.c linux-2.6.19-rc4-mm1-007_kernrclm/fs/ntfs/inode.c
--- linux-2.6.19-rc4-mm1-006_movefree/fs/ntfs/inode.c	2006-10-31 03:37:36.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/fs/ntfs/inode.c	2006-10-31 13:52:17.000000000 +0000
@@ -324,7 +324,8 @@ struct inode *ntfs_alloc_big_inode(struc
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_big_inode_cache,
+				set_rclmflags(SLAB_NOFS, __GFP_KERNRCLM));
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return VFS_I(ni);
@@ -349,7 +350,8 @@ static inline ntfs_inode *ntfs_alloc_ext
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_inode_cache,
+				set_rclmflags(SLAB_NOFS, __GFP_KERNRCLM));
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return ni;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/fs/reiserfs/super.c linux-2.6.19-rc4-mm1-007_kernrclm/fs/reiserfs/super.c
--- linux-2.6.19-rc4-mm1-006_movefree/fs/reiserfs/super.c	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/fs/reiserfs/super.c	2006-10-31 13:52:17.000000000 +0000
@@ -496,7 +496,8 @@ static struct inode *reiserfs_alloc_inod
 {
 	struct reiserfs_inode_info *ei;
 	ei = (struct reiserfs_inode_info *)
-	    kmem_cache_alloc(reiserfs_inode_cachep, SLAB_KERNEL);
+	    kmem_cache_alloc(reiserfs_inode_cachep,
+			    	set_rclmflags(SLAB_KERNEL, __GFP_KERNRCLM));
 	if (!ei)
 		return NULL;
 	return &ei->vfs_inode;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/include/linux/gfp.h linux-2.6.19-rc4-mm1-007_kernrclm/include/linux/gfp.h
--- linux-2.6.19-rc4-mm1-006_movefree/include/linux/gfp.h	2006-10-31 13:29:03.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/include/linux/gfp.h	2006-10-31 13:52:17.000000000 +0000
@@ -46,9 +46,10 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
-#define __GFP_EASYRCLM	((__force gfp_t)0x80000u) /* Easily reclaimed page */
+#define __GFP_KERNRCLM	((__force gfp_t)0x80000u) /* Kernel reclaimable page */
+#define __GFP_EASYRCLM	((__force gfp_t)0x100000u) /* Easily reclaimed page */
 
-#define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* if you forget to add the bitmask here kernel will crash, period */
@@ -56,10 +57,10 @@ struct vm_area_struct;
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
 			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
-			__GFP_EASYRCLM)
+			__GFP_KERNRCLM|__GFP_EASYRCLM)
 
 /* This mask makes up all the RCLM-related flags */
-#define GFP_RECLAIM_MASK (__GFP_EASYRCLM)
+#define GFP_RECLAIM_MASK (__GFP_KERNRCLM|__GFP_EASYRCLM)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
@@ -107,6 +108,7 @@ static inline enum zone_type gfp_zone(gf
 
 static inline gfp_t set_rclmflags(gfp_t gfp, gfp_t reclaim_flags)
 {
+	BUG_ON((gfp & GFP_RECLAIM_MASK) == GFP_RECLAIM_MASK);
 	return (gfp & ~(GFP_RECLAIM_MASK)) | reclaim_flags;
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/include/linux/mmzone.h linux-2.6.19-rc4-mm1-007_kernrclm/include/linux/mmzone.h
--- linux-2.6.19-rc4-mm1-006_movefree/include/linux/mmzone.h	2006-10-31 13:35:47.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/include/linux/mmzone.h	2006-10-31 13:52:17.000000000 +0000
@@ -25,8 +25,9 @@
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
 #define RCLM_NORCLM 0
-#define RCLM_EASY   1
-#define RCLM_TYPES  2
+#define RCLM_KERN   1
+#define RCLM_EASY   2
+#define RCLM_TYPES  3
 
 #define for_each_rclmtype(type) \
 	for (type = 0; type < RCLM_TYPES; type++)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/include/linux/page-flags.h linux-2.6.19-rc4-mm1-007_kernrclm/include/linux/page-flags.h
--- linux-2.6.19-rc4-mm1-006_movefree/include/linux/page-flags.h	2006-10-31 13:31:10.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/include/linux/page-flags.h	2006-10-31 13:52:17.000000000 +0000
@@ -82,18 +82,28 @@
 #define PG_private		11	/* If pagecache, has fs-private data */
 
 #define PG_writeback		12	/* Page is under writeback */
-#define PG_nosave		13	/* Used for system suspend/resume */
 #define PG_compound		14	/* Part of a compound page */
 #define PG_swapcache		15	/* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
-#define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
 #define PG_readahead		20	/* Reminder to do readahead */
 
-#define PG_easyrclm		21	/* Page is an easy reclaim page */
+/*
+ * As anti-fragmentation requires two flags, it was best to reuse the suspend
+ * flags and make anti-fragmentation depend on !SOFTWARE_SUSPEND. This works
+ * on the assumption that machines being suspended do not really care about
+ * large contiguous allocations.
+ */
+#ifndef CONFIG_PAGEALLOC_ANTIFRAG
+#define PG_nosave		13	/* Used for system suspend/resume */
+#define PG_nosave_free		18	/* Free, should not be written */
+#else
+#define PG_kernrclm		13	/* Page is a kernel reclaim page */
+#define PG_easyrclm		18	/* Page is an easy reclaim page */
+#endif
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -211,6 +221,7 @@ static inline void SetPageUptodate(struc
 		ret;							\
 	})
 
+#ifndef CONFIG_PAGEALLOC_ANTIFRAG
 #define PageNosave(page)	test_bit(PG_nosave, &(page)->flags)
 #define SetPageNosave(page)	set_bit(PG_nosave, &(page)->flags)
 #define TestSetPageNosave(page)	test_and_set_bit(PG_nosave, &(page)->flags)
@@ -221,6 +232,34 @@ static inline void SetPageUptodate(struc
 #define SetPageNosaveFree(page)	set_bit(PG_nosave_free, &(page)->flags)
 #define ClearPageNosaveFree(page)		clear_bit(PG_nosave_free, &(page)->flags)
 
+#define PageKernRclm(page)	(0)
+#define SetPageKernRclm(page)	do {} while (0)
+#define ClearPageKernRclm(page)	do {} while (0)
+#define __SetPageKernRclm(page)	do {} while (0)
+#define __ClearPageKernRclm(page) do {} while (0)
+
+#define PageEasyRclm(page)	(0)
+#define SetPageEasyRclm(page)	do {} while (0)
+#define ClearPageEasyRclm(page)	do {} while (0)
+#define __SetPageEasyRclm(page)	do {} while (0)
+#define __ClearPageEasyRclm(page) do {} while (0)
+
+#else
+
+#define PageKernRclm(page)	test_bit(PG_kernrclm, &(page)->flags)
+#define SetPageKernRclm(page)	set_bit(PG_kernrclm, &(page)->flags)
+#define ClearPageKernRclm(page)	clear_bit(PG_kernrclm, &(page)->flags)
+#define __SetPageKernRclm(page)	__set_bit(PG_kernrclm, &(page)->flags)
+#define __ClearPageKernRclm(page) __clear_bit(PG_kernrclm, &(page)->flags)
+
+#define PageEasyRclm(page)	test_bit(PG_easyrclm, &(page)->flags)
+#define SetPageEasyRclm(page)	set_bit(PG_easyrclm, &(page)->flags)
+#define ClearPageEasyRclm(page)	clear_bit(PG_easyrclm, &(page)->flags)
+#define __SetPageEasyRclm(page)	__set_bit(PG_easyrclm, &(page)->flags)
+#define __ClearPageEasyRclm(page) __clear_bit(PG_easyrclm, &(page)->flags)
+#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
+
+
 #define PageBuddy(page)		test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)	__set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page)	__clear_bit(PG_buddy, &(page)->flags)
@@ -254,12 +293,6 @@ static inline void SetPageUptodate(struc
 #define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
 #define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
 
-#define PageEasyRclm(page)	test_bit(PG_easyrclm, &(page)->flags)
-#define SetPageEasyRclm(page)	set_bit(PG_easyrclm, &(page)->flags)
-#define ClearPageEasyRclm(page)	clear_bit(PG_easyrclm, &(page)->flags)
-#define __SetPageEasyRclm(page)	__set_bit(PG_easyrclm, &(page)->flags)
-#define __ClearPageEasyRclm(page) __clear_bit(PG_easyrclm, &(page)->flags)
-
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/init/Kconfig linux-2.6.19-rc4-mm1-007_kernrclm/init/Kconfig
--- linux-2.6.19-rc4-mm1-006_movefree/init/Kconfig	2006-10-31 13:42:06.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/init/Kconfig	2006-10-31 13:52:17.000000000 +0000
@@ -494,6 +494,7 @@ config PAGEALLOC_ANTIFRAG
 	  you are interested in working with large pages, say Y and set
 	  /proc/sys/vm/min_free_bytes to be 10% of physical memory. Otherwise
  	  say N
+	depends on !SOFTWARE_SUSPEND
 
 menu "Loadable module support"
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/lib/radix-tree.c linux-2.6.19-rc4-mm1-007_kernrclm/lib/radix-tree.c
--- linux-2.6.19-rc4-mm1-006_movefree/lib/radix-tree.c	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/lib/radix-tree.c	2006-10-31 13:52:17.000000000 +0000
@@ -93,7 +93,8 @@ radix_tree_node_alloc(struct radix_tree_
 	struct radix_tree_node *ret;
 	gfp_t gfp_mask = root_gfp_mask(root);
 
-	ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+	ret = kmem_cache_alloc(radix_tree_node_cachep,
+					set_rclmflags(gfp_mask, __GFP_KERNRCLM));
 	if (ret == NULL && !(gfp_mask & __GFP_WAIT)) {
 		struct radix_tree_preload *rtp;
 
@@ -137,7 +138,8 @@ int radix_tree_preload(gfp_t gfp_mask)
 	rtp = &__get_cpu_var(radix_tree_preloads);
 	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
 		preempt_enable();
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = kmem_cache_alloc(radix_tree_node_cachep,
+					set_rclmflags(gfp_mask, __GFP_KERNRCLM));
 		if (node == NULL)
 			goto out;
 		preempt_disable();
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/mm/page_alloc.c linux-2.6.19-rc4-mm1-007_kernrclm/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-006_movefree/mm/page_alloc.c	2006-10-31 13:50:10.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/mm/page_alloc.c	2006-10-31 13:52:17.000000000 +0000
@@ -138,12 +138,16 @@ static unsigned long __initdata dma_rese
 #ifdef CONFIG_PAGEALLOC_ANTIFRAG
 static inline int get_page_rclmtype(struct page *page)
 {
-	return (PageEasyRclm(page) != 0);
+	return ((PageEasyRclm(page) != 0) << 1) | (PageKernRclm(page) != 0);
 }
 
 static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
 {
-	return ((gfp_flags & __GFP_EASYRCLM) != 0);
+	gfp_t badflags = (__GFP_EASYRCLM | __GFP_KERNRCLM);
+	WARN_ON((gfp_flags & badflags) == badflags);
+
+	return (((gfp_flags & __GFP_EASYRCLM) != 0) << 1) |
+		((gfp_flags & __GFP_KERNRCLM) != 0);
 }
 #else
 static inline int get_page_rclmtype(struct page *page)
@@ -433,6 +437,7 @@ static inline void __free_one_page(struc
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 	__SetPageEasyRclm(page);
+	__ClearPageKernRclm(page);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
 	VM_BUG_ON(bad_range(zone, page));
@@ -710,6 +715,12 @@ int move_freepages_block(struct zone *zo
 	return move_freepages(zone, start_page, end_page, rclmtype);
 }
 
+static int fallbacks[RCLM_TYPES][RCLM_TYPES] = {
+	{ RCLM_KERN,   RCLM_EASY  }, /* RCLM_NORCLM Fallback */
+	{ RCLM_NORCLM, RCLM_EASY  }, /* RCLM_KERN Fallback */
+	{ RCLM_KERN,   RCLM_NORCLM}  /* RCLM_EASY Fallback */
+};
+
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 							gfp_t gfp_flags)
@@ -718,30 +729,36 @@ static struct page *__rmqueue_fallback(s
 	int current_order;
 	struct page *page;
 	int start_rclmtype = gfpflags_to_rclmtype(gfp_flags);
-	int rclmtype = !start_rclmtype;
+	int rclmtype, i;
 
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
 						--current_order) {
-		area = &(zone->free_area[current_order]);
- 		if (list_empty(&area->free_list[rclmtype]))
- 			continue;
+		for (i = 0; i < RCLM_TYPES - 1; i++) {
+			rclmtype = fallbacks[start_rclmtype][i];
 
-		page = list_entry(area->free_list[rclmtype].next,
-					struct page, lru);
-		area->nr_free--;
+			area = &(zone->free_area[current_order]);
+			if (list_empty(&area->free_list[rclmtype]))
+				continue;
 
-		/* Remove the page from the freelists */
-		list_del(&page->lru);
-		rmv_page_order(page);
-		zone->free_pages -= 1UL << order;
-		expand(zone, page, order, current_order, area, rclmtype);
+			page = list_entry(area->free_list[rclmtype].next,
+					struct page, lru);
+			area->nr_free--;
 
-		/* Move free pages between lists if stealing a large block */
-		if (current_order > MAX_ORDER / 2)
-			move_freepages_block(zone, page, start_rclmtype);
+			/* Remove the page from the freelists */
+			list_del(&page->lru);
+			rmv_page_order(page);
+			zone->free_pages -= 1UL << order;
+			expand(zone, page, order, current_order, area,
+							start_rclmtype);
+
+			/* Move free pages between lists for large blocks */
+			if (current_order >= MAX_ORDER / 2)
+				move_freepages_block(zone, page,
+							start_rclmtype);
 
-		return page;
+			return page;
+		}
 	}
 
 	return NULL;
@@ -797,9 +814,12 @@ static struct page *__rmqueue(struct zon
 	page = __rmqueue_fallback(zone, order, gfp_flags);
 
 got_page:
-	if (unlikely(rclmtype == RCLM_NORCLM) && page)
+	if (unlikely(rclmtype != RCLM_EASY) && page)
 		__ClearPageEasyRclm(page);
 
+	if (rclmtype == RCLM_KERN && page)
+		SetPageKernRclm(page);
+
 	return page;
 }
 
@@ -894,7 +914,7 @@ static void __drain_pages(unsigned int c
 }
 #endif /* CONFIG_DRAIN_PERCPU_PAGES */
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_SOFTWARE_SUSPEND
 
 void mark_free_pages(struct zone *zone)
 {
@@ -2217,7 +2237,7 @@ inline void setup_pageset(struct per_cpu
 		pcp->counts[rclmtype] = 0;
 		INIT_LIST_HEAD(&pcp->list[rclmtype]);
 	}
-	pcp->high = 6 * batch;
+	pcp->high = 3 * batch;
 	pcp->batch = max(1UL, 1 * batch);
 	INIT_LIST_HEAD(&pcp->list[RCLM_EASY]);
 
@@ -2226,7 +2246,7 @@ inline void setup_pageset(struct per_cpu
 		pcp->counts[rclmtype] = 0;
 		INIT_LIST_HEAD(&pcp->list[rclmtype]);
 	}
-	pcp->high = 2 * batch;
+	pcp->high = batch;
 	pcp->batch = max(1UL, batch/2);
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/mm/shmem.c linux-2.6.19-rc4-mm1-007_kernrclm/mm/shmem.c
--- linux-2.6.19-rc4-mm1-006_movefree/mm/shmem.c	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/mm/shmem.c	2006-10-31 13:52:17.000000000 +0000
@@ -94,7 +94,8 @@ static inline struct page *shmem_dir_all
 	 * BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
 	 * might be reconsidered if it ever diverges from PAGE_SIZE.
 	 */
-	return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+	return alloc_pages(set_rclmflags(gfp_mask, __GFP_KERNRCLM),
+						PAGE_CACHE_SHIFT-PAGE_SHIFT);
 }
 
 static inline void shmem_dir_free(struct page *page)
@@ -976,7 +977,8 @@ shmem_alloc_page(gfp_t gfp, struct shmem
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+	page = alloc_page_vma(set_rclmflags(gfp | __GFP_ZERO, __GFP_KERNRCLM),
+								&pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -996,7 +998,7 @@ shmem_swapin(struct shmem_inode_info *in
 static inline struct page *
 shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
 {
-	return alloc_page(gfp | __GFP_ZERO);
+	return alloc_page(set_rclmflags(gfp | __GFP_ZERO, __GFP_KERNRCLM));
 }
 #endif
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-006_movefree/net/core/skbuff.c linux-2.6.19-rc4-mm1-007_kernrclm/net/core/skbuff.c
--- linux-2.6.19-rc4-mm1-006_movefree/net/core/skbuff.c	2006-10-31 13:27:13.000000000 +0000
+++ linux-2.6.19-rc4-mm1-007_kernrclm/net/core/skbuff.c	2006-10-31 13:52:17.000000000 +0000
@@ -148,6 +148,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	u8 *data;
 
 	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	gfp_mask |= __GFP_KERNRCLM;
 
 	/* Get the HEAD */
 	skb = kmem_cache_alloc(cache, gfp_mask & ~__GFP_DMA);


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 8/11] [DEBUG] Add statistics
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
                   ` (6 preceding siblings ...)
  2006-11-01 11:18 ` [PATCH 7/11] Introduce the RCLM_KERN allocation type Mel Gorman
@ 2006-11-01 11:19 ` Mel Gorman
  2006-11-01 11:19 ` [PATCH 9/11] Add a bitmap that is used to track flags affecting a block of pages Mel Gorman
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:19 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

This patch is strictly debug only. With static markers from SystemTap (what is
the current story with these?) or any other type of static marking of probe
points, this could be replaced by a relatively trivial script. Until such
static probes exist, this patch outputs some information to /proc/buddyinfo
that may help explain what went wrong if the anti-fragmentation strategy fails.
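
As an illustration, the extra output appended to /proc/buddyinfo by this
patch looks roughly like the following (the numbers are invented purely to
show the layout; real values depend entirely on the workload):

  Fallback counts
  KernNoRclm:        3
  KernRclm:         12
  EasyRclm:          0

  Split counts
  KernNoRclm:       57
  KernRclm:          9
  EasyRclm:        214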


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 page_alloc.c |   20 ++++++++++++++++++++
 vmstat.c     |   16 ++++++++++++++++
 2 files changed, 36 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-007_kernrclm/mm/page_alloc.c linux-2.6.19-rc4-mm1-009_stats/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-007_kernrclm/mm/page_alloc.c	2006-10-31 13:52:17.000000000 +0000
+++ linux-2.6.19-rc4-mm1-009_stats/mm/page_alloc.c	2006-10-31 13:54:43.000000000 +0000
@@ -57,6 +57,10 @@ unsigned long totalram_pages __read_most
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
 int percpu_pagelist_fraction;
+int split_count[RCLM_TYPES];
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+int fallback_counts[RCLM_TYPES];
+#endif
 
 static void __free_pages_ok(struct page *page, unsigned int order);
 
@@ -745,6 +749,12 @@ static struct page *__rmqueue_fallback(s
 					struct page, lru);
 			area->nr_free--;
 
+			/* Account for a MAX_ORDER block being split */
+			if (current_order == MAX_ORDER - 1 &&
+					order < MAX_ORDER - 1) {
+				split_count[start_rclmtype]++;
+			}
+
 			/* Remove the page from the freelists */
 			list_del(&page->lru);
 			rmv_page_order(page);
@@ -757,6 +767,12 @@ static struct page *__rmqueue_fallback(s
 				move_freepages_block(zone, page,
 							start_rclmtype);
 
+			/* Account for fallbacks */
+			if (order < MAX_ORDER - 1 &&
+					current_order != MAX_ORDER - 1) {
+				fallback_counts[start_rclmtype]++;
+			}
+
 			return page;
 		}
 	}
@@ -807,6 +823,10 @@ static struct page *__rmqueue(struct zon
 		rmv_page_order(page);
 		area->nr_free--;
 		zone->free_pages -= 1UL << order;
+
+		if (current_order == MAX_ORDER - 1 && order < MAX_ORDER - 1)
+			split_count[rclmtype]++;
+
 		expand(zone, page, order, current_order, area, rclmtype);
 		goto got_page;
 	}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-007_kernrclm/mm/vmstat.c linux-2.6.19-rc4-mm1-009_stats/mm/vmstat.c
--- linux-2.6.19-rc4-mm1-007_kernrclm/mm/vmstat.c	2006-10-31 13:35:47.000000000 +0000
+++ linux-2.6.19-rc4-mm1-009_stats/mm/vmstat.c	2006-10-31 13:54:43.000000000 +0000
@@ -13,6 +13,11 @@
 #include <linux/module.h>
 #include <linux/cpu.h>
 
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+extern int split_count[RCLM_TYPES];
+extern int fallback_counts[RCLM_TYPES];
+#endif
+
 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free, struct pglist_data *pgdat)
 {
@@ -403,6 +408,17 @@ static void *frag_next(struct seq_file *
 
 static void frag_stop(struct seq_file *m, void *arg)
 {
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+	seq_printf(m, "Fallback counts\n");
+	seq_printf(m, "KernNoRclm: %8d\n", fallback_counts[RCLM_NORCLM]);
+	seq_printf(m, "KernRclm:   %8d\n", fallback_counts[RCLM_KERN]);
+	seq_printf(m, "EasyRclm:   %8d\n", fallback_counts[RCLM_EASY]);
+
+	seq_printf(m, "\nSplit counts\n");
+	seq_printf(m, "KernNoRclm: %8d\n", split_count[RCLM_NORCLM]);
+	seq_printf(m, "KernRclm:   %8d\n", split_count[RCLM_KERN]);
+	seq_printf(m, "EasyRclm:   %8d\n", split_count[RCLM_EASY]);
+#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
 }
 
 /*


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 9/11] Add a bitmap that is used to track flags affecting a block of pages
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
                   ` (7 preceding siblings ...)
  2006-11-01 11:19 ` [PATCH 8/11] [DEBUG] Add statistics Mel Gorman
@ 2006-11-01 11:19 ` Mel Gorman
  2006-11-01 11:19 ` [PATCH 10/11] Remove dependency on page->flag bits Mel Gorman
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:19 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

Anti-fragmentation uses two bits per page to track the page's
reclaimability. However, what is of real interest is what the whole
block of pages is being used for. This patch adds a bitmap that is used
for flags affecting a whole MAX_ORDER block of pages. Later patches drop
the requirement to use page->flags, and this bitmap is used instead.

In non-SPARSEMEM configurations, the bitmap is stored in the struct zone
and allocated during initialisation. SPARSEMEM statically allocates the
bitmap in a struct mem_section so that bitmaps do not have to be resized
during memory hotadd. This wastes a small amount of memory per unused section
(usually sizeof(unsigned long)) but the complexity of dynamically allocating
the memory is quite high.
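
As a rough sizing example (assuming the common values of MAX_ORDER = 11,
4K pages and the two pageblock bits introduced later in this series, none
of which is mandated by this patch): a 1GB zone has 262144 pages, which is
256 MAX_ORDER blocks of 1024 pages each. At 2 bits per block that is 512
bits, and rounded up to whole longs it comes to 64 bytes of bitmap for the
entire zone, so the space overhead is negligible.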

This mechanism is a proof of concept, so it favours an obviously correct
implementation over an optimal one.

Additional credit to Andy Whitcroft, who reviewed an earlier implementation
of the mechanism and suggested how to make it a *lot* cleaner.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 include/linux/mmzone.h          |   13 ++++
 include/linux/pageblock-flags.h |   48 +++++++++++++++
 mm/page_alloc.c                 |  112 +++++++++++++++++++++++++++++++++++
 3 files changed, 173 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-009_stats/include/linux/mmzone.h linux-2.6.19-rc4-mm1-101_pageblock_bits/include/linux/mmzone.h
--- linux-2.6.19-rc4-mm1-009_stats/include/linux/mmzone.h	2006-10-31 13:52:17.000000000 +0000
+++ linux-2.6.19-rc4-mm1-101_pageblock_bits/include/linux/mmzone.h	2006-10-31 17:42:25.000000000 +0000
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/seqlock.h>
 #include <linux/nodemask.h>
+#include <linux/pageblock-flags.h>
 #include <asm/atomic.h>
 #include <asm/page.h>
 
@@ -227,6 +228,14 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+#ifndef CONFIG_SPARSEMEM
+	/*
+	 * Flags for a MAX_ORDER_NR_PAGES block. See pageblock-flags.h.
+	 * In SPARSEMEM, this map is stored in struct mem_section
+	 */
+	unsigned long           *pageblock_flags;
+#endif /* CONFIG_SPARSEMEM */
+
 
 	ZONE_PADDING(_pad1_)
 
@@ -682,6 +691,9 @@ extern struct zone *next_zone(struct zon
 #define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)
 #define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION-1))
 
+#define SECTION_BLOCKFLAGS_BITS \
+		((SECTION_SIZE_BITS - (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS)
+
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #endif
@@ -701,6 +713,7 @@ struct mem_section {
 	 * before using it wrong.
 	 */
 	unsigned long section_mem_map;
+	DECLARE_BITMAP(pageblock_flags, SECTION_BLOCKFLAGS_BITS);
 };
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-009_stats/include/linux/pageblock-flags.h linux-2.6.19-rc4-mm1-101_pageblock_bits/include/linux/pageblock-flags.h
--- linux-2.6.19-rc4-mm1-009_stats/include/linux/pageblock-flags.h	2006-10-31 18:05:45.000000000 +0000
+++ linux-2.6.19-rc4-mm1-101_pageblock_bits/include/linux/pageblock-flags.h	2006-10-31 17:42:25.000000000 +0000
@@ -0,0 +1,48 @@
+/*
+ * Macros for manipulating and testing flags related to a
+ * MAX_ORDER_NR_PAGES block of pages.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation version 2 of the License
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Original author, Mel Gorman
+ * Major cleanups and reduction of bit operations, Andy Whitcroft
+ */
+#ifndef PAGEBLOCK_FLAGS_H
+#define PAGEBLOCK_FLAGS_H
+
+#include <linux/types.h>
+
+/* Bit indices that affect a whole block of pages */
+enum pageblock_bits {
+	NR_PAGEBLOCK_BITS
+};
+
+/* Forward declaration */
+struct page;
+
+/* Declarations for getting and setting flags. See mm/page_alloc.c */
+unsigned long get_pageblock_flags_group(struct page *page,
+					int start_bitidx, int end_bitidx);
+void set_pageblock_flags_group(struct page *page, unsigned long flags,
+					int start_bitidx, int end_bitidx);
+
+#define get_pageblock_flags(page) \
+			get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+#define set_pageblock_flags(page) \
+			set_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+
+#endif	/* PAGEBLOCK_FLAGS_H */
+
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-009_stats/mm/page_alloc.c linux-2.6.19-rc4-mm1-101_pageblock_bits/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-009_stats/mm/page_alloc.c	2006-10-31 13:54:43.000000000 +0000
+++ linux-2.6.19-rc4-mm1-101_pageblock_bits/mm/page_alloc.c	2006-10-31 17:42:25.000000000 +0000
@@ -2823,6 +2823,38 @@ static void __init calculate_node_totalp
 							realtotalpages);
 }
 
+#ifndef CONFIG_SPARSEMEM
+/*
+ * Calculate the size of the zone->blockflags rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
+ * Then figure 1 NR_PAGEBLOCK_BITS worth of bits per MAX_ORDER-1, finally
+ * round what is now in bits to nearest long in bits, then return it in
+ * bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+	unsigned long usemapsize;
+
+	usemapsize = roundup(zonesize, MAX_ORDER_NR_PAGES);
+	usemapsize = usemapsize >> (MAX_ORDER-1);
+	usemapsize *= NR_PAGEBLOCK_BITS;
+	usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+	return usemapsize / 8;
+}
+
+static void __init setup_usemap(struct pglist_data *pgdat,
+				struct zone *zone, unsigned long zonesize)
+{
+	unsigned long usemapsize = usemap_size(zonesize);
+	zone->pageblock_flags = alloc_bootmem_node(pgdat, usemapsize);
+	memset(zone->pageblock_flags, 0, usemapsize);
+}
+#else
+static void inline setup_usemap(struct pglist_data *pgdat,
+				struct zone *zone, unsigned long zonesize) {}
+#endif /* CONFIG_SPARSEMEM */
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -2909,6 +2941,7 @@ static void __meminit free_area_init_cor
 		ret = init_currently_empty_zone(zone, zone_start_pfn, size);
 		BUG_ON(ret);
 		zone_start_pfn += size;
+		setup_usemap(pgdat, zone, size);
 	}
 }
 
@@ -3622,3 +3655,82 @@ int highest_possible_node_id(void)
 }
 EXPORT_SYMBOL(highest_possible_node_id);
 #endif
+
+/* Return a pointer to the bitmap storing bits affecting a block of pages */
+static inline unsigned long *get_pageblock_bitmap(struct zone *zone,
+							unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+	unsigned long blockpfn;
+	blockpfn = pfn & ~(MAX_ORDER_NR_PAGES - 1);
+	return __pfn_to_section(blockpfn)->pageblock_flags;
+#else
+	return zone->pageblock_flags;
+#endif /* CONFIG_SPARSEMEM */
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+	pfn &= (PAGES_PER_SECTION-1);
+	return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+#else
+	pfn = pfn - zone->zone_start_pfn;
+	return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+#endif /* CONFIG_SPARSEMEM */
+}
+
+/**
+ * get_pageblock_flags_group - Return the requested group of flags for the MAX_ORDER_NR_PAGES block of pages
+ * @page: The page within the block of interest
+ * @start_bitidx: The first bit of interest to retrieve
+ * @end_bitidx: The last bit of interest
+ * returns pageblock_bits flags
+ */
+unsigned long get_pageblock_flags_group(struct page *page,
+					int start_bitidx, int end_bitidx)
+{
+	struct zone *zone;
+	unsigned long *bitmap;
+	unsigned long pfn, bitidx;
+	unsigned long flags = 0;
+	unsigned long value = 1;
+
+	zone = page_zone(page);
+	pfn = page_to_pfn(page);
+	bitmap = get_pageblock_bitmap(zone, pfn);
+	bitidx = pfn_to_bitidx(zone, pfn);
+
+	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
+		if (test_bit(bitidx + start_bitidx, bitmap))
+			flags |= value;
+	
+	return flags;
+}
+
+/**
+ * set_pageblock_flags_group - Set the requested group of flags for a MAX_ORDER_NR_PAGES block of pages
+ * @page: The page within the block of interest
+ * @start_bitidx: The first bit of interest
+ * @end_bitidx: The last bit of interest
+ * @flags: The flags to set
+ */
+void set_pageblock_flags_group(struct page *page, unsigned long flags,
+					int start_bitidx, int end_bitidx)
+{
+	struct zone *zone;
+	unsigned long *bitmap;
+	unsigned long pfn, bitidx;
+	unsigned long value = 1;
+
+	zone = page_zone(page);
+	pfn = page_to_pfn(page);
+	bitmap = get_pageblock_bitmap(zone, pfn);
+	bitidx = pfn_to_bitidx(zone, pfn);
+
+	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
+		if (flags & value)
+			__set_bit(bitidx + start_bitidx, bitmap);
+		else
+			__clear_bit(bitidx + start_bitidx, bitmap);
+}


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 10/11] Remove dependency on page->flag bits
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
                   ` (8 preceding siblings ...)
  2006-11-01 11:19 ` [PATCH 9/11] Add a bitmap that is used to track flags affecting a block of pages Mel Gorman
@ 2006-11-01 11:19 ` Mel Gorman
  2006-11-01 11:20 ` [PATCH 11/11] Use pageblock flags for anti-fragmentation Mel Gorman
       [not found] ` <p734ptilcie.fsf@verdi.suse.de>
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:19 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

The anti-fragmentation implementation uses page flags to track page usage.
In preparation for their replacement with corresponding pageblock flags,
remove the page->flags manipulation.

After this patch, anti-fragmentation is broken until the next patch in the
set is applied.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 arch/x86_64/kernel/e820.c  |    8 -------
 include/linux/page-flags.h |   45 +---------------------------------------
 init/Kconfig               |    1 
 mm/page_alloc.c            |   10 --------
 4 files changed, 2 insertions(+), 62 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-101_pageblock_bits/arch/x86_64/kernel/e820.c linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc4-mm1-101_pageblock_bits/arch/x86_64/kernel/e820.c	2006-10-31 13:52:17.000000000 +0000
+++ linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/arch/x86_64/kernel/e820.c	2006-10-31 17:44:48.000000000 +0000
@@ -217,13 +217,6 @@ void __init e820_reserve_resources(void)
 	}
 }
 
-#ifdef CONFIG_PAGEALLOC_ANTIFRAG
-static void __init
-e820_mark_nosave_range(unsigned long start, unsigned long end)
-{
-	printk("Nosave not set when anti-frag is enabled");
-}
-#else
 /* Mark pages corresponding to given address range as nosave */
 static void __init
 e820_mark_nosave_range(unsigned long start, unsigned long end)
@@ -239,7 +232,6 @@ e820_mark_nosave_range(unsigned long sta
 		if (pfn_valid(pfn))
 			SetPageNosave(pfn_to_page(pfn));
 }
-#endif
 
 /*
  * Find the ranges of physical addresses that do not correspond to
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-101_pageblock_bits/include/linux/page-flags.h linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/include/linux/page-flags.h
--- linux-2.6.19-rc4-mm1-101_pageblock_bits/include/linux/page-flags.h	2006-10-31 13:52:17.000000000 +0000
+++ linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/include/linux/page-flags.h	2006-10-31 17:44:48.000000000 +0000
@@ -82,29 +82,17 @@
 #define PG_private		11	/* If pagecache, has fs-private data */
 
 #define PG_writeback		12	/* Page is under writeback */
+#define PG_nosave		13	/* Used for system suspend/resume */
 #define PG_compound		14	/* Part of a compound page */
 #define PG_swapcache		15	/* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
+#define PG_nosave_free		18	/* Free, should not be written */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
 #define PG_readahead		20	/* Reminder to do readahead */
 
-/*
- * As anti-fragmentation requires two flags, it was best to reuse the suspend
- * flags and make anti-fragmentation depend on !SOFTWARE_SUSPEND. This works
- * on the assumption that machines being suspended do not really care about
- * large contiguous allocations.
- */
-#ifndef CONFIG_PAGEALLOC_ANTIFRAG
-#define PG_nosave		13	/* Used for system suspend/resume */
-#define PG_nosave_free		18	/* Free, should not be written */
-#else
-#define PG_kernrclm		13	/* Page is a kernel reclaim page */
-#define PG_easyrclm		18	/* Page is an easy reclaim page */
-#endif
-
 #if (BITS_PER_LONG > 32)
 /*
  * 64-bit-only flags build down from bit 31
@@ -221,7 +209,6 @@ static inline void SetPageUptodate(struc
 		ret;							\
 	})
 
-#ifndef CONFIG_PAGEALLOC_ANTIFRAG
 #define PageNosave(page)	test_bit(PG_nosave, &(page)->flags)
 #define SetPageNosave(page)	set_bit(PG_nosave, &(page)->flags)
 #define TestSetPageNosave(page)	test_and_set_bit(PG_nosave, &(page)->flags)
@@ -232,34 +219,6 @@ static inline void SetPageUptodate(struc
 #define SetPageNosaveFree(page)	set_bit(PG_nosave_free, &(page)->flags)
 #define ClearPageNosaveFree(page)		clear_bit(PG_nosave_free, &(page)->flags)
 
-#define PageKernRclm(page)	(0)
-#define SetPageKernRclm(page)	do {} while (0)
-#define ClearPageKernRclm(page)	do {} while (0)
-#define __SetPageKernRclm(page)	do {} while (0)
-#define __ClearPageKernRclm(page) do {} while (0)
-
-#define PageEasyRclm(page)	(0)
-#define SetPageEasyRclm(page)	do {} while (0)
-#define ClearPageEasyRclm(page)	do {} while (0)
-#define __SetPageEasyRclm(page)	do {} while (0)
-#define __ClearPageEasyRclm(page) do {} while (0)
-
-#else
-
-#define PageKernRclm(page)	test_bit(PG_kernrclm, &(page)->flags)
-#define SetPageKernRclm(page)	set_bit(PG_kernrclm, &(page)->flags)
-#define ClearPageKernRclm(page)	clear_bit(PG_kernrclm, &(page)->flags)
-#define __SetPageKernRclm(page)	__set_bit(PG_kernrclm, &(page)->flags)
-#define __ClearPageKernRclm(page) __clear_bit(PG_kernrclm, &(page)->flags)
-
-#define PageEasyRclm(page)	test_bit(PG_easyrclm, &(page)->flags)
-#define SetPageEasyRclm(page)	set_bit(PG_easyrclm, &(page)->flags)
-#define ClearPageEasyRclm(page)	clear_bit(PG_easyrclm, &(page)->flags)
-#define __SetPageEasyRclm(page)	__set_bit(PG_easyrclm, &(page)->flags)
-#define __ClearPageEasyRclm(page) __clear_bit(PG_easyrclm, &(page)->flags)
-#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
-
-
 #define PageBuddy(page)		test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)	__set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page)	__clear_bit(PG_buddy, &(page)->flags)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-101_pageblock_bits/init/Kconfig linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/init/Kconfig
--- linux-2.6.19-rc4-mm1-101_pageblock_bits/init/Kconfig	2006-10-31 13:52:17.000000000 +0000
+++ linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/init/Kconfig	2006-10-31 17:44:48.000000000 +0000
@@ -494,7 +494,6 @@ config PAGEALLOC_ANTIFRAG
 	  you are interested in working with large pages, say Y and set
 	  /proc/sys/vm/min_free_bytes to be 10% of physical memory. Otherwise
  	  say N
-	depends on !SOFTWARE_SUSPEND
 
 menu "Loadable module support"
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-101_pageblock_bits/mm/page_alloc.c linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-101_pageblock_bits/mm/page_alloc.c	2006-10-31 17:42:25.000000000 +0000
+++ linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/mm/page_alloc.c	2006-10-31 17:44:48.000000000 +0000
@@ -142,7 +142,6 @@ static unsigned long __initdata dma_rese
 #ifdef CONFIG_PAGEALLOC_ANTIFRAG
 static inline int get_page_rclmtype(struct page *page)
 {
-	return ((PageEasyRclm(page) != 0) << 1) | (PageKernRclm(page) != 0);
 }
 
 static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
@@ -440,8 +439,6 @@ static inline void __free_one_page(struc
 		destroy_compound_page(page, order);
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
-	__SetPageEasyRclm(page);
-	__ClearPageKernRclm(page);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
 	VM_BUG_ON(bad_range(zone, page));
@@ -834,12 +831,6 @@ static struct page *__rmqueue(struct zon
 	page = __rmqueue_fallback(zone, order, gfp_flags);
 
 got_page:
-	if (unlikely(rclmtype != RCLM_EASY) && page)
-		__ClearPageEasyRclm(page);
-
-	if (rclmtype == RCLM_KERN && page)
-		SetPageKernRclm(page);
-
 	return page;
 }
 
@@ -2185,7 +2176,6 @@ void __meminit memmap_init_zone(unsigned
 		init_page_count(page);
 		reset_page_mapcount(page);
 		SetPageReserved(page);
-		SetPageEasyRclm(page);
 		INIT_LIST_HEAD(&page->lru);
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 11/11] Use pageblock flags for anti-fragmentation
  2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
                   ` (9 preceding siblings ...)
  2006-11-01 11:19 ` [PATCH 10/11] Remove dependency on page->flag bits Mel Gorman
@ 2006-11-01 11:20 ` Mel Gorman
       [not found] ` <p734ptilcie.fsf@verdi.suse.de>
  11 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-01 11:20 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel

This patch alters anti-fragmentation to use the pageblock bits for tracking
the reclaimability of a block of pages.
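
For illustration only (this snippet is not part of the patch), the helpers
introduced below are used roughly as follows; the three RCLM_* values fit
in the two PB_rclmtype bits:

	/* Record that the MAX_ORDER block containing 'page' is easily
	 * reclaimable, then read the type back */
	set_pageblock_rclmtype(page, RCLM_EASY);
	rclmtype = get_pageblock_rclmtype(page);	/* returns RCLM_EASY */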

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 include/linux/pageblock-flags.h |    4 ++++
 mm/page_alloc.c                 |   22 ++++++++++++++++++----
 2 files changed, 22 insertions(+), 4 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/include/linux/pageblock-flags.h linux-2.6.19-rc4-mm1-103_antifrag_pageblock_bits/include/linux/pageblock-flags.h
--- linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/include/linux/pageblock-flags.h	2006-10-31 17:42:25.000000000 +0000
+++ linux-2.6.19-rc4-mm1-103_antifrag_pageblock_bits/include/linux/pageblock-flags.h	2006-10-31 18:25:54.000000000 +0000
@@ -27,6 +27,10 @@
 
 /* Bit indices that affect a whole block of pages */
 enum pageblock_bits {
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+	PB_rclmtype,
+	PB_rclmtype_end = (PB_rclmtype + 2) - 1, /* 2 bits for rclm types */
+#endif
 	NR_PAGEBLOCK_BITS
 };
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/mm/page_alloc.c linux-2.6.19-rc4-mm1-103_antifrag_pageblock_bits/mm/page_alloc.c
--- linux-2.6.19-rc4-mm1-102_remove_antifrag_pageflags/mm/page_alloc.c	2006-10-31 17:44:48.000000000 +0000
+++ linux-2.6.19-rc4-mm1-103_antifrag_pageblock_bits/mm/page_alloc.c	2006-10-31 18:24:48.000000000 +0000
@@ -140,8 +140,15 @@ static unsigned long __initdata dma_rese
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
 #ifdef CONFIG_PAGEALLOC_ANTIFRAG
-static inline int get_page_rclmtype(struct page *page)
+static inline int get_pageblock_rclmtype(struct page *page)
 {
+	return get_pageblock_flags_group(page, PB_rclmtype, PB_rclmtype_end);
+}
+
+static void set_pageblock_rclmtype(struct page *page, int rclmtype)
+{
+	set_pageblock_flags_group(page, (unsigned long)rclmtype,
+						PB_rclmtype, PB_rclmtype_end);
 }
 
 static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
@@ -153,11 +161,13 @@ static inline int gfpflags_to_rclmtype(g
 		((gfp_flags & __GFP_KERNRCLM) != 0);
 }
 #else
-static inline int get_page_rclmtype(struct page *page)
+static inline int get_pageblock_rclmtype(struct page *page)
 {
 	return RCLM_NORCLM;
 }
 
+static inline void set_pageblock_rclmtype(struct page *page, int rclmtype) {}
+
 static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
 {
 	return RCLM_NORCLM;
@@ -433,7 +443,7 @@ static inline void __free_one_page(struc
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
-	int rclmtype = get_page_rclmtype(page);
+	int rclmtype = get_pageblock_rclmtype(page);
 
 	if (unlikely(PageCompound(page)))
 		destroy_compound_page(page, order);
@@ -713,6 +723,7 @@ int move_freepages_block(struct zone *zo
 	if (page_zone(page) != page_zone(end_page))
 		return 0;
 
+	set_pageblock_rclmtype(start_page, rclmtype);
 	return move_freepages(zone, start_page, end_page, rclmtype);
 }
 
@@ -825,6 +836,10 @@ static struct page *__rmqueue(struct zon
 			split_count[rclmtype]++;
 
 		expand(zone, page, order, current_order, area, rclmtype);
+
+		if (current_order == MAX_ORDER - 1)
+			set_pageblock_rclmtype(page, rclmtype);
+
 		goto got_page;
 	}
 
@@ -1003,7 +1018,7 @@ void drain_all_local_pages(void) {}
 static void fastcall free_hot_cold_page(struct page *page, int cold)
 {
 	struct zone *zone = page_zone(page);
-	int pindex = get_page_rclmtype(page);
+	int pindex = get_pageblock_rclmtype(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/11] Avoiding fragmentation with subzone groupings v26
       [not found] ` <p734ptilcie.fsf@verdi.suse.de>
@ 2006-11-02 11:21   ` Mel Gorman
  0 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2006-11-02 11:21 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel Mailing List, Linux Memory Management List

On Thu, 2 Nov 2006, Andi Kleen wrote:

> Mel Gorman <mel@csn.ul.ie> writes:
>>
>> Our tests show that about 60-70% of physical memory can be allocated on
>> a desktop after a few days uptime. In benchmarks and stress tests, we are
>> finding that 80% of memory is available as contiguous blocks at the end of
>> the test. To compare, a standard kernel was getting < 1% of memory as large
>> pages on a desktop and about 8-12% of memory as large pages at the end of
>> stress tests.
>
> If you don't have a fixed limit on the unreclaimable memory you could
> still get into a situation where all memory is fragmented and unreclaimable,
> right?
>

Right, it's just considerably harder, so there will be adverse workloads 
that will break it (heavy IO on very large numbers of files under high 
load with reiserfs is one). I don't have a list of real workloads that 
break anti-frag yet, so I want to get anti-frag out there and see whether 
it helps people who really care about hugepages or not.

I've included a script below that tries to get as many hugepages as 
possible via the proc interface. What I usually do is run it after a 
series of stress tests, or sometimes on a desktop after a few days, to see 
how it gets on in comparison to the standard allocator. A test I ran there 
got 73% of memory as huge pages on a system with 19 days uptime. However, 
the machine wasn't heavily stressed during that time and I had configured 
min_free_kbytes to be 10% as suggested in the CONFIG help.

Generally anti-frag gets you way more hugepages, but not necessarily the 
whole system's worth. To get all free memory as huge pages, I'd need to be 
moving memory around and that would be very invasive. It gets better 
results with linear-reclaim or lumpy-reclaim patches applied.

For people to get 100% of the expected results, they will still need to size 
the hugepage pool at boot time or set aside a zone of reclaimable pages at 
boot time. This patch is aimed at relaxing the restriction on sizing the 
pool up while the system is in use. For example, take a batch-scheduled 
machine running HPC jobs. I want it to be able to get more or less 
hugepages between jobs without requiring reboots. I'd like to hear from 
people who try resizing the pool what sort of success they have and what 
sort of workloads broke the strategy on them.

> It might be much harder to hit, but we have so many users that at least
> a few will eventually.
>

This is true. There are additional steps that could be taken that would 
make it even harder to break down but I'd like to get more data on what 
sort of workloads break this strategy before I complicate things more.

>> Performance tests are within 0.1% for kbuild on a number of test machines. aim9
>> is usually within 1%
>
> 1% is a lot.
>

Well, yes, but two things. First, aim9 is a microbenchmark. Small 
differences in aim9 seem to make very little difference to other 
benchmarks like kbuild. On some arches, aim9 results vary widely between 
subsequent runs, making it very sensitive. I used aim9 initially because if 
it showed *large* regressions, something was usually up.

Second, I didn't say it was always a 1% regression, just that it is generally 
within 1%. Here is the last aim9 result comparison on the x86_64:

                  2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based
  1 creat-clo                150666.67                  157083.33    6416.66  4.26% File Creations and Closes/second
  2 page_test                186915.00                  189065.16    2150.16  1.15% System Allocations & Pages/second
  3 brk_test                1863739.38                 1972521.25  108781.87  5.84% System Memory Allocations/second
  4 jmp_test               16388101.98                16381716.67   -6385.31 -0.04% Non-local gotos/second
  5 signal_test              464500.00                  501649.73   37149.73  8.00% Signal Traps/second
  6 exec_test                   165.17                     162.59      -2.58 -1.56% Program Loads/second
  7 fork_test                  4283.57                    4365.21      81.64  1.91% Task Creations/second
  8 link_test                 50129.19                   47658.31   -2470.88 -4.93% Link/Unlink Pairs/second

It's actually showing some performance improvements there, according to aim9.

Here are the aim9 results on a ppc64 LPAR

                  2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based
  1 creat-clo                134460.92                  134816.67     355.75  0.26% File Creations and Closes/second
  2 page_test                307473.33                  304900.85   -2572.48 -0.84% System Allocations & Pages/second
  3 brk_test                1547025.50                 1565439.09   18413.59  1.19% System Memory Allocations/second
  4 jmp_test               10353816.67                10211531.41 -142285.26 -1.37% Non-local gotos/second
  5 signal_test              257007.17                  257066.67      59.50  0.02% Signal Traps/second
  6 exec_test                   108.61                     108.76       0.15  0.14% Program Loads/second
  7 fork_test                  3276.12                    3289.45      13.33  0.41% Task Creations/second
  8 link_test                 47225.33                   48289.50    1064.17  2.25% Link/Unlink Pairs/second

And here is the comparison on a numaq

                  2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based
  1 creat-clo                 46660.00                   48609.03    1949.03  4.18% File Creations and Closes/second
  2 page_test                 47555.81                   47588.68      32.87  0.07% System Allocations & Pages/second
  3 brk_test                 247910.77                  254179.15    6268.38  2.53% System Memory Allocations/second
  4 jmp_test                2276287.29                 2275924.69    -362.60 -0.02% Non-local gotos/second
  5 signal_test               65561.48                   64778.41    -783.07 -1.19% Signal Traps/second
  6 exec_test                    21.32                      21.31      -0.01 -0.05% Program Loads/second
  7 fork_test                   880.79                     906.36      25.57  2.90% Task Creations/second
  8 link_test                 19058.50                   18726.81    -331.69 -1.74% Link/Unlink Pairs/second

These results tend to vary by a few percent in each run, even on 
subsequent runs, so I consider the results to be very noisy, and I haven't 
done the legwork yet to get an average over multiple runs. To give an idea 
of how mad the results can be, this is an older set of results on an 
x86_64. Look at the brk_test results, for example. Between 2.6.19-rc2-mm2-clean 
and 2.6.19-rc3-mm1, there is apparently a 12% regression, but it's 
unlikely to be reflected in "real" benchmarks.

                  2.6.19-rc2-mm2-clean  2.6.19-rc2-mm2-list-based
  1 creat-clo                142759.54                  170083.33   27323.79 19.14% File Creations and Closes/second
  2 page_test                187305.90                  179716.71   -7589.19 -4.05% System Allocations & Pages/second
  3 brk_test                2139943.34                 2377053.82  237110.48 11.08% System Memory Allocations/second
  4 jmp_test               16387850.00                16380453.26   -7396.74 -0.05% Non-local gotos/second
  5 signal_test              536933.33                  495550.74  -41382.59 -7.71% Signal Traps/second
  6 exec_test                   166.17                     162.39      -3.78 -2.27% Program Loads/second
  7 fork_test                  4201.23                    4261.91      60.68  1.44% Task Creations/second
  8 link_test                 48980.64                   58369.22    9388.58 19.17% Link/Unlink Pairs/second

Hence, I'd like to get a better idea of what sort of performance effect 
other people see on the benchmarks they care about.

Here is the script I use to grab hugepages:

#!/bin/bash
# This benchmark checks how many hugepages can be allocated in the hugepage
# pool

P=hugepages_get-bench
SLEEP_INTERVAL=3
FAIL_AFTER_NO_CHANGE_ATTEMPTS=20

# Args
while [ "$1" != "" ]; do
 	case "$1" in
 		-s)		export SLEEP_INTERVAL=$2; shift 2;;
 		-f)		export FAIL_AFTER_NO_CHANGE_ATTEMPTS=$2; shift 2;;
 	esac
done

# Check proc entry exists
if [ ! -e /proc/sys/vm/nr_hugepages ]; then
 	echo Attempting load of hugetlbfs module
 	modprobe hugetlbfs
 	if [ ! -e /proc/sys/vm/nr_hugepages ]; then
 		echo ERROR: /proc/sys/vm/nr_hugepages does not exist
 		exit $EXIT_TERMINATE
 	fi
fi

echo Allocating hugepages test
echo -------------------------

# Disable OOM killer
echo Disabling OOM Killer for current test process
echo -17 > /proc/self/oom_adj

# Record existing hugepage count
STARTING_COUNT=`cat /proc/sys/vm/nr_hugepages`
echo Starting page count: $STARTING_COUNT

# Ensure we have permission to write
echo $STARTING_COUNT > /proc/sys/vm/nr_hugepages || {
 	echo ERROR: Do not have permission to adjust nr_hugepages count
 	exit $EXIT_TERMINATE
}

# Start test
CURRENT_COUNT=$STARTING_COUNT
LAST_COUNT=$STARTING_COUNT
NOCHANGE_COUNT=0
ATTEMPT=0

while [ $NOCHANGE_COUNT -ne $FAIL_AFTER_NO_CHANGE_ATTEMPTS ]; do
 	ATTEMPT=$((ATTEMPT+1))
 	PAGES_COUNT=$(($CURRENT_COUNT+100))
 	echo $PAGES_COUNT > /proc/sys/vm/nr_hugepages

 	CURRENT_COUNT=`cat /proc/sys/vm/nr_hugepages`
 	PROGRESS=
 	if [ "$CURRENT_COUNT" = "$LAST_COUNT" ]; then
 		NOCHANGE_COUNT=$(($NOCHANGE_COUNT+1))
 	else
 		NOCHANGE_COUNT=0
 		PROGRESS="Progress made with $(($CURRENT_COUNT-$LAST_COUNT)) pages"
 	fi

 	echo Attempt $ATTEMPT: $CURRENT_COUNT pages $PROGRESS
 	LAST_COUNT=$CURRENT_COUNT
 	sleep $SLEEP_INTERVAL
done

echo Final page count: $CURRENT_COUNT
echo $STARTING_COUNT > /proc/sys/vm/nr_hugepages
exit $EXIT_SUCCESS


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2006-11-02 11:21 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-01 11:16 [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
2006-11-01 11:16 ` [PATCH 1/11] Add __GFP_EASYRCLM flag and update callers Mel Gorman
2006-11-01 11:17 ` [PATCH 2/11] Split the free lists into kernel and user parts Mel Gorman
2006-11-01 11:17 ` [PATCH 3/11] Split the per-cpu lists into RCLM_TYPES lists Mel Gorman
2006-11-01 11:17 ` [PATCH 4/11] Add a configure option for anti-fragmentation Mel Gorman
2006-11-01 11:18 ` [PATCH 5/11] Drain per-cpu lists when high-order allocations fail Mel Gorman
2006-11-01 11:18 ` [PATCH 6/11] Move free pages between lists on steal Mel Gorman
2006-11-01 11:18 ` [PATCH 7/11] Introduce the RCLM_KERN allocation type Mel Gorman
2006-11-01 11:19 ` [PATCH 8/11] [DEBUG] Add statistics Mel Gorman
2006-11-01 11:19 ` [PATCH 9/11] Add a bitmap that is used to track flags affecting a block of pages Mel Gorman
2006-11-01 11:19 ` [PATCH 10/11] Remove dependency on page->flag bits Mel Gorman
2006-11-01 11:20 ` [PATCH 11/11] Use pageblock flags for anti-fragmentation Mel Gorman
     [not found] ` <p734ptilcie.fsf@verdi.suse.de>
2006-11-02 11:21   ` [PATCH 0/11] Avoiding fragmentation with subzone groupings v26 Mel Gorman
