* Avoiding external fragmentation with a placement policy Version 12
@ 2005-05-31 11:20 Mel Gorman
2005-06-01 20:55 ` Joel Schopp
0 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2005-05-31 11:20 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, akpm
Changelog since V11
o Mainly a rediff against 2.6.12-rc5
o Use #defines for indexing into pcpu lists
o Fix rounding error in the size of usemap
Changelog since V10
o All allocation types now use per-cpu caches like the standard allocator
o Removed all the additional buddy allocator statistic code
o Eliminated three zone fields that can be lived without
o Simplified some loops
o Removed many unnecessary calculations
Changelog since V9
o Tightened what pools are used for fallbacks, less likely to fragment
o Many micro-optimisations to have the same performance as the standard
allocator. Modified allocator now faster than standard allocator using
gcc 3.3.5
o Add counter for splits/coalescing
Changelog since V8
o rmqueue_bulk() allocates pages in large blocks and breaks them up into the
requested size. Reduces the number of calls to __rmqueue()
o Beancounters are now a configurable option under "Kernel Hacking"
o Broke out some code into inline functions to be more Hotplug-friendly
o Increased the size of reserve for fallbacks from 10% to 12.5%.
Changelog since V7
o Updated to 2.6.11-rc4
o Lots of cleanups, mainly related to beancounters
o Fixed up a miscalculation in the bitmap size as pointed out by Mike Kravetz
(thanks Mike)
o Introduced a 10% reserve for fallbacks. Drastically reduces the number of
kernnorclm allocations that go to the wrong places
o Don't trigger OOM when large allocations are involved
Changelog since V6
o Updated to 2.6.11-rc2
o Minor change to allow prezeroing to be a cleaner looking patch
Changelog since V5
o Fixed up gcc-2.95 errors
o Fixed up whitespace damage
Changelog since V4
o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
with offsets to 2.6.11-rc1-mm1
Changelog since V3
o inlined get_pageblock_type() and set_pageblock_type()
o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone()
o When taking from the global pool, do not scan all the low-order lists
Changelog since V2
o Do not interfere with the "min" decay
o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
anything to do with asynchronous IO
Changelog since V1
o Update patch to 2.6.11-rc1
o Cleaned up bug where memory was wasted on a large bitmap
o Remove code that needed the binary buddy bitmaps
o Update flags to avoid colliding with __GFP_ZERO changes
o Extended fallback_count bean counters to show the fallback count for each
allocation type
o In-code documentation
Version 1
o Initial release against 2.6.9
This patch is designed to reduce fragmentation in the standard buddy allocator
without impairing the performance of the allocator. High fragmentation in
the standard binary buddy allocator means that high-order allocations can
rarely be serviced. This patch works by dividing allocations into three
different types (a small illustrative example follows the list):
UserReclaimable - These are userspace pages that are easily reclaimable. Right
now, all allocations of GFP_USER, GFP_HIGHUSER and disk buffers are
in this category. These pages are trivially reclaimed by writing
the page out to swap or syncing with backing storage
KernelReclaimable - These are pages allocated by the kernel that are easily
reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
These types of pages could potentially be reclaimed by dumping the
caches and reaping the slabs
KernelNonReclaimable - These are pages that are allocated by the kernel that
are not trivially reclaimed. For example, the memory allocated for a
loaded module would be in this category. By default, allocations are
considered to be of this type
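To illustrate how the type of an allocation is decided, the following is a
small standalone model of the flag-to-index translation. The flag values,
__GFP_TYPE_SHIFT and the ALLOC_* indices are the ones added by this patch
(see the gfp.h and mmzone.h hunks below); everything else is only example
code, not kernel code.

#include <stdio.h>

/* Flag bits introduced by this patch (see the gfp.h hunk below) */
#define __GFP_KERNRCLM   0x20000u  /* kernel page that is easily reclaimable */
#define __GFP_USERRCLM   0x40000u  /* easily reclaimable userspace page */
#define __GFP_TYPE_SHIFT 17        /* translate RCLM flags to an array index */

/* Free-list indices as defined in the mmzone.h hunk */
#define ALLOC_KERNNORCLM 0
#define ALLOC_KERNRCLM   1
#define ALLOC_USERRCLM   2

static const char *type_names[] = {
        "KernelNonReclaimable", "KernelReclaimable", "UserReclaimable"
};

int main(void)
{
        unsigned int masks[] = { 0, __GFP_KERNRCLM, __GFP_USERRCLM };
        int i;

        for (i = 0; i < 3; i++) {
                /* The same mask-and-shift that __alloc_pages()/__rmqueue() use */
                int type = (masks[i] & (__GFP_USERRCLM | __GFP_KERNRCLM))
                                >> __GFP_TYPE_SHIFT;
                printf("gfp mask 0x%05x -> free-list index %d (%s)\n",
                       masks[i], type, type_names[type]);
        }
        return 0;
}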
Instead of having one global MAX_ORDER-sized array of free lists, there are
four: one for each of the three allocation types and a fourth that acts as a
12.5% reserve for fallbacks. Finally, there is a list of 2^MAX_ORDER-sized
blocks which is a global pool of the largest pages the kernel deals with.
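To make that layout easier to picture, here is a trivial userspace model of
the per-type free lists. MAX_ORDER is given a typical value and the structure
is heavily simplified, so treat this as a sketch of the arrangement rather
than the real struct zone.

#include <stdio.h>

#define MAX_ORDER   11   /* typical value; configurable in the real kernel */
#define ALLOC_TYPES 4    /* KERNNORCLM, KERNRCLM, USERRCLM and FALLBACK */

struct free_area_model {
        unsigned long nr_free;
};

/* One MAX_ORDER-sized array of free areas per allocation type */
static struct free_area_model free_area_lists[ALLOC_TYPES][MAX_ORDER];

int main(void)
{
        int type, order;
        unsigned long total = 0;

        /* Totalling free pages now walks every type at each order, which is
         * what the modified show_free_areas() and frag_show() below do */
        for (order = 0; order < MAX_ORDER; order++)
                for (type = 0; type < ALLOC_TYPES; type++)
                        total += free_area_lists[type][order].nr_free << order;

        printf("free pages: %lu\n", total);
        return 0;
}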
Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free-lists for that type, in effect reserving it. Hence, over
time, pages of the different types can be clustered together. This means that
if we wanted a 2^MAX_ORDER block of pages, we could linearly scan a block of
pages allocated for UserReclaimable and page each of them out.
Fallback is used when there are no 2^MAX_ORDER pages available and there
are no free pages of the desired type. The fallback lists were chosen in a
way that keeps the most easily reclaimable pages together.
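The fallback order itself is the fallback_allocs[] table that the
mm/page_alloc.c hunk adds. The small program below only walks that table to
show which pools an allocation of each type would try, in order; it is a
standalone illustration, not the allocator's actual search loop.

#include <stdio.h>

#define ALLOC_TYPES      4
#define ALLOC_KERNNORCLM 0
#define ALLOC_KERNRCLM   1
#define ALLOC_USERRCLM   2
#define ALLOC_FALLBACK   3

/* Same table as in the patch: the first entry in each row is the preferred
 * pool, the remaining entries are tried in order, -1 terminates the row */
static int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES + 1] = {
        { ALLOC_KERNNORCLM, ALLOC_FALLBACK,   ALLOC_KERNRCLM,   ALLOC_USERRCLM, -1 },
        { ALLOC_KERNRCLM,   ALLOC_FALLBACK,   ALLOC_KERNNORCLM, ALLOC_USERRCLM, -1 },
        { ALLOC_USERRCLM,   ALLOC_FALLBACK,   ALLOC_KERNNORCLM, ALLOC_KERNRCLM, -1 },
        { ALLOC_FALLBACK,   ALLOC_KERNNORCLM, ALLOC_KERNRCLM,   ALLOC_USERRCLM, -1 },
};

static const char *names[] = { "KERNNORCLM", "KERNRCLM", "USERRCLM", "FALLBACK" };

int main(void)
{
        int type, i;

        for (type = 0; type < ALLOC_TYPES; type++) {
                printf("%-10s falls back to:", names[type]);
                for (i = 1; fallback_allocs[type][i] != -1; i++)
                        printf(" %s", names[fallback_allocs[type][i]]);
                printf("\n");
        }
        return 0;
}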
Three benchmark results are included, all based on a 2.6.12-rc3 kernel
compiled with gcc 3.3.5 (it is known that gcc 2.95.4 produces significantly
different results). The first is the output of portions of AIM9 for the
vanilla allocator and the modified one:
(Tests run with bench-aim9.sh from VMRegress 0.14)
2.6.12-rc5-standard
------------------------------------------------------------------------------------------------------------
Test Test Elapsed Iteration Iteration Operation
Number Name Time (sec) Count Rate (loops/sec) Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
1 creat-clo 60.00 1109 18.48333 18483.33 File Creations and Closes/second
2 page_test 60.01 4304 71.72138 121926.35 System Allocations & Pages/second
3 brk_test 60.03 1560 25.98701 441779.11 System Memory Allocations/second
4 jmp_test 60.00 251053 4184.21667 4184216.67 Non-local gotos/second
5 signal_test 60.01 5524 92.05132 92051.32 Signal Traps/second
6 exec_test 60.04 779 12.97468 64.87 Program Loads/second
7 fork_test 60.02 927 15.44485 1544.49 Task Creations/second
8 link_test 60.01 6044 100.71655 6345.14 Link/Unlink Pairs/second
2.6.12-rc5-mbuddy-v11
------------------------------------------------------------------------------------------------------------
Test Test Elapsed Iteration Iteration Operation
Number Name Time (sec) Count Rate (loops/sec) Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
1 creat-clo 60.05 1116 18.58451 18584.51 File Creations and Closes/second
2 page_test 60.01 4414 73.55441 125042.49 System Allocations & Pages/second
3 brk_test 60.04 1608 26.78215 455296.47 System Memory Allocations/second
4 jmp_test 60.00 250917 4181.95000 4181950.00 Non-local gotos/second
5 signal_test 60.01 5448 90.78487 90784.87 Signal Traps/second
6 exec_test 60.03 781 13.01016 65.05 Program Loads/second
7 fork_test 60.05 928 15.45379 1545.38 Task Creations/second
8 link_test 60.01 6102 101.68305 6406.03 Link/Unlink Pairs/second
Difference in performance (operations report generated by diff-aim9.sh)
1 creat-clo 18483.33 18584.51 101.18 0.55% File Creations and Closes/second
2 page_test 121926.35 125042.49 3116.14 2.56% System Allocations & Pages/second
3 brk_test 441779.11 455296.47 13517.36 3.06% System Memory Allocations/second
4 jmp_test 4184216.67 4181950.00 -2266.67 -0.05% Non-local gotos/second
5 signal_test 92051.32 90784.87 -1266.45 -1.38% Signal Traps/second
6 exec_test 64.87 65.05 0.18 0.28% Program Loads/second
7 fork_test 1544.49 1545.38 0.89 0.06% Task Creations/second
8 link_test 6345.14 6406.03 60.89 0.96% Link/Unlink Pairs/second
The aim9 results show that there are improvements for common page-related
operations, so we can provide lower fragmentation without a performance
loss. The results are compiler dependent and there are variances of 1-2%
between versions.
The second benchmark tested CPU cache usage to make sure it was not getting
clobbered. The test was to render a large postscript file 10 times and take
the average. The results are:
gsbench-2.6.12-rc4-standard
Average: 42.84 real, 42.754 user, 0.037 sys
gsbench-2.6.12-rc4-mbuddy-v11
Average: 42.907 real, 42.825 user, 0.037 sys
So there are no adverse cache effects. The last test is to show that the
allocator can satisfy more high-order allocations, especially under load,
than the standard allocator. The test performs the following steps:
1. Start updatedb running in the background
2. Load a kernel module that tries to allocate high-order blocks on demand
3. Clean a kernel tree
4. Make 6 copies of the tree. As each copy finishes, a compile starts at -j4
5. Start compiling the primary tree
6. Sleep 3 minutes while the 7 trees are being compiled
7. Use the kernel module to attempt 160 times to allocate a 2^10 block of pages
- note, it only attempts 160 times, no matter how often it succeeds
- An allocation is attempted every 1/10th of a second
- Performance will get badly shot as it forces considerable amounts of pageout
The results of the allocations under load (load averaging 25) were:
2.6.12-rc5 Standard
Order: 10
Attempted allocations: 160
Success allocs: 3
Failed allocs: 108
% Success: 1
2.6.12-rc4 MBuddy V12
Order: 10
Attempted allocations: 160
Success allocs: 63
Failed allocs: 97
% Success: 39
It is important to note that the standard allocator invoked the out-of-memory
killer so often that it killed almost all available processes including X,
sshd and all instances of make and gcc. The patch with the placement policy
never invoked the OOM killer. The downside of the mbuddy allocator is that
it takes a long time for it to free up the MAX_ORDER sized pages as pages
are freed in LRU order.
At rest, after all the compilations have finished, it is known that the
modified allocator can allocate almost all 160 MAX_ORDER-1 blocks of pages.
However, it takes a few attempts and a while, and the machine is at rest,
making it an unrealistic test. It does show that, with a linear scanner
freeing pages, we could almost guarantee the availability of large pages.
The results show that the modified allocator is as fast as, and often faster
than, the normal allocator, has no adverse cache effects, but is far less
fragmented and able to satisfy high-order allocations.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc5-standard/fs/buffer.c linux-2.6.12-rc5-mbuddy-v12/fs/buffer.c
--- linux-2.6.12-rc5-standard/fs/buffer.c 2005-05-25 04:31:20.000000000 +0100
+++ linux-2.6.12-rc5-mbuddy-v12/fs/buffer.c 2005-05-31 09:49:11.000000000 +0100
@@ -1135,7 +1135,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;
- page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, index,
+ GFP_NOFS | __GFP_USERRCLM);
if (!page)
return NULL;
@@ -3056,7 +3057,8 @@ static void recalc_bh_state(void)
struct buffer_head *alloc_buffer_head(unsigned int __nocast gfp_flags)
{
- struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+ struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+ gfp_flags|__GFP_KERNRCLM);
if (ret) {
preempt_disable();
__get_cpu_var(bh_accounting).nr++;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc5-standard/fs/dcache.c linux-2.6.12-rc5-mbuddy-v12/fs/dcache.c
--- linux-2.6.12-rc5-standard/fs/dcache.c 2005-05-25 04:31:20.000000000 +0100
+++ linux-2.6.12-rc5-mbuddy-v12/fs/dcache.c 2005-05-31 09:49:11.000000000 +0100
@@ -719,7 +719,8 @@ struct dentry *d_alloc(struct dentry * p
struct dentry *dentry;
char *dname;
- dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+ dentry = kmem_cache_alloc(dentry_cache,
+ GFP_KERNEL|__GFP_KERNRCLM);
if (!dentry)
return NULL;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc5-standard/fs/ext2/super.c linux-2.6.12-rc5-mbuddy-v12/fs/ext2/super.c
--- linux-2.6.12-rc5-standard/fs/ext2/super.c 2005-05-25 04:31:20.000000000 +0100
+++ linux-2.6.12-rc5-mbuddy-v12/fs/ext2/super.c 2005-05-31 09:49:11.000000000 +0100
@@ -137,7 +137,7 @@ static kmem_cache_t * ext2_inode_cachep;
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
struct ext2_inode_info *ei;
- ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+ ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc5-standard/fs/ext3/super.c linux-2.6.12-rc5-mbuddy-v12/fs/ext3/super.c
--- linux-2.6.12-rc5-standard/fs/ext3/super.c 2005-05-25 04:31:20.000000000 +0100
+++ linux-2.6.12-rc5-mbuddy-v12/fs/ext3/super.c 2005-05-31 09:49:11.000000000 +0100
@@ -440,7 +440,7 @@ static struct inode *ext3_alloc_inode(st
{
struct ext3_inode_info *ei;
- ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+ ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc5-standard/fs/ntfs/inode.c linux-2.6.12-rc5-mbuddy-v12/fs/ntfs/inode.c
--- linux-2.6.12-rc5-standard/fs/ntfs/inode.c 2005-05-25 04:31:20.000000000 +0100
+++ linux-2.6.12-rc5-mbuddy-v12/fs/ntfs/inode.c 2005-05-31 09:49:11.000000000 +0100
@@ -318,7 +318,7 @@ struct inode *ntfs_alloc_big_inode(struc
ntfs_debug("Entering.");
ni = (ntfs_inode *)kmem_cache_alloc(ntfs_big_inode_cache,
- SLAB_NOFS);
+ SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return VFS_I(ni);
@@ -343,7 +343,8 @@ static inline ntfs_inode *ntfs_alloc_ext
ntfs_inode *ni;
ntfs_debug("Entering.");
- ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+ ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache,
+ SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return ni;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc5-standard/include/linux/gfp.h linux-2.6.12-rc5-mbuddy-v12/include/linux/gfp.h
--- linux-2.6.12-rc5-standard/include/linux/gfp.h 2005-05-25 04:31:20.000000000 +0100
+++ linux-2.6.12-rc5-mbuddy-v12/include/linux/gfp.h 2005-05-31 09:49:11.000000000 +0100
@@ -39,7 +39,10 @@ struct vm_area_struct;
#define __GFP_COMP 0x4000u /* Add compound page metadata */
#define __GFP_ZERO 0x8000u /* Return zeroed page on success */
#define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
+#define __GFP_KERNRCLM 0x20000u /* Kernel page that is easily reclaimable */
+#define __GFP_USERRCLM 0x40000u /* Userspace page that is easily reclaimable */
+#define __GFP_TYPE_SHIFT 17 /* Translate RCLM flags to array index */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -47,14 +50,14 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC)
+ __GFP_NOMEMALLOC|__GFP_KERNRCLM|__GFP_USERRCLM)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_USERRCLM )
+#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_USERRCLM)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc5-standard/include/linux/mmzone.h linux-2.6.12-rc5-mbuddy-v12/include/linux/mmzone.h
--- linux-2.6.12-rc5-standard/include/linux/mmzone.h 2005-05-25 04:31:20.000000000 +0100
+++ linux-2.6.12-rc5-mbuddy-v12/include/linux/mmzone.h 2005-05-31 09:52:20.000000000 +0100
@@ -21,6 +21,16 @@
#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
#endif
+/* Page allocations are divided into these types */
+#define ALLOC_TYPES 4
+#define ALLOC_KERNNORCLM 0
+#define ALLOC_KERNRCLM 1
+#define ALLOC_USERRCLM 2
+#define ALLOC_FALLBACK 3
+
+/* Number of bits required to encode the type */
+#define BITS_PER_ALLOC_TYPE 2
+
struct free_area {
struct list_head free_list;
unsigned long nr_free;
@@ -43,12 +53,28 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif
+/*
+ * Shared per-cpu lists would cause fragmentation over time
+ * The pcpu_list is to keep kernel and userrclm allocations
+ * apart while still allowing all allocation types to have
+ * per-cpu lists
+ */
+struct pcpu_list {
+ int count;
+ struct list_head list;
+} ____cacheline_aligned_in_smp;
+
+
+/* Indices into pcpu_list */
+#define PCPU_KERNEL 0
+#define PCPU_USER 1
struct per_cpu_pages {
- int count; /* number of pages in the list */
+ struct pcpu_list pcpu_list[2]; /* PCPU_KERNEL: kernel
+ * PCPU_USER: user
+ */
int low; /* low watermark, refill needed */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
- struct list_head list; /* the list of pages */
};
struct per_cpu_pageset {
@@ -128,8 +154,34 @@ struct zone {
* free areas of different sizes
*/
spinlock_t lock;
- struct free_area free_area[MAX_ORDER];
+ /*
+ * The map tracks what each 2^MAX_ORDER-1 sized block is being used for
+ * using BITS_PER_ALLOC_TYPE number of bits. Bits are set when a large
+ * block is being split and rechecked during freeing of the page
+ */
+ unsigned long *free_area_usemap;
+
+ /*
+ * There are ALLOC_TYPE number of MAX_ORDER free lists. Once a
+ * MAX_ORDER block of pages has been split for an allocation type,
+ * the whole block is reserved for that type of allocation.
+ */
+ struct free_area free_area_lists[ALLOC_TYPES][MAX_ORDER];
+
+ /*
+ * A percentage of a zone is reserved for falling back to. Without
+ * a fallback, memory will slowly fragment over time meaning the
+ * placement policy only delays the fragmentation problem, not
+ * fixes it
+ */
+ unsigned long fallback_reserve;
+
+ /*
+ * When negative, 2^MAX_ORDER-1 sized blocks of pages will be reserved
+ * for fallbacks
+ */
+ long fallback_balance;
ZONE_PADDING(_pad1_)
@@ -212,6 +264,11 @@ struct zone {
char *name;
} ____cacheline_maxaligned_in_smp;
+#define inc_reserve_count(zone, type) \
+ type == ALLOC_FALLBACK ? zone->fallback_reserve++ : 0
+#define dec_reserve_count(zone, type) \
+ (type == ALLOC_FALLBACK && zone->fallback_reserve) ? \
+ zone->fallback_reserve-- : 0
/*
* The "priority" of VM scanning is how much of the queues we will scan in one
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc5-standard/mm/page_alloc.c linux-2.6.12-rc5-mbuddy-v12/mm/page_alloc.c
--- linux-2.6.12-rc5-standard/mm/page_alloc.c 2005-05-25 04:31:20.000000000 +0100
+++ linux-2.6.12-rc5-mbuddy-v12/mm/page_alloc.c 2005-05-31 10:08:00.000000000 +0100
@@ -64,6 +64,25 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
EXPORT_SYMBOL(totalram_pages);
EXPORT_SYMBOL(nr_swap_pages);
+/**
+ * The allocator tries to put allocations of the same type in the
+ * same 2^MAX_ORDER-1 blocks of pages. When memory is low, this may
+ * not be possible so this array describes what order allocations should
+ * fall back to
+ *
+ * The order of the fallback is chosen to keep the userrclm and kernrclm
+ * pools as low in fragmentation as possible. The FALLBACK pool is never
+ * directly used but acts as a reserve to allocate pages from when the
+ * normal pools are depleted.
+ *
+ */
+int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES+1] = {
+ {ALLOC_KERNNORCLM,ALLOC_FALLBACK, ALLOC_KERNRCLM, ALLOC_USERRCLM,-1},
+ {ALLOC_KERNRCLM, ALLOC_FALLBACK, ALLOC_KERNNORCLM,ALLOC_USERRCLM,-1},
+ {ALLOC_USERRCLM, ALLOC_FALLBACK, ALLOC_KERNNORCLM,ALLOC_KERNRCLM,-1},
+ {ALLOC_FALLBACK, ALLOC_KERNNORCLM,ALLOC_KERNRCLM, ALLOC_USERRCLM,-1}
+};
+
/*
* Used by page_zone() to look up the address of the struct zone whose
* id is encoded in the upper bits of page->flags
@@ -118,6 +137,64 @@ static void bad_page(const char *functio
tainted |= TAINT_BAD_PAGE;
}
+/*
+ * Return what type of page is being allocated from this 2^MAX_ORDER-1 block
+ * of pages. A bitmap is used because a char array performs slightly slower
+ */
+static inline unsigned int get_pageblock_type(struct zone *zone, struct page *page) {
+ int bitidx = (page_to_pfn(page) >> (MAX_ORDER-1)) * BITS_PER_ALLOC_TYPE;
+ unsigned int type = !!test_bit(bitidx, zone->free_area_usemap);
+ int i;
+
+ for (i=1; i < BITS_PER_ALLOC_TYPE; i++) {
+ type = (type << 1) | (!!test_bit(bitidx+i, zone->free_area_usemap));
+ }
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ if (type >= ALLOC_TYPES) {
+ printk("\nBogus type in get_pageblock_type: %u\n", type);
+ BUG();
+ }
+#endif
+ return type;
+}
+
+/*
+ * Reserve a block of pages for an allocation type
+ */
+static inline void set_pageblock_type(struct zone *zone, struct page *page,
+ int type) {
+ int bitidx = (page_to_pfn(page) >> (MAX_ORDER-1)) * BITS_PER_ALLOC_TYPE;
+
+ /* Bits set match the alloc types defined in mmzone.h */
+ switch (type) {
+ case ALLOC_KERNRCLM:
+ clear_bit(bitidx, zone->free_area_usemap);
+ set_bit(bitidx+1, zone->free_area_usemap);
+ break;
+
+ case ALLOC_USERRCLM:
+ set_bit(bitidx, zone->free_area_usemap);
+ clear_bit(bitidx+1, zone->free_area_usemap);
+ break;
+
+ default:
+ set_bit(bitidx, zone->free_area_usemap);
+ set_bit(bitidx+1, zone->free_area_usemap);
+ break;
+ }
+}
+
+/*
+ * A percentage of a zone is reserved for allocations to fallback to. These
+ * macros calculate if a fallback reserve is already reserved and if it is
+ * needed
+ */
+#define need_min_fallback_reserve(zone) \
+ (zone->free_pages >> (MAX_ORDER) < zone->fallback_reserve)
+#define is_min_fallback_reserved(zone) \
+ (zone->fallback_balance < 0)
+
#ifndef CONFIG_HUGETLB_PAGE
#define prep_compound_page(page, order) do { } while (0)
#define destroy_compound_page(page, order) do { } while (0)
@@ -276,6 +353,8 @@ static inline void __free_pages_bulk (st
{
unsigned long page_idx;
int order_size = 1 << order;
+ struct free_area *area;
+ struct free_area *freelist;
if (unlikely(order))
destroy_compound_page(page, order);
@@ -285,12 +364,14 @@ static inline void __free_pages_bulk (st
BUG_ON(page_idx & (order_size - 1));
BUG_ON(bad_range(zone, page));
+ /* Select the areas to place free pages on */
+ freelist = zone->free_area_lists[get_pageblock_type(zone,page)];
+
zone->free_pages += order_size;
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
- struct free_area *area;
struct page *buddy;
-
+
combined_idx = __find_combined_index(page_idx, order);
buddy = __page_find_buddy(page, page_idx, order);
@@ -299,16 +380,19 @@ static inline void __free_pages_bulk (st
if (!page_is_buddy(buddy, order))
break; /* Move the buddy up one level. */
list_del(&buddy->lru);
- area = zone->free_area + order;
+ area = freelist + order;
area->nr_free--;
rmv_page_order(buddy);
page = page + (combined_idx - page_idx);
page_idx = combined_idx;
order++;
}
+
+ if (unlikely(order == MAX_ORDER-1)) zone->fallback_balance++;
set_page_order(page, order);
- list_add(&page->lru, &zone->free_area[order].free_list);
- zone->free_area[order].nr_free++;
+ area = freelist + order;
+ list_add_tail(&page->lru, &area->free_list);
+ area->nr_free++;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -460,57 +544,211 @@ static void prep_new_page(struct page *p
kernel_map_pages(page, 1 << order, 1);
}
+/*
+ * Find a list that has a 2^MAX_ORDER-1 block of pages available and
+ * return it
+ */
+static inline struct page* steal_largepage(struct zone *zone, int alloctype) {
+ struct page *page;
+ struct free_area *area;
+ int i=0;
+
+ /* Search the other allocation type lists */
+ while (i < ALLOC_TYPES) {
+ area = &(zone->free_area_lists[i][MAX_ORDER-1]);
+ if (!list_empty(&(area->free_list))) break;
+ if (++i == alloctype) i++;
+ }
+ if (i == ALLOC_TYPES) return NULL;
+
+ /*
+ * Remove a MAX_ORDER block from the global pool and add
+ * it to the list of desired alloc_type
+ */
+ page = list_entry(area->free_list.next, struct page, lru);
+ area->nr_free--;
+
+ /*
+ * Reserve this whole block of pages. When the pool shrinks, a
+ * percentage will be reserved for fallbacks.
+ */
+ if (!is_min_fallback_reserved(zone) &&
+ need_min_fallback_reserve(zone)) {
+ alloctype = ALLOC_FALLBACK;
+ }
+
+ set_pageblock_type(zone, page, alloctype);
+ dec_reserve_count(zone, i);
+ inc_reserve_count(zone, alloctype);
+
+ return page;
+
+}
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int alloctype)
{
struct free_area * area;
unsigned int current_order;
struct page *page;
+ int *fallback_list;
+ int start_alloctype;
- for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
- if (list_empty(&area->free_list))
- continue;
+ alloctype >>= __GFP_TYPE_SHIFT;
+
+ /* Search the list for the alloctype */
+ area = zone->free_area_lists[alloctype] + order;
+ for (current_order = order;
+ current_order < MAX_ORDER;
+ current_order++, area++) {
+ if (list_empty(&area->free_list)) continue;
page = list_entry(area->free_list.next, struct page, lru);
- list_del(&page->lru);
- rmv_page_order(page);
area->nr_free--;
- zone->free_pages -= 1UL << order;
- return expand(zone, page, order, current_order, area);
+ goto remove_page;
+ }
+
+ /* Allocate from the global list if the preferred free list is empty */
+ if ((page = steal_largepage(zone, alloctype)) != NULL) {
+ area--;
+ current_order--;
+ goto remove_page;
}
+ /* Ok, pick the fallback order based on the type */
+ fallback_list = fallback_allocs[alloctype];
+ start_alloctype = alloctype;
+
+ /*
+ * Here, the alloc type lists have been depleted as well as the global
+ * pool, so fallback. When falling back, the largest possible block
+ * will be taken to keep the fallbacks clustered if possible
+ */
+ while ((alloctype = *(++fallback_list)) != -1) {
+
+ if (alloctype < 0 || alloctype >= ALLOC_TYPES) BUG();
+
+ /* Find a block to allocate */
+ area = zone->free_area_lists[alloctype] + MAX_ORDER;
+ current_order=MAX_ORDER;
+ do {
+ current_order--;
+ area--;
+ if (!list_empty(&area->free_list)) {
+
+ page = list_entry(area->free_list.next, struct page, lru);
+ area->nr_free--;
+ goto fallback_alloc;
+ }
+
+ } while (current_order != order);
+
+ }
+
return NULL;
+
+fallback_alloc:
+ /*
+ * If we are falling back, and the allocation is KERNNORCLM,
+ * then reserve any buddies for the KERNNORCLM pool. These
+ * allocations fragment the worst so this helps keep them
+ * in the one place
+ */
+ if (start_alloctype == ALLOC_KERNNORCLM) {
+ area = zone->free_area_lists[ALLOC_KERNNORCLM] + current_order;
+
+ /* Reserve the whole block if this is a large split */
+ if (current_order >= MAX_ORDER / 2) {
+ int reserve_type=ALLOC_KERNNORCLM;
+ dec_reserve_count(zone, get_pageblock_type(zone,page));
+
+ /*
+ * Use this block for fallbacks if the
+ * minimum reserve is not being met
+ */
+ if (!is_min_fallback_reserved(zone))
+ reserve_type = ALLOC_FALLBACK;
+
+ set_pageblock_type(zone, page, reserve_type);
+ inc_reserve_count(zone, reserve_type);
+ }
+
+ }
+
+remove_page:
+ /*
+ * At this point, page is expected to be the page we are
+ * about to remove and the area->nr_free count should have been
+ * updated
+ */
+ if (unlikely(current_order == MAX_ORDER-1)) zone->fallback_balance--;
+ list_del(&page->lru);
+ rmv_page_order(page);
+ zone->free_pages -= 1UL << order;
+ return expand(zone, page, order, current_order, area);
}
/*
- * Obtain a specified number of elements from the buddy allocator, all under
- * a single hold of the lock, for efficiency. Add them to the supplied list.
- * Returns the number of new pages which were placed at *list.
+ * Obtain a specified number of order-0 elements from the buddy allocator, all
+ * under a single hold of the lock, for efficiency. Add them to the supplied
+ * list. An attempt is made to keep the allocated memory in physically
+ * contiguous blocks. Returns the number of new pages which were placed at
+ * *list.
+ *
*/
-static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+static int rmqueue_bulk(struct zone *zone,
+ unsigned long count, struct list_head *list,
+ int alloctype)
{
unsigned long flags;
int i;
- int allocated = 0;
+ unsigned long allocated = count;
struct page *page;
+ unsigned long current_order= 0;
+ /* Find what order we should start allocating blocks at */
+ current_order = ffs(count) - 1;
+
spin_lock_irqsave(&zone->lock, flags);
- for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
- if (page == NULL)
- break;
- allocated++;
- list_add_tail(&page->lru, list);
+
+ /*
+ * Satisfy the request in as the largest possible physically
+ * contiguous block
+ */
+ while (allocated) {
+ if ((1 << current_order) > allocated)
+ current_order--;
+
+ /* Allocate a block at the current_order */
+ page = __rmqueue(zone, current_order, alloctype);
+ if (page == NULL) {
+ if (current_order == 0) break;
+ current_order--;
+ continue;
+ }
+ allocated -= 1 << current_order;
+
+ /* Move to the next block if order is already 0 */
+ if (current_order == 0) {
+ list_add_tail(&page->lru, list);
+ continue;
+ }
+
+ /* Split the large block into order-sized blocks */
+ for (i = 1 << current_order; i != 0; i--) {
+ list_add_tail(&page->lru, list);
+ page++;
+ }
}
+
spin_unlock_irqrestore(&zone->lock, flags);
- return allocated;
+ return count - allocated;
}
+
+
#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
static void __drain_pages(unsigned int cpu)
{
@@ -525,8 +763,11 @@ static void __drain_pages(unsigned int c
struct per_cpu_pages *pcp;
pcp = &pset->pcp[i];
- pcp->count -= free_pages_bulk(zone, pcp->count,
- &pcp->list, 0);
+ pcp->pcpu_list[0].count -= free_pages_bulk(zone, pcp->pcpu_list[0].count,
+ &pcp->pcpu_list[0].list, 0);
+
+ pcp->pcpu_list[1].count -= free_pages_bulk(zone, pcp->pcpu_list[1].count,
+ &pcp->pcpu_list[1].list, 0);
}
}
}
@@ -537,7 +778,7 @@ static void __drain_pages(unsigned int c
void mark_free_pages(struct zone *zone)
{
unsigned long zone_pfn, flags;
- int order;
+ int order, type;
struct list_head *curr;
if (!zone->spanned_pages)
@@ -547,14 +788,17 @@ void mark_free_pages(struct zone *zone)
for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));
- for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
- unsigned long start_pfn, i;
+ for (type=0; type < ALLOC_TYPES; type++) {
- start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
+ for (order = MAX_ORDER - 1; order >= 0; --order)
+ list_for_each(curr, &zone->free_area_lists[type][order].free_list) {
+ unsigned long start_pfn, i;
+
+ start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
- for (i=0; i < (1<<order); i++)
- SetPageNosaveFree(pfn_to_page(start_pfn+i));
+ for (i=0; i < (1<<order); i++)
+ SetPageNosaveFree(pfn_to_page(start_pfn+i));
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -607,6 +851,7 @@ static void fastcall free_hot_cold_page(
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
+ struct pcpu_list *plist;
arch_free_page(page, 0);
@@ -616,11 +861,20 @@ static void fastcall free_hot_cold_page(
page->mapping = NULL;
free_pages_check(__FUNCTION__, page);
pcp = &zone->pageset[get_cpu()].pcp[cold];
+
+ /*
+ * Strictly speaking, we should not be accessing the zone information
+ * here. In this case, it does not matter if the read is incorrect
+ */
+ if (get_pageblock_type(zone, page) == ALLOC_USERRCLM)
+ plist = &pcp->pcpu_list[1];
+ else
+ plist = &pcp->pcpu_list[0];
local_irq_save(flags);
- if (pcp->count >= pcp->high)
- pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
- list_add(&page->lru, &pcp->list);
- pcp->count++;
+ if (plist->count >= pcp->high)
+ plist->count -= free_pages_bulk(zone, pcp->batch, &plist->list, 0);
+ list_add(&page->lru, &plist->list);
+ plist->count++;
local_irq_restore(flags);
put_cpu();
}
@@ -650,7 +904,7 @@ static inline void prep_zero_page(struct
* or two.
*/
static struct page *
-buffered_rmqueue(struct zone *zone, int order, unsigned int __nocast gfp_flags)
+buffered_rmqueue(struct zone *zone, int order, unsigned int __nocast gfp_flags, int alloctype)
{
unsigned long flags;
struct page *page = NULL;
@@ -658,16 +912,22 @@ buffered_rmqueue(struct zone *zone, int
if (order == 0) {
struct per_cpu_pages *pcp;
+ struct pcpu_list *plist;
pcp = &zone->pageset[get_cpu()].pcp[cold];
local_irq_save(flags);
- if (pcp->count <= pcp->low)
- pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
- if (pcp->count) {
- page = list_entry(pcp->list.next, struct page, lru);
+
+ if (alloctype == __GFP_USERRCLM) plist = &pcp->pcpu_list[1];
+ else plist = &pcp->pcpu_list[0];
+
+ if (plist->count <= pcp->low)
+ plist->count += rmqueue_bulk(zone,
+ pcp->batch, &plist->list,
+ alloctype);
+ if (plist->count) {
+ page = list_entry(plist->list.next, struct page, lru);
list_del(&page->lru);
- pcp->count--;
+ plist->count--;
}
local_irq_restore(flags);
put_cpu();
@@ -675,7 +935,7 @@ buffered_rmqueue(struct zone *zone, int
if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, alloctype);
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -702,6 +962,7 @@ int zone_watermark_ok(struct zone *z, in
{
/* free_pages my go negative - that's OK */
long min = mark, free_pages = z->free_pages - (1 << order) + 1;
+ struct free_area *kernnorclm, *kernrclm, *userrclm;
int o;
if (gfp_high)
@@ -711,15 +972,25 @@ int zone_watermark_ok(struct zone *z, in
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return 0;
+ kernnorclm = z->free_area_lists[ALLOC_KERNNORCLM];
+ kernrclm = z->free_area_lists[ALLOC_KERNRCLM];
+ userrclm = z->free_area_lists[ALLOC_USERRCLM];
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;
+ free_pages -= (
+ kernnorclm->nr_free +
+ kernrclm->nr_free +
+ userrclm->nr_free) << o;
/* Require fewer higher order pages to be free */
min >>= 1;
if (free_pages <= min)
return 0;
+
+ kernnorclm++;
+ kernrclm++;
+ userrclm++;
}
return 1;
}
@@ -741,6 +1012,8 @@ __alloc_pages(unsigned int __nocast gfp_
int do_retry;
int can_try_harder;
int did_some_progress;
+ int alloctype;
+ int highorder_retry=3;
might_sleep_if(wait);
@@ -760,6 +1033,13 @@ __alloc_pages(unsigned int __nocast gfp_
classzone_idx = zone_idx(zones[0]);
+ /*
+ * Find what type of allocation this is. Later, this value will
+ * be shifted __GFP_TYPE_SHIFT bits to the right to give an
+ * index within the zones freelist
+ */
+ alloctype = (gfp_mask & (__GFP_USERRCLM | __GFP_KERNRCLM));
+
restart:
/* Go through the zonelist once, looking for a zone with enough free */
for (i = 0; (z = zones[i]) != NULL; i++) {
@@ -771,7 +1051,7 @@ __alloc_pages(unsigned int __nocast gfp_
if (!cpuset_zone_allowed(z))
continue;
- page = buffered_rmqueue(z, order, gfp_mask);
+ page = buffered_rmqueue(z, order, gfp_mask, alloctype);
if (page)
goto got_pg;
}
@@ -795,7 +1075,7 @@ __alloc_pages(unsigned int __nocast gfp_
if (wait && !cpuset_zone_allowed(z))
continue;
- page = buffered_rmqueue(z, order, gfp_mask);
+ page = buffered_rmqueue(z, order, gfp_mask, alloctype);
if (page)
goto got_pg;
}
@@ -809,7 +1089,8 @@ __alloc_pages(unsigned int __nocast gfp_
for (i = 0; (z = zones[i]) != NULL; i++) {
if (!cpuset_zone_allowed(z))
continue;
- page = buffered_rmqueue(z, order, gfp_mask);
+ page = buffered_rmqueue(z, order, gfp_mask,
+ alloctype);
if (page)
goto got_pg;
}
@@ -852,7 +1133,7 @@ rebalance:
if (!cpuset_zone_allowed(z))
continue;
- page = buffered_rmqueue(z, order, gfp_mask);
+ page = buffered_rmqueue(z, order, gfp_mask, alloctype);
if (page)
goto got_pg;
}
@@ -871,12 +1152,19 @@ rebalance:
if (!cpuset_zone_allowed(z))
continue;
- page = buffered_rmqueue(z, order, gfp_mask);
+ page = buffered_rmqueue(z, order, gfp_mask, alloctype);
if (page)
goto got_pg;
}
- out_of_memory(gfp_mask);
+ if (order < MAX_ORDER/2) out_of_memory(gfp_mask);
+
+ /*
+ * Due to low fragmentation efforts, we should try a little
+ * harder to satisfy high order allocations
+ */
+ if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+ goto rebalance;
goto restart;
}
@@ -893,6 +1181,8 @@ rebalance:
do_retry = 1;
if (gfp_mask & __GFP_NOFAIL)
do_retry = 1;
+ if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+ do_retry=1;
}
if (do_retry) {
blk_congestion_wait(WRITE, HZ/50);
@@ -1220,6 +1510,7 @@ void show_free_areas(void)
unsigned long inactive;
unsigned long free;
struct zone *zone;
+ int type;
for_each_zone(zone) {
show_node(zone);
@@ -1312,8 +1603,10 @@ void show_free_areas(void)
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
- nr = zone->free_area[order].nr_free;
- total += nr << order;
+ for (type=0; type < ALLOC_TYPES; type++) {
+ nr = zone->free_area_lists[type][order].nr_free;
+ total += nr << order;
+ }
printk("%lu*%lukB ", nr, K(1UL) << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1609,10 +1902,19 @@ void zone_init_free_lists(struct pglist_
unsigned long size)
{
int order;
- for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- zone->free_area[order].nr_free = 0;
+ int type;
+ struct free_area *area;
+
+ /* Initialise the size-ordered lists of free_areas for each alloc type */
+ for (type=0; type < ALLOC_TYPES; type++) {
+ for (order = 0; order < MAX_ORDER; order++) {
+ area = zone->free_area_lists[type];
+
+ INIT_LIST_HEAD(&area[order].free_list);
+ area[order].nr_free = 0;
+ }
}
+
}
#ifndef __HAVE_ARCH_MEMMAP_INIT
@@ -1621,6 +1923,25 @@ void zone_init_free_lists(struct pglist_
#endif
/*
+ * Calculate the size of the zone->usemap in bytes rounded to an unsigned long
+ */
+static unsigned long __init usemap_size(unsigned long zonesize) {
+ unsigned long usemapsize;
+
+ /* Rounded-up number of MAX_ORDER-1 blocks */
+ usemapsize = (zonesize + (1 << (MAX_ORDER-1)) - 1) >> (MAX_ORDER-1);
+
+ /* BITS_PER_ALLOC_TYPE bits to record what type of block it is */
+ usemapsize *= BITS_PER_ALLOC_TYPE;
+
+ /* Round the number of bits to the nearest unsigned long */
+ usemapsize = usemapsize + (sizeof(unsigned long) * 8 + BITS_PER_LONG-1);
+
+ /* Return size in bytes */
+ return usemapsize / 8;
+}
+
+/*
* Set up the zone data structures:
* - mark all pages reserved
* - mark all memory queues empty
@@ -1633,6 +1954,7 @@ static void __init free_area_init_core(s
const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
int cpu, nid = pgdat->node_id;
unsigned long zone_start_pfn = pgdat->node_start_pfn;
+ unsigned long usemapsize;
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
@@ -1659,6 +1981,11 @@ static void __init free_area_init_core(s
spin_lock_init(&zone->lru_lock);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
+ zone->fallback_reserve = 0;
+
+ /* Set the balance so about 12.5% will be used for fallbacks */
+ zone->fallback_balance = (realsize >> (MAX_ORDER-1)) -
+ (realsize >> (MAX_ORDER+2));
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
@@ -1692,18 +2019,22 @@ static void __init free_area_init_core(s
struct per_cpu_pages *pcp;
pcp = &zone->pageset[cpu].pcp[0]; /* hot */
- pcp->count = 0;
+ pcp->pcpu_list[0].count = 0;
+ pcp->pcpu_list[1].count = 0;
pcp->low = 2 * batch;
pcp->high = 6 * batch;
pcp->batch = 1 * batch;
- INIT_LIST_HEAD(&pcp->list);
+ INIT_LIST_HEAD(&pcp->pcpu_list[0].list);
+ INIT_LIST_HEAD(&pcp->pcpu_list[1].list);
pcp = &zone->pageset[cpu].pcp[1]; /* cold */
- pcp->count = 0;
+ pcp->pcpu_list[0].count = 0;
+ pcp->pcpu_list[1].count = 0;
pcp->low = 0;
pcp->high = 2 * batch;
pcp->batch = 1 * batch;
- INIT_LIST_HEAD(&pcp->list);
+ INIT_LIST_HEAD(&pcp->pcpu_list[0].list);
+ INIT_LIST_HEAD(&pcp->pcpu_list[1].list);
}
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
zone_names[j], realsize, batch);
@@ -1743,6 +2074,18 @@ static void __init free_area_init_core(s
zone_start_pfn += size;
zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+
+ usemapsize = usemap_size(size);
+
+ zone->free_area_usemap =
+ (unsigned long *)alloc_bootmem_node(pgdat,
+ usemapsize);
+
+ memset((unsigned long *)zone->free_area_usemap,
+ ALLOC_KERNNORCLM, usemapsize);
+
+ printk(KERN_DEBUG " %s zone: %lu pages, %lu real pages, usemap size:%lu\n",
+ zone_names[j], size, realsize, usemapsize);
}
}
@@ -1830,19 +2173,38 @@ static int frag_show(struct seq_file *m,
struct zone *zone;
struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
- int order;
+ int order, type;
+ struct list_head *elem;
+ unsigned long nr_bufs = 0;
+ /* Show global fragmentation statistics */
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
continue;
spin_lock_irqsave(&zone->lock, flags);
- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
- spin_unlock_irqrestore(&zone->lock, flags);
- seq_putc(m, '\n');
- }
+ seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+ for (order = 0; order < MAX_ORDER-1; ++order) {
+ nr_bufs = 0;
+
+ for (type=0; type < ALLOC_TYPES; type++) {
+ list_for_each(elem, &(zone->free_area_lists[type][order].free_list))
+ ++nr_bufs;
+ }
+ seq_printf(m, "%6lu ", nr_bufs);
+ }
+
+ /* Scan global list */
+ nr_bufs = 0;
+ for (type=0; type < ALLOC_TYPES; type++) {
+ nr_bufs += zone->free_area_lists[type][MAX_ORDER-1].nr_free;
+ }
+ seq_printf(m, "%6lu ", nr_bufs);
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+ seq_putc(m, '\n');
+ }
+
return 0;
}
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-05-31 11:20 Avoiding external fragmentation with a placement policy Version 12 Mel Gorman
@ 2005-06-01 20:55 ` Joel Schopp
2005-06-01 23:09 ` Nick Piggin
2005-06-02 9:49 ` Mel Gorman
0 siblings, 2 replies; 42+ messages in thread
From: Joel Schopp @ 2005-06-01 20:55 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, linux-kernel, akpm
> - struct free_area *area;
> struct page *buddy;
> -
> +
...
> }
> +
> spin_unlock_irqrestore(&zone->lock, flags);
> - return allocated;
> + return count - allocated;
> }
>
> +
> +
Other than the very minor whitespace changes above I have nothing bad to
say about this patch. I think it is about time to pick it up in -mm for
wider testing.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 20:55 ` Joel Schopp
@ 2005-06-01 23:09 ` Nick Piggin
2005-06-01 23:23 ` David S. Miller, Nick Piggin
` (2 more replies)
2005-06-02 9:49 ` Mel Gorman
1 sibling, 3 replies; 42+ messages in thread
From: Nick Piggin @ 2005-06-01 23:09 UTC (permalink / raw)
To: jschopp; +Cc: Mel Gorman, linux-mm, linux-kernel, akpm
Joel Schopp wrote:
>
> Other than the very minor whitespace changes above I have nothing bad to
> say about this patch. I think it is about time to pick it up in -mm for
> wider testing.
>
It adds a lot of complexity to the page allocator and while
it might be very good, the only improvement we've been shown
yet is allocating lots of MAX_ORDER allocations I think? (ie.
not very useful)
--
SUSE Labs, Novell Inc.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 23:09 ` Nick Piggin
@ 2005-06-01 23:23 ` David S. Miller, Nick Piggin
2005-06-01 23:28 ` Martin J. Bligh
2005-06-01 23:47 ` Mike Kravetz
2 siblings, 0 replies; 42+ messages in thread
From: David S. Miller, Nick Piggin @ 2005-06-01 23:23 UTC (permalink / raw)
To: nickpiggin; +Cc: jschopp, mel, linux-mm, linux-kernel, akpm
> It adds a lot of complexity to the page allocator and while
> it might be very good, the only improvement we've been shown
> yet is allocating lots of MAX_ORDER allocations I think? (ie.
> not very useful)
I've been silently sitting back and considering how much this kind of
patch could help with page coloring, all existing implementations of
which fall apart due to buddy allocator fragmentation issues.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 23:09 ` Nick Piggin
2005-06-01 23:23 ` David S. Miller, Nick Piggin
@ 2005-06-01 23:28 ` Martin J. Bligh
2005-06-01 23:43 ` Nick Piggin
` (2 more replies)
2005-06-01 23:47 ` Mike Kravetz
2 siblings, 3 replies; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-01 23:28 UTC (permalink / raw)
To: Nick Piggin, jschopp; +Cc: Mel Gorman, linux-mm, linux-kernel, akpm
--On Thursday, June 02, 2005 09:09:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Joel Schopp wrote:
>
>>
>> Other than the very minor whitespace changes above I have nothing bad to
>> say about this patch. I think it is about time to pick it up in -mm for
>> wider testing.
>>
>
> It adds a lot of complexity to the page allocator and while
> it might be very good, the only improvement we've been shown
> yet is allocating lots of MAX_ORDER allocations I think? (ie.
> not very useful)
I agree that MAX_ORDER allocs aren't interesting, but we can hit
frag problems easily at way less than max order. CIFS does it, NFS
does it, jumbo frame gigabit ethernet does it, to name a few. The
most common failure I see is order 3.
Keep a machine up for a while, get it thoroughly fragmented, then
push it reasonably hard constant pressure, and try allocating anything
large.
Seems to me we're basically pointing a blunderbuss at memory, and
blowing away large portions, and *hoping* something falls out the
bottom that's a big enough chunk?
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 23:28 ` Martin J. Bligh
@ 2005-06-01 23:43 ` Nick Piggin
2005-06-02 0:02 ` Martin J. Bligh
2005-06-02 13:15 ` Mel Gorman
[not found] ` <20050603174706.GA25663@localhost.localdomain>
2 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-06-01 23:43 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
Martin J. Bligh wrote:
> --On Thursday, June 02, 2005 09:09:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>
>>It adds a lot of complexity to the page allocator and while
>>it might be very good, the only improvement we've been shown
>>yet is allocating lots of MAX_ORDER allocations I think? (ie.
>>not very useful)
>
>
> I agree that MAX_ORDER allocs aren't interesting, but we can hit
> frag problems easily at way less than max order. CIFS does it, NFS
> does it, jumbo frame gigabit ethernet does it, to name a few. The
> most common failure I see is order 3.
>
Still? We had a lot of problems with kswapd not doing its
job properly, and min_free_kbytes reserve was buggy...
But if you still trigger it, I would be interested to see
traces. I don't frequently test things like XFS, or heavy
gige+jumbo loads.
> Keep a machine up for a while, get it thoroughly fragmented, then
> push it reasonably hard constant pressure, and try allocating anything
> large.
>
> Seems to me we're basically pointing a blunderbuss at memory, and
> blowing away large portions, and *hoping* something falls out the
> bottom that's a big enough chunk?
>
Yeah more or less. But with the fragmentation patch, it by
no means becomes an exact science ;) I wouldn't have thought
it would make it hugely easier to free an order 2 or 3 area
memory block on a loaded machine.
It does make MAX_ORDER allocations _possible_ when previously
they wouldn't have been, simply by virtue of trying to put all
memory that it knows is reclaimable in a MAX_ORDER area. When
memory fills up and you need an order 3 allocation, you're
more or less in the same boat AFAIKS.
Why not just have kernel allocations going from the bottom
up, and user allocations going from the top down. That would
get you most of the way there, wouldn't it? (disclaimer: I
could well be talking shit here).
--
SUSE Labs, Novell Inc.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 23:09 ` Nick Piggin
2005-06-01 23:23 ` David S. Miller, Nick Piggin
2005-06-01 23:28 ` Martin J. Bligh
@ 2005-06-01 23:47 ` Mike Kravetz
2005-06-01 23:56 ` Nick Piggin
2 siblings, 1 reply; 42+ messages in thread
From: Mike Kravetz @ 2005-06-01 23:47 UTC (permalink / raw)
To: Nick Piggin; +Cc: jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
On Thu, Jun 02, 2005 at 09:09:23AM +1000, Nick Piggin wrote:
>
> It adds a lot of complexity to the page allocator and while
> it might be very good, the only improvement we've been shown
> yet is allocating lots of MAX_ORDER allocations I think? (ie.
> not very useful)
>
Allocating lots of MAX_ORDER blocks can be very useful for things
like hot-pluggable memory. I know that this may not be of interest
to most. However, I've been combining Mel's defragmenting patch
with the memory hotplug patch set. As a result, I've been able to
go from 5GB down to 544MB of memory on my ppc64 system via offline
operations. Note that ppc64 only employs a single (DMA) zone. So,
page 'grouping' based on use is coming mainly from Mel's patch.
--
Mike
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 23:47 ` Mike Kravetz
@ 2005-06-01 23:56 ` Nick Piggin
2005-06-02 0:07 ` Mike Kravetz
0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-06-01 23:56 UTC (permalink / raw)
To: Mike Kravetz; +Cc: jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
Mike Kravetz wrote:
> On Thu, Jun 02, 2005 at 09:09:23AM +1000, Nick Piggin wrote:
>
>>It adds a lot of complexity to the page allocator and while
>>it might be very good, the only improvement we've been shown
>>yet is allocating lots of MAX_ORDER allocations I think? (ie.
>>not very useful)
>>
>
>
> Allocating lots of MAX_ORDER blocks can be very useful for things
> like hot-pluggable memory. I know that this may not be of interest
> to most. However, I've been combining Mel's defragmenting patch
> with the memory hotplug patch set. As a result, I've been able to
> go from 5GB down to 544MB of memory on my ppc64 system via offline
> operations. Note that ppc64 only employs a single (DMA) zone. So,
> page 'grouping' based on use is coming mainly from Mel's patch.
>
Back in the day, Linus would tell you to take a hike if you
wanted to complicate the buddy allocator to better support
memory hotplug ;)
I don't know what's happened to him now though, he seems to
have gone a little soft on you enterprise types.
Seriously - thanks for the data point, I had an idea that you
guys wanted this for mem hotplug.
--
SUSE Labs, Novell Inc.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 23:43 ` Nick Piggin
@ 2005-06-02 0:02 ` Martin J. Bligh
2005-06-02 0:20 ` Nick Piggin
2005-06-02 18:28 ` Andi Kleen
0 siblings, 2 replies; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-02 0:02 UTC (permalink / raw)
To: Nick Piggin; +Cc: jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
>>> It adds a lot of complexity to the page allocator and while
>>> it might be very good, the only improvement we've been shown
>>> yet is allocating lots of MAX_ORDER allocations I think? (ie.
>>> not very useful)
>>
>>
>> I agree that MAX_ORDER allocs aren't interesting, but we can hit
>> frag problems easily at way less than max order. CIFS does it, NFS
>> does it, jumbo frame gigabit ethernet does it, to name a few. The
>> most common failure I see is order 3.
>>
>
> Still? We had a lot of problems with kswapd not doing its
> job properly, and min_free_kbytes reserve was buggy...
>
> But if you still trigger it, I would be interested to see
> traces. I don't frequently test things like XFS, or heavy
> gige+jumbo loads.
It gets very messy when CIFS requires a large buffer to write back
to disk in order to free memory ...
[c0000000af5e6590] [c00000000008f780] .__alloc_pages+0x3a4/0x40c (unreliable)
[c0000000af5e6670] [c0000000000afae8] .alloc_pages_current+0xac/0xd0
[c0000000af5e6700] [c00000000008f808] .__get_free_pages+0x20/0x98
[c0000000af5e6780] [c000000000094594] .kmem_getpages+0x48/0x200
[c0000000af5e6800] [c0000000000959cc] .cache_grow+0xf0/0x1f0
[c0000000af5e68b0] [c000000000095d4c] .cache_alloc_refill+0x280/0x2fc
[c0000000af5e6960] [c000000000096114] .kmem_cache_alloc+0x9c/0xc0
[c0000000af5e69f0] [c00000000008db54] .mempool_alloc_slab+0x1c/0x30
[c0000000af5e6a70] [c00000000008d96c] .mempool_alloc+0x154/0x234
[c0000000af5e6b80] [d0000000004f1ee0] .cifs_buf_get+0x28/0x74 [cifs]
[c0000000af5e6c00] [d0000000004d77d0] .smb_init+0x358/0x3c4 [cifs]
[c0000000af5e6d20] [d0000000004d90a8] .CIFSSMBWrite+0x7c/0x34c [cifs]
[c0000000af5e6e00] [d0000000004ec394] .cifs_write+0x204/0x374 [cifs]
[c0000000af5e6ef0] [d0000000004ec6d0] .cifs_partialpagewrite+0x1cc/0x2b8 [cifs]
[c0000000af5e6fc0] [d0000000004ec87c] .cifs_writepage+0xc0/0x148 [cifs]
[c0000000af5e7050] [c000000000099834] .pageout+0x138/0x1c4
[c0000000af5e7130] [c000000000099b90] .shrink_list+0x2d0/0x608
[c0000000af5e7280] [c00000000009a24c] .shrink_cache+0x384/0x610
[c0000000af5e73c0] [c00000000009ad1c] .shrink_zone+0x104/0x140
[c0000000af5e7460] [c00000000009ade0] .shrink_caches+0x88/0xac
[c0000000af5e74f0] [c00000000009af54] .try_to_free_pages+0x10c/0x280
[c0000000af5e75f0] [c00000000008f660] .__alloc_pages+0x284/0x40c
[c0000000af5e76d0] [c0000000000afae8] .alloc_pages_current+0xac/0xd0
[c0000000af5e7760] [c000000000093b30] .do_page_cache_readahead+0x12c/0x210
[c0000000af5e7840] [c000000000093e38] .page_cache_readahead+0x224/0x280
[c0000000af5e78d0] [c00000000008a664] .do_generic_mapping_read+0x118/0x470
[c0000000af5e7a30] [c00000000008adc0] .__generic_file_aio_read+0x1c0/0x208
[c0000000af5e7b00] [c00000000008ae4c] .generic_file_aio_read+0x44/0x54
[c0000000af5e7b90] [c0000000000b6dd4] .do_sync_read+0xb8/0xfc
[c0000000af5e7cf0] [c0000000000b6f60] .vfs_read+0x148/0x1ac
[c0000000af5e7d90] [c0000000000b72b8] .sys_read+0x4c/0x8c
[c0000000af5e7e30] [c000000000011180] syscall_exit+0x0/0x18
There's one example ... we can probably work around it if we try hard
enough. However, the fundamental question becomes "do we support higher
order allocs, or not?". If not fine ... but we ought to quit pretending
we do. If so, then we need to make them more reliable.
>> Keep a machine up for a while, get it thoroughly fragmented, then
>> push it reasonably hard constant pressure, and try allocating anything
>> large.
>>
>> Seems to me we're basically pointing a blunderbuss at memory, and
>> blowing away large portions, and *hoping* something falls out the
>> bottom that's a big enough chunk?
>
> Yeah more or less. But with the fragmentation patch, it by
> no means becomes an exact science ;) I wouldn't have thought
> it would make it hugely easier to free an order 2 or 3 area
> memory block on a loaded machine.
Ummm. so the blunderbuss is an exact science? ;-) At least it fairly
consistently doesn't work, I suppose ;-) ;-)
> It does make MAX_ORDER allocations _possible_ when previously
> they wouldn't have been, simply by virtue of trying to put all
> memory that it knows is reclaimable in a MAX_ORDER area. When
> memory fills up and you need an order 3 allocation, you're
> more or less in the same boat AFAIKS.
If we could target specific "clustered blobs" of pages, we can stand
a hope of getting some big chunks back. I think the intent is to
separate out the reclaimable from the non-reclaimable, to some extent
at least ... give us much better odds.
> Why not just have kernel allocations going from the bottom
> up, and user allocations going from the top down. That would
> get you most of the way there, wouldn't it? (disclaimer: I
> could well be talking shit here).
Not sure it's quite that simple, though I haven't looked in detail
at these patches. My point was merely that we need to do *something*.
Off the top of my head ... what happens when kernel meets user in
the middle. where do we free and allocate from now ? ;-) Once we've
been up for a while, mem is nearly all used, nearly all of the time.
Is a good discussion to have though ;-)
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 23:56 ` Nick Piggin
@ 2005-06-02 0:07 ` Mike Kravetz
0 siblings, 0 replies; 42+ messages in thread
From: Mike Kravetz @ 2005-06-02 0:07 UTC (permalink / raw)
To: Nick Piggin; +Cc: jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
On Thu, Jun 02, 2005 at 09:56:18AM +1000, Nick Piggin wrote:
> Mike Kravetz wrote:
> >Allocating lots of MAX_ORDER blocks can be very useful for things
> >like hot-pluggable memory. I know that this may not be of interest
> >to most. However, I've been combining Mel's defragmenting patch
> >with the memory hotplug patch set. As a result, I've been able to
> >go from 5GB down to 544MB of memory on my ppc64 system via offline
> >operations. Note that ppc64 only employs a single (DMA) zone. So,
> >page 'grouping' based on use is coming mainly from Mel's patch.
> >
>
> Back in the day, Linus would tell you to take a hike if you
> wanted to complicate the buddy allocator to better support
> memory hotplug ;)
>
> I don't know what's happened to him now though, he seems to
> have gone a little soft on you enterprise types.
>
> Seriously - thanks for the data point, I had an idea that you
> guys wanted this for mem hotplug.
Mel wrote the patch independent of the mem hotplug effort. As
part of the hotplug effort, we knew fragmentation needed to be
addressed. So, when Mel released his patch we jumped all over
it.
--
Mike
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 0:02 ` Martin J. Bligh
@ 2005-06-02 0:20 ` Nick Piggin
2005-06-02 13:55 ` Mel Gorman
2005-06-02 15:52 ` Joel Schopp
2005-06-02 18:28 ` Andi Kleen
1 sibling, 2 replies; 42+ messages in thread
From: Nick Piggin @ 2005-06-02 0:20 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
Martin J. Bligh wrote:
> There's one example ... we can probably work around it if we try hard
> enough. However, the fundamental question becomes "do we support higher
> order allocs, or not?". If not fine ... but we ought to quit pretending
> we do. If so, then we need to make them more reliable.
>
It appears that we basically support order 3 allocations and
less (those will stay in the page allocator until something
happens).
I see your point... Mel's patch has failure cases though.
For example, someone turns swap off, or mlocks some memory
(I guess we then add the page migration defrag patch and
problem is solved?).
I do see your point. The extra complexity makes me cringe though
(no offence to Mel - I'm sure it is a complex problem).
>>Yeah more or less. But with the fragmentation patch, it by
>>no means becomes an exact science ;) I wouldn't have thought
>>it would make it hugely easier to free an order 2 or 3 area
>>memory block on a loaded machine.
>
>
> Ummm. so the blunderbuss is an exact science? ;-) At least it fairly
> consistently doesn't work, I suppose ;-) ;-)
>
No but I was just saying it is just another degree of
"unsupportedness" (or supportedness, if you are a half full man).
>>Why not just have kernel allocations going from the bottom
>>up, and user allocations going from the top down. That would
>>get you most of the way there, wouldn't it? (disclaimer: I
>>could well be talking shit here).
>
>
> Not sure it's quite that simple, though I haven't looked in detail
> at these patches. My point was merely that we need to do *something*.
> Off the top of my head ... what happens when kernel meets user in
> the middle. where do we free and allocate from now ? ;-) Once we've
> been up for a while, mem is nearly all used, nearly all of the time.
>
No, I'm quite sure it isn't that simple, unfortunately. Hence
disclaimer ;)
> Is a good discussion to have though ;-)
>
Yep, I was trying to help get something going!
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 20:55 ` Joel Schopp
2005-06-01 23:09 ` Nick Piggin
@ 2005-06-02 9:49 ` Mel Gorman
1 sibling, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2005-06-02 9:49 UTC (permalink / raw)
To: Joel Schopp; +Cc: linux-mm, linux-kernel, akpm
On Wed, 1 Jun 2005, Joel Schopp wrote:
>
> > - struct free_area *area;
> > struct page *buddy;
> > -
> > +
>
> ...
>
> > }
> > +
> > spin_unlock_irqrestore(&zone->lock, flags);
> > - return allocated;
> > + return count - allocated;
> > }
> > +
> > +
>
> Other than the very minor whitespace changes above I have nothing bad to say
> about this patch. I think it is about time to pick it up in -mm for wider
> testing.
>
Thanks. I posted a V13 without the whitespace damage.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-01 23:28 ` Martin J. Bligh
2005-06-01 23:43 ` Nick Piggin
@ 2005-06-02 13:15 ` Mel Gorman
2005-06-02 14:01 ` Martin J. Bligh
[not found] ` <20050603174706.GA25663@localhost.localdomain>
2 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2005-06-02 13:15 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Nick Piggin, jschopp, linux-mm, linux-kernel, akpm
On Wed, 1 Jun 2005, Martin J. Bligh wrote:
> --On Thursday, June 02, 2005 09:09:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > Joel Schopp wrote:
> >
> >>
> >> Other than the very minor whitespace changes above I have nothing bad to
> >> say about this patch. I think it is about time to pick it up in -mm for
> >> wider testing.
> >>
> >
> > It adds a lot of complexity to the page allocator and while
> > it might be very good, the only improvement we've been shown
> > yet is allocating lots of MAX_ORDER allocations I think? (ie.
> > not very useful)
>
> I agree that MAX_ORDER allocs aren't interesting, but we can hit
> frag problems easily at way less than max order. CIFS does it, NFS
> does it, jumbo frame gigabit ethernet does it, to name a few. The
> most common failure I see is order 3.
>
I focused on the MAX_ORDER allocations for two reasons. The first is
because they are very difficult to satisfy. If we can service MAX_ORDER
allocations, we can certainly service order 3. The second is that my very
long-term (and currently vapour-ware) aim is to transparently support
large pages which will require 4MiB blocks on the x86 at least.
> Keep a machine up for a while, get it thoroughly fragmented, then
> push it reasonably hard constant pressure, and try allocating anything
> large.
>
That is what bench-stresshighalloc does with order-10 allocations. It
compiles 7 trees at -j4 for a few minutes and then tries to allocate 160
order-10 pages. It tends to manage about 50% of them compared to 1-5% with
the standard allocator (which also kills everything).
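For illustration only, here is a kernel-module-style sketch of the kind of
allocation loop such a benchmark measures. The constants and the module
itself are hypothetical; bench-stresshighalloc is a separate test script,
not this code.

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/gfp.h>

#define STRESS_ORDER	10	/* 4MiB blocks with 4KiB pages */
#define STRESS_ATTEMPTS	160

static struct page *pages[STRESS_ATTEMPTS];

static int __init highorder_stress_init(void)
{
	int i, success = 0;

	/* Attempt the high-order allocations while the compile load runs. */
	for (i = 0; i < STRESS_ATTEMPTS; i++) {
		pages[i] = alloc_pages(GFP_KERNEL | __GFP_NOWARN, STRESS_ORDER);
		if (pages[i])
			success++;
	}
	printk(KERN_INFO "order-%d success: %d/%d\n",
	       STRESS_ORDER, success, STRESS_ATTEMPTS);

	/* Give everything back so the system can recover. */
	for (i = 0; i < STRESS_ATTEMPTS; i++)
		if (pages[i])
			__free_pages(pages[i], STRESS_ORDER);
	return 0;
}
module_init(highorder_stress_init);
MODULE_LICENSE("GPL");

The success count out of 160 is what the percentages above refer to.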
> Seems to me we're basically pointing a blunderbuss at memory, and
> blowing away large portions, and *hoping* something falls out the
> bottom that's a big enough chunk?
>
With this allocator, we are still using a blunderbuss approach but the
chances of big enough chunks being available are a lot better. I released a
proof-of-concept patch that freed pages by scanning linearly; it worked
very well, but it needs a lot of work. Linear scanning would help
guarantee high-order allocations but the penalty is that LRU ordering
would be violated.
To test lower-order allocations, I ran a slightly different test where I
tried to allocate 6000 order-5 pages under heavy pressure. The standard
allocator repeatedly went OOM and allocated 5190 pages. The modified one
did not OOM and allocated 5961. The test is not very fair though because
it pins memory and the allocations are type GFP_KERNEL. For the gigabit
ethernet and network filesystem tests, I imagine we are dealing with
GFP_ATOMIC or GFP_NFS?
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 0:20 ` Nick Piggin
@ 2005-06-02 13:55 ` Mel Gorman
2005-06-02 15:52 ` Joel Schopp
1 sibling, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2005-06-02 13:55 UTC (permalink / raw)
To: Nick Piggin; +Cc: Martin J. Bligh, jschopp, linux-mm, linux-kernel, akpm
On Thu, 2 Jun 2005, Nick Piggin wrote:
> Martin J. Bligh wrote:
>
> > There's one example ... we can probably work around it if we try hard
> > enough. However, the fundamental question becomes "do we support higher
> > order allocs, or not?". If not fine ... but we ought to quit pretending
> > we do. If so, then we need to make them more reliable.
> >
>
> It appears that we basically support order 3 allocations and
> less (those will stay in the page allocator until something
> happens).
>
That would appear to be the case. I'd like to run a few tests that make
heavy use of order-3 allocations. My current thinking is that running stress
tests over a CIFS filesystem may do the job. I'm open to suggestions.
> I see your point... Mel's patch has failure cases though.
> For example, someone turns swap off, or mlocks some memory
> (I guess we then add the page migration defrag patch and
> problem is solved?).
>
If swap is turned off, it will not work as well. However, buffer pages are
grouped together with anonymous pages so we might not get MAX_ORDER
allocations succeeding, but I think the order-3 to order-5 allocations would
still have a higher success rate. If we found that no-swap was a common
scenario, we could teach the patch to treat anonymous userspace pages as
KERNNORCLM rather than USERRCLM. The ordinary file-backed pages would then
be freeable for the high order allocations. mlock() is a case I have not
thought about at all.
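As a rough illustration of that kind of policy switch, here is a minimal
sketch. The allocation-class names follow the patch, but the classifier,
its parameters and helper names are invented for the example and are not
the code in the patch.

/* Allocation classes, using the patch's naming for the three pools. */
enum alloc_type { KERNNORCLM, KERNRCLM, USERRCLM };

/*
 * Hypothetical classifier: the real patch derives this from GFP flags;
 * the parameters here (is_user, is_anon, is_kern_reclaimable,
 * swap_enabled) exist only for illustration.
 */
static enum alloc_type classify_alloc(int is_user, int is_anon,
				      int is_kern_reclaimable,
				      int swap_enabled)
{
	if (is_user && is_anon && !swap_enabled)
		return KERNNORCLM;	/* nowhere to write the page out to */
	if (is_user)
		return USERRCLM;	/* file-backed or swappable user page */
	if (is_kern_reclaimable)
		return KERNRCLM;	/* dcache, inode cache, buffer_heads */
	return KERNNORCLM;
}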
These cases would be addressed by page migration but page migration is a
lot more complex than this lower fragmentation patch.
> I do see your point. The extra complexity makes me cringe though
> (no offence to Mel - I'm sure it is a complex problem).
>
No offence taken. I had trouble keeping the performance of this patch
comparable to the normal allocator, and the patch is large for a problem
that seems so straightforward. Despite the additional complexity, the
performance is still comparable (sometimes it runs faster, other times
slower). It is only when memory is running low that the additional lists
that were introduced have to be searched.
> > > Yeah more or less. But with the fragmentation patch, it by
> > > no means becomes an exact science ;) I wouldn't have thought
> > > it would make it hugely easier to free an order 2 or 3 area
> > > memory block on a loaded machine.
> >
> >
> > Ummm. so the blunderbuss is an exact science? ;-) At least it fairly
> > consistently doesn't work, I suppose ;-) ;-)
> >
>
> No but I was just saying it is just another degree of
> "unsupportedness" (or supportedness, if you are a half full man).
>
Again, I would assert that our supportedness is better with this patch
than without it. If we wanted to really support high-order allocations, we
would also need to be able to free physically adjacent pages, not
just free them in LRU order.
> > > Why not just have kernel allocations going from the bottom
> > > up, and user allocations going from the top down. That would
> > > get you most of the way there, wouldn't it? (disclaimer: I
> > > could well be talking shit here).
> >
> >
> > Not sure it's quite that simple, though I haven't looked in detail
> > at these patches. My point was merely that we need to do *something*.
> > Off the top of my head ... what happens when kernel meets user in
> > the middle. where do we free and allocate from now ? ;-) Once we've
> > been up for a while, mem is nearly all used, nearly all of the time.
> >
>
> No, I'm quite sure it isn't that simple, unfortunately. Hence
> disclaimer ;)
>
It isn't that simple :(. This was my first solution and one I dropped
after a while. Under a stress test, the two areas meet and fragmentation
is as bad as it ever was. It would only delay the problem, not fix it.
Even if there is a shared zone in the middle that both can use, it will
fill.
I ran a stress test over long periods of time with this patch and it
managed to consistently keep fragmentation down. In fact, the introduction
of the fallback reserve was to address the problem of slowly fragmenting
over long periods of time.
> > Is a good discussion to have though ;-)
> >
>
> Yep, I was trying to help get something going!
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 13:15 ` Mel Gorman
@ 2005-06-02 14:01 ` Martin J. Bligh
0 siblings, 0 replies; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-02 14:01 UTC (permalink / raw)
To: Mel Gorman; +Cc: Nick Piggin, jschopp, linux-mm, linux-kernel, akpm
>> >> Other than the very minor whitespace changes above I have nothing bad to
>> >> say about this patch. I think it is about time to pick it up in -mm for
>> >> wider testing.
>> >>
>> >
>> > It adds a lot of complexity to the page allocator and while
>> > it might be very good, the only improvement we've been shown
>> > yet is allocating lots of MAX_ORDER allocations I think? (ie.
>> > not very useful)
>>
>> I agree that MAX_ORDER allocs aren't interesting, but we can hit
>> frag problems easily at way less than max order. CIFS does it, NFS
>> does it, jumbo frame gigabit ethernet does it, to name a few. The
>> most common failure I see is order 3.
>>
>
> I focused on the MAX_ORDER allocations for two reasons. The first is
> because they are very difficult to satisfy. If we can service MAX_ORDER
> allocations, we can certainly service order 3. The second is that my very
> long-term (and currently vapour-ware) aim is to transparently support
> large pages which will require 4MiB blocks on the x86 at least.
Oh, I wasn't arguing with your approach ... it's always better to go a bit
further. I was just illustrating that there are real world problems right
now that hit this stuff, ergo we need it. Yes, I'd like to be able to do
large pages, memory hotplug, etc too ... but if people aren't excited about
those, there are plenty of other reasons to fix the frag problem.
> With this allocator, we are still using a blunderbuss approach but the
> chances of big enough chunks being available are a lot better. I released a
> proof-of-concept patch that freed pages by scanning linearly; it worked
> very well, but it needs a lot of work. Linear scanning would help
> guarantee high-order allocations but the penalty is that LRU ordering
> would be violated.
Yes, would be nice ... but we need to gather things into freeable and
non-freeable either way, it seems, so doesn't invalidate what you're
doing at all.
It seems apparent statistically that the larger the machine, the worse the
frag problem is, as we'll blow away more memory before getting contig
blocks. If it wasn't pre-7am, I'd try to calculate the statistics, but
frankly, I can't be bothered ;-) I'm sure there are others whose math
degree is less rusty than mine, and I'd hate to deprive them of the
opportunity to play ;-)
> To test lower-order allocations, I ran a slightly different test where I
> tried to allocate 6000 order-5 pages under heavy pressure. The standard
> allocator repeatedly went OOM and allocated 5190 pages. The modified one
> did not OOM and allocated 5961. The test is not very fair though because
> it pins memory and the allocations are type GFP_KERNEL. For the gigabit
> ethernet and network filesystem tests, I imagine we are dealing with
> GFP_ATOMIC or GFP_NFS?
cifsd: page allocation failure. order:3, mode:0xd0
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 0:20 ` Nick Piggin
2005-06-02 13:55 ` Mel Gorman
@ 2005-06-02 15:52 ` Joel Schopp
2005-06-02 19:50 ` Ray Bryant
2005-06-03 3:48 ` Nick Piggin
1 sibling, 2 replies; 42+ messages in thread
From: Joel Schopp @ 2005-06-02 15:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: Martin J. Bligh, Mel Gorman, linux-mm, linux-kernel, akpm
> I see your point... Mel's patch has failure cases though.
> For example, someone turns swap off, or mlocks some memory
> (I guess we then add the page migration defrag patch and
> problem is solved?).
This reminds me that page migration defrag will be pretty useless
without something like this done first. There will be stuff that can't
be migrated and it needs to be grouped together somehow.
In summary here are the reasons I see to run with Mel's patch:
1. It really helps with medium-large allocations under memory pressure.
2. Page migration defrag will need it.
3. Memory hotplug remove will need it.
On the downside we have:
1. Slightly more complexity in the allocator.
I'd personally trade a little extra complexity for any of the 3 upsides.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 0:02 ` Martin J. Bligh
2005-06-02 0:20 ` Nick Piggin
@ 2005-06-02 18:28 ` Andi Kleen
2005-06-02 18:42 ` Martin J. Bligh
1 sibling, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2005-06-02 18:28 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
"Martin J. Bligh" <mbligh@mbligh.org> writes:
> It gets very messy when CIFS requires a large buffer to write back
> to disk in order to free memory ...
How about just fixing CIFS to submit memory page by page? The network
stack below it supports that just fine and the VFS above it does anyway,
so it doesn't make much sense that CIFS, sitting between them, uses
larger buffers.
> There's one example ... we can probably work around it if we try hard
> enough. However, the fundamental question becomes "do we support higher
> order allocs, or not?". If not fine ... but we ought to quit pretending
> we do. If so, then we need to make them more reliable.
My understanding was that the deal was that order 1 is supposed
to work but somewhat slower, and bigger orders are supposed to work
at boot up time.
-Andi
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 18:28 ` Andi Kleen
@ 2005-06-02 18:42 ` Martin J. Bligh
0 siblings, 0 replies; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-02 18:42 UTC (permalink / raw)
To: Andi Kleen; +Cc: jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
>> It gets very messy when CIFS requires a large buffer to write back
>> to disk in order to free memory ...
>
> How about just fixing CIFS to submit memory page by page? The network
> stack below it supports that just fine and the VFS above it does anyway,
> so it doesn't make much sense that CIFS, sitting between them, uses
> larger buffers.
Might well be possible, but it's not just CIFS though. I don't see why
CIFS needs phys contig memory, but I think some of the drivers do (at
least they do at the moment). Large pages and hotplug definitely will.
>> There's one example ... we can probably work around it if we try hard
>> enough. However, the fundamental question becomes "do we support higher
>> order allocs, or not?". If not fine ... but we ought to quit pretending
>> we do. If so, then we need to make them more reliable.
>
> My understanding was that the deal was that order 1 is supposed
> to work but somewhat slower, and bigger orders are supposed to work
> at boot up time.
If that's the decision we come to, I'm OK with it ... but lots of code
needs fixing first. However, I don't think that's currently the stated
intent; we try pretty hard for up to order 3 in __alloc_pages(). I think
we'll have an inherent need for higher orders from what I've seen, and
thus we'll have to be capable, to some extent, of reclaiming memory for
those allocs. We should probably put together a list of things that really
need it; Joel had a start at one further down this thread.
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 15:52 ` Joel Schopp
@ 2005-06-02 19:50 ` Ray Bryant
2005-06-02 20:10 ` Joel Schopp
2005-06-03 3:48 ` Nick Piggin
1 sibling, 1 reply; 42+ messages in thread
From: Ray Bryant @ 2005-06-02 19:50 UTC (permalink / raw)
To: jschopp
Cc: Nick Piggin, Martin J. Bligh, Mel Gorman, linux-mm, linux-kernel, akpm
> In summary here are the reasons I see to run with Mel's patch:
>
> 1. It really helps with medium-large allocations under memory pressure.
> 2. Page migration defrag will need it.
> 3. Memory hotplug remove will need it.
>
Could someone point me at the "Page migration defrag" patch, or
describe what this is. Does this depend on the page migration
patches from memory hotplug to move pages or is it something
different?
Thanks,
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 19:50 ` Ray Bryant
@ 2005-06-02 20:10 ` Joel Schopp
2005-06-04 16:09 ` Marcelo Tosatti
0 siblings, 1 reply; 42+ messages in thread
From: Joel Schopp @ 2005-06-02 20:10 UTC (permalink / raw)
To: Ray Bryant
Cc: Nick Piggin, Martin J. Bligh, Mel Gorman, linux-mm, linux-kernel, akpm
> Could someone point me at the "Page migration defrag" patch, or
> describe what this is. Does this depend on the page migration
> patches from memory hotplug to move pages or is it something
> different?
I don't think anybody has actually written such a patch yet (correct me
if I'm wrong). When somebody does, it will certainly depend on the page
migration patches. As far as describing what it is, the concept is
pretty simple: migrate in-use pieces of memory around to turn lots of
small unallocated regions into fewer, larger ones.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 15:52 ` Joel Schopp
2005-06-02 19:50 ` Ray Bryant
@ 2005-06-03 3:48 ` Nick Piggin
2005-06-03 4:49 ` David S. Miller, Nick Piggin
2005-06-03 13:05 ` Mel Gorman
1 sibling, 2 replies; 42+ messages in thread
From: Nick Piggin @ 2005-06-03 3:48 UTC (permalink / raw)
To: jschopp; +Cc: Martin J. Bligh, Mel Gorman, linux-mm, lkml, Andrew Morton
On Thu, 2005-06-02 at 10:52 -0500, Joel Schopp wrote:
> > I see your point... Mel's patch has failure cases though.
> > For example, someone turns swap off, or mlocks some memory
> > (I guess we then add the page migration defrag patch and
> > problem is solved?).
>
> This reminds me that page migration defrag will be pretty useless
> without something like this done first. There will be stuff that can't
> be migrated and it needs to be grouped together somehow.
>
> In summary here are the reasons I see to run with Mel's patch:
>
> 1. It really helps with medium-large allocations under memory pressure.
> 2. Page migration defrag will need it.
> 3. Memory hotplug remove will need it.
>
I guess I'm now more convinced of its need ;)
add:
4. large pages
5. (hopefully) helps with smaller allocations (ie. order 3)
It would really help your cause in the short term if you can
demonstrate improvements for say order-3 allocations (eg. use
gige networking, TSO, jumbo frames, etc).
> On the downside we have:
>
> 1. Slightly more complexity in the allocator.
>
For some definitions of 'slightly', perhaps :(
Although I can't argue that a buddy allocator is no good without
being able to satisfy higher order allocations.
So in that case, I'm personally OK with it going into -mm. Hopefully
there will be a bit more review and hopefully some simplification if
possible.
Last question: how does it go on systems with really tiny memories?
(4MB, 8MB, that kind of thing).
> I'd personally trade a little extra complexity for any of the 3 upsides.
>
--
SUSE Labs, Novell Inc.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 3:48 ` Nick Piggin
@ 2005-06-03 4:49 ` David S. Miller, Nick Piggin
2005-06-03 5:34 ` Martin J. Bligh
2005-06-03 13:05 ` Mel Gorman
1 sibling, 1 reply; 42+ messages in thread
From: David S. Miller, Nick Piggin @ 2005-06-03 4:49 UTC (permalink / raw)
To: nickpiggin; +Cc: jschopp, mbligh, mel, linux-mm, linux-kernel, akpm
> It would really help your cause in the short term if you can
> demonstrate improvements for say order-3 allocations (eg. use
> gige networking, TSO, jumbo frames, etc).
TSO chops up the user data into PAGE_SIZE chunks, it doesn't
make use of non-zero page orders.
AF_UNIX sockets, however, will happily use higher order
pages. But even this is limited to SKB_MAX_ORDER which
is currently defined to 2.
So the only way to get order 3 or larger allocations with
the networking is to use jumbo frames but without TSO enabled.
Actually, even with TSO enabled, you'll get large order
allocations, but for receive packets, and these allocations
happen in software interrupt context.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 4:49 ` David S. Miller, Nick Piggin
@ 2005-06-03 5:34 ` Martin J. Bligh
2005-06-03 5:37 ` David S. Miller, Martin J. Bligh
2005-06-03 6:43 ` Nick Piggin
0 siblings, 2 replies; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-03 5:34 UTC (permalink / raw)
To: David S. Miller, Nick Piggin; +Cc: jschopp, mel, linux-mm, linux-kernel, akpm
>> It would really help your cause in the short term if you can
>> demonstrate improvements for say order-3 allocations (eg. use
>> gige networking, TSO, jumbo frames, etc).
>
> TSO chops up the user data into PAGE_SIZE chunks, it doesn't
> make use of non-zero page orders.
>
> AF_UNIX sockets, however, will happily use higher order
> pages. But even this is limited to SKB_MAX_ORDER which
> is currently defined to 2.
>
> So the only way to get order 3 or larger allocations with
> the networking is to use jumbo frames but without TSO enabled.
One of the calls I got the other day was for loopback interface.
Default MTU is 16K, which seems to screw everything up and do higher
order allocs. Turning it down to under 4K seemed to fix things. I'm
fairly sure loopback doesn't really need phys contig memory, but it
seems to use it at the moment ;-)
> Actually, even with TSO enabled, you'll get large order
> allocations, but for receive packets, and these allocations
> happen in software interrupt context.
Sounds like we still need to cope then ... ?
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 5:34 ` Martin J. Bligh
@ 2005-06-03 5:37 ` David S. Miller, Martin J. Bligh
2005-06-03 5:42 ` Martin J. Bligh
2005-06-03 6:43 ` Nick Piggin
1 sibling, 1 reply; 42+ messages in thread
From: David S. Miller, Martin J. Bligh @ 2005-06-03 5:37 UTC (permalink / raw)
To: mbligh; +Cc: nickpiggin, jschopp, mel, linux-mm, linux-kernel, akpm
> One of the calls I got the other day was for loopback interface.
> Default MTU is 16K, which seems to screw everything up and do higher
> order allocs. Turning it down to under 4K seemed to fix things. I'm
> fairly sure loopback doesn't really need phys contig memory, but it
> seems to use it at the moment ;-)
It helps get better bandwidth to have larger buffers.
That's why AF_UNIX tries to use larger orders as well.
With all these processors using prefetching in their
memcpy() implementations, reducing the number of memcpy()
calls per byte is getting more and more important.
Each memcpy() call makes you hit the memory latency
cost since the first prefetch can't be done early
enough.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 5:37 ` David S. Miller, Martin J. Bligh
@ 2005-06-03 5:42 ` Martin J. Bligh
2005-06-03 5:51 ` David S. Miller, Martin J. Bligh
2005-06-03 13:13 ` Mel Gorman
0 siblings, 2 replies; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-03 5:42 UTC (permalink / raw)
To: David S. Miller; +Cc: nickpiggin, jschopp, mel, linux-mm, linux-kernel, akpm
--"David S. Miller" <davem@davemloft.net> wrote (on Thursday, June 02, 2005 22:37:12 -0700):
> From: "Martin J. Bligh" <mbligh@mbligh.org>
> Date: Thu, 02 Jun 2005 22:34:42 -0700
>
>> One of the calls I got the other day was for loopback interface.
>> Default MTU is 16K, which seems to screw everything up and do higher
>> order allocs. Turning it down to under 4K seemed to fix things. I'm
>> fairly sure loopback doesn't really need phys contig memory, but it
>> seems to use it at the moment ;-)
>
> It helps get better bandwidth to have larger buffers.
> That's why AF_UNIX tries to use larger orders as well.
Though surely the reality will be that after your system is up for a
while, and is thoroughly fragmented, your latency becomes frigging horrible
for most allocs? You risk writing a crapload of pages out to disk
for every alloc ...
> With all these processors using prefetching in their
> memcpy() implementations, reducing the number of memcpy()
> calls per byte is getting more and more important.
> Each memcpy() call makes you hit the memory latency
> cost since the first prefetch can't be done early
> enough.
but it's vastly different order of magnitude than touching disk.
Can we not do a "sniff alloc" first (ie if this is easy, give it
to me, else just fail and return w/o reclaim), then fall back to
smaller allocs? Though I suspect the reality is that on any real
system, an order 4 alloc will never actually succeed in any sensible
amount of time anyway? Perhaps us lot just reboot too often ;-)
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 5:42 ` Martin J. Bligh
@ 2005-06-03 5:51 ` David S. Miller, Martin J. Bligh
2005-06-03 13:13 ` Mel Gorman
1 sibling, 0 replies; 42+ messages in thread
From: David S. Miller, Martin J. Bligh @ 2005-06-03 5:51 UTC (permalink / raw)
To: mbligh; +Cc: nickpiggin, jschopp, mel, linux-mm, linux-kernel, akpm
> but it's vastly different order of magnitude than touching disk.
> Can we not do a "sniff alloc" first (ie if this is easy, give it
> to me, else just fail and return w/o reclaim), then fall back to
> smaller allocs?
That's what AF_UNIX does.
But with other protocols, we can't jiggle the loopback
MTU just because higher allocs are no longer easily
obtainable.
Really, the networking should not try to grab anything
more than SKB_MAX_ORDER unless the device's MTU is
larger than PAGE_SIZE << SKB_MAX_ORDER, which loopback's
"16K - fudge" is not.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 5:34 ` Martin J. Bligh
2005-06-03 5:37 ` David S. Miller, Martin J. Bligh
@ 2005-06-03 6:43 ` Nick Piggin
2005-06-03 13:57 ` Martin J. Bligh
2005-06-04 1:44 ` Herbert Xu
1 sibling, 2 replies; 42+ messages in thread
From: Nick Piggin @ 2005-06-03 6:43 UTC (permalink / raw)
To: Martin J. Bligh
Cc: David S. Miller, jschopp, mel, linux-mm, linux-kernel, akpm
Martin J. Bligh wrote:
>>>It would really help your cause in the short term if you can
>>>demonstrate improvements for say order-3 allocations (eg. use
>>>gige networking, TSO, jumbo frames, etc).
>>
>>TSO chops up the user data into PAGE_SIZE chunks, it doesn't
>>make use of non-zero page orders.
>>
My mistake. Thanks for correcting me.
> One of the calls I got the other day was for loopback interface.
> Default MTU is 16K, which seems to screw everything up and do higher
> order allocs. Turning it down to under 4K seemed to fix things. I'm
> fairly sure loopback doesn't really need phys contig memory, but it
> seems to use it at the moment ;-)
>
Out of interest, I did do some tests a while back that showed
16K is good for TCP over loopback bandwidth on a few different
types of CPUs (P3, Xeon, Opteron...).
IIRC 32K may have been slightly faster, but not enough to warrant
that size allocation.
Bandwidth for smaller sizes dropped off quite significantly,
although I'm not sure if that would have been from the actual
memory copy overhead or increased per-'something' overhead in the
network code. If the latter, that would suggest at least in theory
it could use noncontiguous physical pages.
>
>>Actually, even with TSO enabled, you'll get large order
>>allocations, but for receive packets, and these allocations
>>happen in software interrupt context.
>
>
> Sounds like we still need to cope then ... ?
>
Sure. Although we should try to not use higher order allocs if
possible of course. Even with a fallback mode, you will still be
putting more pressure on higher order areas and thus degrading
the service for *other* allocators, so such schemes should
obviously be justified by performance improvements.
--
SUSE Labs, Novell Inc.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 3:48 ` Nick Piggin
2005-06-03 4:49 ` David S. Miller, Nick Piggin
@ 2005-06-03 13:05 ` Mel Gorman
2005-06-03 14:00 ` Martin J. Bligh
1 sibling, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2005-06-03 13:05 UTC (permalink / raw)
To: Nick Piggin; +Cc: jschopp, Martin J. Bligh, linux-mm, lkml, Andrew Morton
On Fri, 3 Jun 2005, Nick Piggin wrote:
> On Thu, 2005-06-02 at 10:52 -0500, Joel Schopp wrote:
> > > I see your point... Mel's patch has failure cases though.
> > > For example, someone turns swap off, or mlocks some memory
> > > (I guess we then add the page migration defrag patch and
> > > problem is solved?).
> >
> > This reminds me that page migration defrag will be pretty useless
> > without something like this done first. There will be stuff that can't
> > be migrated and it needs to be grouped together somehow.
> >
> > In summary here are the reasons I see to run with Mel's patch:
> >
> > 1. It really helps with medium-large allocations under memory pressure.
> > 2. Page migration defrag will need it.
> > 3. Memory hotplug remove will need it.
> >
>
> I guess I'm now more convinced of its need ;)
>
> add:
> 4. large pages
> 5. (hopefully) helps with smaller allocations (ie. order 3)
>
6. Avoid calls to the page allocator
If a subsystem needs 8 pages to do a job, it should only have to call the
allocator once, not call it 8 times and spend extra time breaking the job
up into page-sized chunks.
If you look at rmqueue_bulk() in the patch, you'll see that it tries to
fill the per-cpu lists of order-0 pages with one call to the allocator,
which saves some time. The additional statistics patch shows how many of
these bulk requests are made and what sizes were actually allocated. It
shows that a sizable number of additional calls to the buddy allocator are
avoided.
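A simplified sketch of that idea, not the patch's actual rmqueue_bulk():
take one block of the largest useful order and split it onto the per-cpu
list, falling back to smaller blocks as needed. alloc_block() and
split_to_list() are hypothetical stand-ins for the real free-list removal
and splitting steps, and the zone locking is omitted.

#include <linux/kernel.h>
#include <linux/bitops.h>
#include <linux/list.h>
#include <linux/mmzone.h>

/* Hypothetical helpers; they do not exist under these names. */
struct page *alloc_block(struct zone *zone, int order);
void split_to_list(struct page *block, int order, struct list_head *pcp_list);

static int bulk_refill(struct zone *zone, struct list_head *pcp_list, int count)
{
	int allocated = 0;

	while (allocated < count) {
		/* Largest order that does not overshoot the request. */
		int order = min(MAX_ORDER - 1, fls(count - allocated) - 1);
		struct page *block = NULL;

		/* Fall back to smaller blocks if a big one is not free. */
		while (order >= 0 && !(block = alloc_block(zone, order)))
			order--;
		if (!block)
			break;			/* genuinely out of pages */

		/* One call produced 1 << order pages for the pcp list. */
		split_to_list(block, order, pcp_list);
		allocated += 1 << order;
	}
	return allocated;
}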
> It would really help your cause in the short term if you can
> demonstrate improvements for say order-3 allocations (eg. use
> gige networking, TSO, jumbo frames, etc).
>
I will work on this early next week unless someone else beats me to it.
Right now, I need to catch a flight and I'll be offline for a few days.
Once I start, I'm going to be running tests on network filesystems mounted
on the loopback device to see what sort of results I find.
>
> > On the downside we have:
> >
> > 1. Slightly more complexity in the allocator.
> >
>
> For some definitions of 'slightly', perhaps :(
>
Does it need more documentation? If so, I'll write up a detailed blurb on
how it works and drop it into Documentation/
> Although I can't argue that a buddy allocator is no good without
> being able to satisfy higher order allocations.
>
Unfortunately, it is a fundamental flaw of the buddy allocator that it
fragments badly. The thing is, other allocators that do not fragment are
also slower.
> So in that case, I'm personally OK with it going into -mm. Hopefully
> there will be a bit more review and hopefully some simplification if
> possible.
>
Fingers crossed. I would be very interested to see anything that makes it
simpler.
> Last question: how does it go on systems with really tiny memories?
> (4MB, 8MB, that kind of thing).
>
I tested it with mem=16M (anything lower and my machine doesn't boot the
standard kernel without going OOM, let alone the patched one). There was no
significant change in performance of the aim9 benchmarks. The additional
memory overhead of the patch is insignificant.
However, at low memory, the fragmentation strategy does not work unless
MAX_ORDER is also dropped. As the block reservations are 2^MAX_ORDER in
size, there is no point reserving anything if the system only has 4
MAX_ORDER blocks to begin with.
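The arithmetic behind that, assuming 4KiB pages and 4MiB maximum-order
blocks (1024 pages per block); the constants are illustrative only.

#define EXAMPLE_PAGE_SIZE	4096UL
#define EXAMPLE_BLOCK_PAGES	1024UL		/* one order-10 block */
#define EXAMPLE_BLOCK_BYTES	(EXAMPLE_PAGE_SIZE * EXAMPLE_BLOCK_PAGES) /* 4MiB */

/* mem=16M gives 16MiB / 4MiB = 4 maximum-order blocks, so reserving
 * even one block for fallbacks would tie up a quarter of RAM. */
static unsigned long max_order_blocks(unsigned long mem_bytes)
{
	return mem_bytes / EXAMPLE_BLOCK_BYTES;
}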
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 5:42 ` Martin J. Bligh
2005-06-03 5:51 ` David S. Miller, Martin J. Bligh
@ 2005-06-03 13:13 ` Mel Gorman
1 sibling, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2005-06-03 13:13 UTC (permalink / raw)
To: Martin J. Bligh
Cc: David S. Miller, nickpiggin, jschopp, linux-mm, linux-kernel, akpm
On Thu, 2 Jun 2005, Martin J. Bligh wrote:
> > From: "Martin J. Bligh" <mbligh@mbligh.org>
> > Date: Thu, 02 Jun 2005 22:34:42 -0700
> >
> >> One of the calls I got the other day was for loopback interface.
> >> Default MTU is 16K, which seems to screw everything up and do higher
> >> order allocs. Turning it down to under 4K seemed to fix things. I'm
> >> fairly sure loopback doesn't really need phys contig memory, but it
> >> seems to use it at the moment ;-)
> >
> > It helps get better bandwidth to have larger buffers.
> > That's why AF_UNIX tries to use larger orders as well.
>
> Though surely the reality will be that after your system is up for a
> while, and is thoroughly fragmented, your latency becomes frigging horrible
> for most allocs? You risk writing a crapload of pages out to disk
> for every alloc ...
>
That would be interesting to find out. I have it on my TODO list to teach
bench-stresshighalloc to time how long allocations are taking. It'll be at
least a week before I get around to it though.
> > With all these processors using prefetching in their
> > memcpy() implementations, reducing the number of memcpy()
> > calls per byte is getting more and more important.
> > Each memcpy() call makes you hit the memory latency
> > cost since the first prefetch can't be done early
> > enough.
>
> but it's vastly different order of magnitude than touching disk.
> Can we not do a "sniff alloc" first (ie if this is easy, give it
> to me, else just fail and return w/o reclaim), then fall back to
> smaller allocs?
rmqueue_bulk() in the patch does something like this. It tries to allocate
in the largest possible blocks and falls back to the lower orders as
necessary. It could still end up reclaiming though. I think the only
easy way to fail and return w/o reclaim is to use GFP_ATOMIC, which would
have other consequences.
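A hedged sketch of the caller-side pattern being discussed, not code from
the patch: probe for a high-order block without sleeping or reclaiming,
then fall back to smaller pieces. Note that GFP_ATOMIC dips into the
emergency reserves, which is the "other consequences" caveat above.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Try progressively smaller orders; GFP_ATOMIC will not sleep or
 * reclaim, so a failed probe is cheap.  *order is updated to tell the
 * caller what it actually received. */
static struct page *sniff_alloc(unsigned int *order)
{
	while (*order > 0) {
		struct page *page =
			alloc_pages(GFP_ATOMIC | __GFP_NOWARN, *order);
		if (page)
			return page;
		(*order)--;
	}
	/* Final single-page attempt may block and reclaim as usual. */
	return alloc_pages(GFP_KERNEL, 0);
}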
> Though I suspect the reality is that on any real
> system, an order 4 alloc will never actually succeed in any sensible
> amount of time anyway? Perhaps us lot just reboot too often ;-)
>
That is quite possible :) . I'll see about teaching the benchmarks to time
allocations to see how much time we spend satisfying order 4 allocations
on the standard kernel and with the patch.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 6:43 ` Nick Piggin
@ 2005-06-03 13:57 ` Martin J. Bligh
2005-06-03 16:43 ` Dave Hansen
2005-06-04 1:44 ` Herbert Xu
1 sibling, 1 reply; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-03 13:57 UTC (permalink / raw)
To: Nick Piggin; +Cc: David S. Miller, jschopp, mel, linux-mm, linux-kernel, akpm
>>> Actually, even with TSO enabled, you'll get large order
>>> allocations, but for receive packets, and these allocations
>>> happen in software interrupt context.
>>
>> Sounds like we still need to cope then ... ?
>
> Sure. Although we should try to not use higher order allocs if
> possible of course. Even with a fallback mode, you will still be
> putting more pressure on higher order areas and thus degrading
> the service for *other* allocators, so such schemes should
> obviously be justified by performance improvements.
My point is that outside of a benchmark situation (where we just
rebooted the machine to run a test) you will NEVER get an order 4
block free anyway, so it's pointless. Moreover, if we use non-contig
order 0 blocks, we can use cache hot pages ;-)
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 13:05 ` Mel Gorman
@ 2005-06-03 14:00 ` Martin J. Bligh
2005-06-08 17:03 ` Mel Gorman
0 siblings, 1 reply; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-03 14:00 UTC (permalink / raw)
To: Mel Gorman, Nick Piggin; +Cc: jschopp, linux-mm, lkml, Andrew Morton
> Does it need more documentation? If so, I'll write up a detailed blurb on
> how it works and drop it into Documentation/
>
>> Although I can't argue that a buddy allocator is no good without
>> being able to satisfy higher order allocations.
>
> Unfortunately, it is a fundamental flaw of the buddy allocator that it
> fragments badly. The thing is, other allocators that do not fragment are
> also slower.
Do we care? 99.9% of allocations are fronted by the hot/cold page cache
now anyway ... and yes, I realise that things popping in/out of that
obviously aren't going into the "defrag" pool, but still, it should help.
I suppose all we're slowing down is higher order allocs anyway, which
is the uncommon case, but ... worth thinking about.
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 13:57 ` Martin J. Bligh
@ 2005-06-03 16:43 ` Dave Hansen
2005-06-03 18:43 ` David S. Miller, Dave Hansen
0 siblings, 1 reply; 42+ messages in thread
From: Dave Hansen @ 2005-06-03 16:43 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Nick Piggin, David S. Miller, jschopp, mel, linux-mm,
Linux Kernel Mailing List, Andrew Morton
On Fri, 2005-06-03 at 06:57 -0700, Martin J. Bligh wrote:
>
> >>> Actually, even with TSO enabled, you'll get large order
> >>> allocations, but for receive packets, and these allocations
> >>> happen in software interrupt context.
> >>
> >> Sounds like we still need to cope then ... ?
> >
> > Sure. Although we should try to not use higher order allocs if
> > possible of course. Even with a fallback mode, you will still be
> > putting more pressure on higher order areas and thus degrading
> > the service for *other* allocators, so such schemes should
> > obviously be justified by performance improvements.
>
> My point is that outside of a benchmark situation (where we just
> rebooted the machine to run a test) you will NEVER get an order 4
> block free anyway, so it's pointless.
I ran a little test overnight on a 16GB i386 system.
./nc -l -p 9999 > /dev/null & cat /dev/zero | ./nc localhost 9999
It pushed around 200MB of traffic through lo. Is that (relatively low)
transmission rate due to having to kick off kswapd any time it wants to
send a packet?
partial mem/buddyinfo before:
MemTotal: 16375212 kB
MemFree: 214248 kB
HighTotal: 14548952 kB
HighFree: 198272 kB
LowTotal: 1826260 kB
LowFree: 15976 kB
Cached: 14415800 kB
Node 0, zone DMA 217 35 2 1 1 1 1 0 1 1 1
Node 0, zone Normal 7236 3020 3885 104 7 0 0 0 0 0 1
Node 0, zone HighMem 18 503 0 0 1 0 0 1 0 0 0
partial mem/buddyinfo after:
MemTotal: 16375212 kB
MemFree: 13471604 kB
HighTotal: 14548952 kB
HighFree: 13450624 kB
LowTotal: 1826260 kB
LowFree: 20980 kB
Cached: 972988 kB
Node 0, zone DMA 1 0 1 1 1 1 1 0 1 1 1
Node 0, zone Normal 1488 52 10 66 7 0 0 0 0 0 1
Node 0, zone HighMem 1322 3541 3165 20611 20651 14062 8054 5400 2643 664 169
There was surely plenty of other stuff going on, but it looks like
ZONE_HIGHMEM got eaten, and has plenty of large contiguous areas
available. This probably shows the collateral damage when kswapd goes
randomly shooting down pages. Are those loopback allocations
GFP_KERNEL?
-- Dave
* Re: Avoiding external fragmentation with a placement policy Version 12
[not found] ` <20050603174706.GA25663@localhost.localdomain>
@ 2005-06-03 17:56 ` Martin J. Bligh
0 siblings, 0 replies; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-03 17:56 UTC (permalink / raw)
To: Sonny Rao; +Cc: Nick Piggin, jschopp, Mel Gorman, linux-mm, linux-kernel, akpm
--On Friday, June 03, 2005 12:47:06 -0500 Sonny Rao <sonnyrao@us.ibm.com> wrote:
> On Wed, Jun 01, 2005 at 04:28:34PM -0700, Martin J. Bligh wrote:
> <snip>
>> Seems to me we're basically pointing a blunderbuss at memory, and
>> blowing away large portions, and *hoping* something falls out the
>> bottom that's a big enough chunk?
>
> Isn't this also the case with the slab shrinkers ??
>
> We kill stuff until some free pages hopefully fall out, but this can
> be difficult when you have 20+ non-related items per page (dcache).
>
> I think there should be a better way there as well.
Yup. Same problem, I've been looking at that too ...
M.
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 16:43 ` Dave Hansen
@ 2005-06-03 18:43 ` David S. Miller, Dave Hansen
0 siblings, 0 replies; 42+ messages in thread
From: David S. Miller, Dave Hansen @ 2005-06-03 18:43 UTC (permalink / raw)
To: haveblue; +Cc: mbligh, nickpiggin, jschopp, mel, linux-mm, linux-kernel, akpm
> Are those loopback allocations GFP_KERNEL?
It depends :-) Most of the time, the packets will be
allocated at sendmsg() time for the user, and thus GFP_KERNEL.
But the flags may be different if, for example, the packet
is being allocated for the NFS client/server code, or some
asynchronous packet generated at software interrupt time
(TCP ACKs, ICMP replies, etc.).
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 6:43 ` Nick Piggin
2005-06-03 13:57 ` Martin J. Bligh
@ 2005-06-04 1:44 ` Herbert Xu
2005-06-04 2:15 ` Nick Piggin
1 sibling, 1 reply; 42+ messages in thread
From: Herbert Xu @ 2005-06-04 1:44 UTC (permalink / raw)
To: Nick Piggin; +Cc: mbligh, davem, jschopp, mel, linux-mm, linux-kernel, akpm
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> network code. If the latter, that would suggest at least in theory
>>it could use noncontiguous physical pages.
With Dave's latest super-TSO patch, TCP over loopback will only be
doing order-0 allocations in the common case. UDP and others may
still do large allocations but that logic is all localised in
ip_append_data.
So if we wanted we could easily remove most large allocations over
the loopback device.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-04 1:44 ` Herbert Xu
@ 2005-06-04 2:15 ` Nick Piggin
2005-06-05 19:52 ` David S. Miller, Nick Piggin
0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-06-04 2:15 UTC (permalink / raw)
To: Herbert Xu; +Cc: mbligh, davem, jschopp, mel, linux-mm, linux-kernel, akpm
Herbert Xu wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>network code. If the latter, that would suggest at least in theory
>>it could use noncontiguous physical pages.
>
>
> With Dave's latest super-TSO patch, TCP over loopback will only be
> doing order-0 allocations in the common case. UDP and others may
> still do large allocations but that logic is all localised in
> ip_append_data.
>
> So if we wanted we could easily remove most large allocations over
> the loopback device.
I would be very interested to look into that. I would be
willing to do benchmarks on a range of machines too if
that would be of any use to you.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-02 20:10 ` Joel Schopp
@ 2005-06-04 16:09 ` Marcelo Tosatti
0 siblings, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2005-06-04 16:09 UTC (permalink / raw)
To: Joel Schopp
Cc: Ray Bryant, Nick Piggin, Martin J. Bligh, Mel Gorman, linux-mm,
linux-kernel, akpm
On Thu, Jun 02, 2005 at 03:10:44PM -0500, Joel Schopp wrote:
> >Could someone point me at the "Page migration defrag" patch, or
> >describe what this is. Does this depend on the page migration
> >patches from memory hotplug to move pages or is it something
> >different?
>
> I don't think anybody has actually written such a patch yet (correct me
> if I'm wrong). When somebody does it will certainly depend on the page
> migration patches. As far as describing what it is, the concept is
> pretty simple. Migrate in use pieces of memory around to make lots of
> smaller unallocated memory into fewer larger unallocated memory.
I've tried to experiment with such memory defragmentation - this is the
latest version (a bit crippled, with debugging printks, etc).
The major part of this patch is the creation of nonblocking versions of
the page migration functions (which can be used by any defragmentation
scheme).
The defragmentation I have implemented is pretty simple: coalesce_memory()
walks the highest-order free list, looking for contiguous freeable
pages to migrate.
Testing shows that it works reliably - a test scenario which demands
large amounts of high-order pages shows a great improvement: such awareness
makes it possible for the memory reclaim code to easily free contiguous
regions, whereas at the moment a scenario with a demand for high-order
pages simply brings the box to a crawl while the VM madly reclaims
single pages.
The problem with using the free lists as a starting point for inspection
is that, under severe memory shortage, the probability of finding freeable
regions goes down as the number of pages on the free lists goes down. We
ought to find another starting point for inspection.
On the other side of the coin, the approach requires a huge amount of
locking and data copying.
--- mmigrate.c.orig 2005-01-15 16:06:35.000000000 -0200
+++ mmigrate.c 2005-03-21 19:58:56.000000000 -0300
@@ -21,6 +21,9 @@
#include <linux/rmap.h>
#include <linux/mmigrate.h>
#include <linux/delay.h>
+#include <linux/idr.h>
+#include <linux/page-flags.h>
+#include <linux/swap.h>
/*
* The concept of memory migration is to replace a target page with
@@ -36,6 +39,462 @@
*/
+struct page * migrate_onepage_nonblock(struct page *);
+
+
+int is_page_busy(struct page *page)
+{
+
+ if (PageLocked(page))
+ return -EBUSY;
+
+ if (PageWriteback(page))
+ return -EAGAIN;
+
+ if (!PageLRU(page))
+ return -ENOENT;
+
+ if (PageReserved(page))
+ return -EBUSY;
+
+ if (page_count(page) != 1)
+ return -EBUSY;
+
+ if (page_mapping(page) == NULL)
+ return -ENOENT;
+
+ return 0;
+}
+
+#define total_nr_freeable (back_nr_freeable + fwd_nr_freeable)
+
+extern inline void extract_pages(struct page *, struct zone *,
+ unsigned int, unsigned int,
+ struct free_area *);
+
+extern inline void set_page_order(struct page *page, int order);
+
+
+static void moveback_to_lru(struct list_head *list, struct zone *zone)
+{
+ struct page *page;
+
+ list_for_each_entry(page, list, lru) {
+ __putback_page_to_lru(zone, page);
+ page_cache_release(page);
+ }
+}
+
+static void moveback_to_freelist(struct page *page, struct zone *zone, unsigned int order)
+{
+ int order_size = 1UL << order;
+
+ zone->free_pages += order_size;
+ set_page_order(page, order);
+ list_add_tail(&page->lru, &zone->free_area[order].free_list);
+ zone->free_area[order].nr_free++;
+}
+
+int coalesce_memory(struct zone *zone, unsigned int order, struct list_head *freedlist)
+{
+ unsigned int max_order_delta = 2;
+ unsigned int torder, nr_pages;
+ unsigned int back_nr_freeable = 0, fwd_nr_freeable = 0;
+ int ret = 0;
+ unsigned long flags;
+
+
+ printk(KERN_ERR "coalesce_memory!!\n");
+
+
+ for (torder = order-1; torder > max_order_delta; torder--) {
+ struct list_head *entry;
+ struct page *page, *pwalk, *tmp;
+ struct free_area *area = zone->free_area + torder;
+ int count = area->nr_free;
+ LIST_HEAD(page_list);
+ LIST_HEAD(freed_page_list);
+
+ nr_pages = (1UL << order) - (1UL << torder);
+
+ if (list_empty(&area->free_list))
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ entry = area->free_list.next;
+
+ while (--count > 0) {
+ unsigned int wcount, freed = 0;
+ unsigned int back_moved = 0, fwd_moved = 0;
+ struct page *tpage, *npage = NULL;
+ int err = 0;
+
+ INIT_LIST_HEAD(&page_list);
+ INIT_LIST_HEAD(&freed_page_list);
+
+ back_nr_freeable = 0;
+ fwd_nr_freeable = 0;
+
+ page = list_entry(entry, struct page, lru);
+
+ /* Look backwards */
+ for (wcount=1; wcount<=nr_pages; wcount++) {
+ pwalk = page - wcount;
+
+ if (!is_page_busy(pwalk))
+ back_nr_freeable++;
+ else {
+ err = 1;
+ break;
+ }
+
+ if (back_nr_freeable == nr_pages)
+ break;
+ }
+
+ if (err) {
+ entry = entry->next;
+ continue;
+ }
+
+ /* Look forward */
+ for (wcount = (1UL<<torder); wcount < nr_pages+(1UL<<torder);
+ wcount++) {
+
+ if (total_nr_freeable == nr_pages)
+ break;
+
+ pwalk = page + wcount;
+ if (!is_page_busy(pwalk))
+ fwd_nr_freeable++;
+ else {
+ err = 1;
+ break;
+ }
+ }
+
+ if (err) {
+ entry = entry->next;
+ continue;
+ }
+
+ /* Found enough freeable pages: remove the middle page from the
+ * free list, take the target pages off the LRU, and try to
+ * migrate them.
+ */
+
+ extract_pages(page, zone, 0, torder, area);
+
+ printk(KERN_ERR "extract page!\n");
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ spin_lock_irq(&zone->lru_lock);
+
+ for (tpage = page - back_nr_freeable; tpage < page; tpage++) {
+
+ if (back_moved == back_nr_freeable)
+ break;
+
+ if (likely(PageLRU(tpage))) {
+ if (__steal_page_from_lru(zone, tpage)) {
+ back_moved++;
+ list_add(&tpage->lru, &page_list);
+ continue;
+ }
+ }
+
+ moveback_to_lru(&page_list, zone);
+ spin_unlock_irq(&zone->lru_lock);
+
+ spin_lock_irqsave(&zone->lock, flags);
+ moveback_to_freelist(page, zone, torder);
+ err = 1;
+ }
+
+ if (err) {
+ entry = area->free_list.next;
+ continue;
+ }
+
+ for (tpage = page + (1UL<<torder);
+ tpage < page + (1UL<<torder) + fwd_nr_freeable; tpage++) {
+ if (fwd_moved == fwd_nr_freeable)
+ break;
+
+ if (likely(PageLRU(tpage))) {
+ if (__steal_page_from_lru(zone, tpage)) {
+ fwd_moved++;
+ list_add(&tpage->lru, &page_list);
+ continue;
+ }
+ }
+
+ moveback_to_lru(&page_list, zone);
+ spin_unlock_irq(&zone->lru_lock);
+
+ spin_lock_irqsave(&zone->lock, flags);
+ moveback_to_freelist(page, zone, torder);
+ err = 1;
+ break;
+ }
+
+ if (err) {
+ entry = area->free_list.next;
+ continue;
+ }
+
+ spin_unlock_irq(&zone->lru_lock);
+
+ list_for_each_entry_safe(tpage, tmp, &page_list, lru) {
+ list_del(&tpage->lru);
+ npage = migrate_onepage_nonblock(tpage);
+
+ printk(KERN_ERR "migrate page!\n");
+
+ if (IS_ERR(npage)) {
+ spin_lock_irq(&zone->lru_lock);
+ __putback_page_to_lru(zone, tpage);
+ page_cache_release(tpage);
+ moveback_to_lru(&page_list, zone);
+ spin_unlock_irq(&zone->lru_lock);
+
+ printk(KERN_ERR "migrate failure!\n");
+
+
+ list_for_each_entry(tpage, &freed_page_list, lru) {
+ // __putback_page_to_lru(zone, tpage);
+ if(page_count(tpage) != 1) {
+ printk(KERN_ERR "Damn, freed_list page has page_count!= 2 (%d) - flags:%lx\n", page_count(tpage), page->flags);
+ page_cache_release(tpage);
+ }
+ }
+
+ spin_lock_irqsave(&zone->lock, flags);
+ moveback_to_freelist(page, zone, torder);
+ break;
+ }
+
+ putback_page_to_lru(page_zone(npage), npage);
+ page_cache_release(npage);
+
+ list_add(&tpage->lru, &freed_page_list);
+ freed++;
+ }
+
+ if (freed == nr_pages) {
+
+ struct page *freed_page;
+ printk(KERN_ERR "successfully freed %d pages\n",
+ freed);
+ spin_lock_irqsave(&zone->lock, flags);
+ moveback_to_freelist(page, zone, torder);
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ list_for_each_entry(tpage, &freed_page_list, lru) {
+ // __putback_page_to_lru(zone, tpage);
+ if(page_count(tpage) != 1) {
+ printk(KERN_ERR "Damn, freed_list page has page_count!= 1 (%d) - flags:%lx\n", page_count(tpage), page->flags);
+ page_cache_release(tpage);
+ }
+ }
+
+ printk(KERN_ERR "successfully freed %d pages\n",
+ freed);
+
+ return 1;
+ } else {
+ struct page *freed_page;
+ spin_lock_irq(&zone->lru_lock);
+ list_for_each_entry(tpage, &freed_page_list, lru) {
+ // __putback_page_to_lru(zone, tpage);
+ BUG_ON(page_count(tpage) != 1);
+ page_cache_release(tpage);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ printk(KERN_ERR "could'nt free pages %d pages (only %d), bailing out!\n", nr_pages, freed);
+ return NULL;
+ }
+
+ }
+ }
+
+ printk(KERN_ERR "finished loop but failed to migrate any page!\n");
+
+ return 0;
+}
+
+
+struct counter {
+ int i;
+};
+
+struct idr migration_idr;
+
+static struct address_space_operations migration_aops = {
+ .writepage = NULL,
+ .sync_page = NULL,
+ .set_page_dirty = __set_page_dirty_nobuffers,
+};
+
+static struct backing_dev_info migration_backing_dev_info = {
+ .memory_backed = 1, /* Does not contribute to dirty memory */
+ .unplug_io_fn = NULL,
+};
+
+struct address_space migration_space = {
+ .page_tree = RADIX_TREE_INIT(GFP_ATOMIC),
+ .tree_lock = RW_LOCK_UNLOCKED,
+ .a_ops = &migration_aops,
+ .flags = GFP_HIGHUSER,
+ .i_mmap_nonlinear = LIST_HEAD_INIT(migration_space.i_mmap_nonlinear),
+ .backing_dev_info = &migration_backing_dev_info,
+};
+
+int init_migration_cache(void)
+{
+ idr_init(&migration_idr);
+
+ return 0;
+}
+
+__initcall(init_migration_cache);
+
+struct page *lookup_migration_cache(int id)
+{
+ return find_get_page(&migration_space, id);
+}
+
+void migration_duplicate(swp_entry_t entry)
+{
+ struct counter *cnt;
+
+ read_lock_irq(&migration_space.tree_lock);
+
+ cnt = idr_find(&migration_idr, swp_offset(entry));
+ cnt->i = cnt->i + 1;
+
+ read_unlock_irq(&migration_space.tree_lock);
+}
+
+void remove_from_migration_cache(struct page *page, int id)
+{
+ write_lock_irq(&migration_space.tree_lock);
+ idr_remove(&migration_idr, id);
+ radix_tree_delete(&migration_space.page_tree, id);
+ ClearPageSwapCache(page);
+ page->private = NULL;
+ total_migration_pages--;
+ write_unlock_irq(&migration_space.tree_lock);
+}
+
+// FIXME: if the page is locked will it be correctly removed from migr cache?
+// check races
+
+void migration_remove_entry(swp_entry_t entry)
+{
+ struct page *page;
+
+ page = find_get_page(&migration_space, entry.val);
+
+ if (!page)
+ BUG();
+
+ lock_page(page);
+
+ migration_remove_reference(page, 1);
+
+ unlock_page(page);
+
+ page_cache_release(page);
+}
+
+int migration_remove_reference(struct page *page, int dec)
+{
+ struct counter *c;
+ swp_entry_t entry;
+
+ entry.val = page->private;
+
+ read_lock_irq(&migration_space.tree_lock);
+
+ c = idr_find(&migration_idr, swp_offset(entry));
+
+ read_unlock_irq(&migration_space.tree_lock);
+
+ BUG_ON(c->i < dec);
+
+ c->i -= dec;
+
+ if (!c->i) {
+ printk(KERN_ERR "removing page from migration cache!\n");
+ remove_from_migration_cache(page, page->private);
+ kfree(c);
+ page_cache_release(page);
+ }
+}
+
+int detach_from_migration_cache(struct page *page)
+{
+
+ lock_page(page);
+ migration_remove_reference(page, 0);
+ unlock_page(page);
+
+ return 0;
+}
+
+int add_to_migration_cache(struct page *page, int gfp_mask)
+{
+ int error, offset;
+ struct counter *counter;
+ swp_entry_t entry;
+
+ BUG_ON(PageSwapCache(page));
+
+ BUG_ON(PagePrivate(page));
+
+ if (idr_pre_get(&migration_idr, GFP_ATOMIC) == 0)
+ return -ENOMEM;
+
+ counter = kmalloc(sizeof(struct counter), GFP_KERNEL);
+
+ if (!counter)
+ return -ENOMEM;
+
+ error = radix_tree_preload(gfp_mask);
+
+ counter->i = 0;
+
+ printk(KERN_ERR "adding to migration cache!\n");
+
+ if (!error) {
+ write_lock_irq(&migration_space.tree_lock);
+ error = idr_get_new_above(&migration_idr, counter, 1, &offset);
+
+ if (error < 0)
+ BUG();
+
+ entry = swp_entry(MIGRATION_TYPE, offset);
+
+ error = radix_tree_insert(&migration_space.page_tree, entry.val,
+ page);
+ if (!error) {
+ page_cache_get(page);
+ SetPageLocked(page);
+ page->private = entry.val;
+ total_migration_pages++;
+ SetPageSwapCache(page);
+ }
+ write_unlock_irq(&migration_space.tree_lock);
+ radix_tree_preload_end();
+
+ }
+
+ return error;
+}
+
/*
* Try to writeback a dirty page to free its buffers.
*/
@@ -121,9 +580,11 @@
if (PageWriteback(page))
return -EAGAIN;
/* The page might have been truncated */
- truncated = !PageSwapCache(newpage) && page_mapping(page) == NULL;
- if (page_count(page) + truncated <= freeable_page_count)
+ truncated = !PageSwapCache(newpage) &&
+ page_mapping(page) == NULL;
+ if (page_count(page) + truncated <= freeable_page_count)
return truncated ? -ENOENT : 0;
+
return -EAGAIN;
}
@@ -133,7 +594,7 @@
*/
int
migrate_page_common(struct page *page, struct page *newpage,
- struct list_head *vlist)
+ struct list_head *vlist, int block)
{
long timeout = 5000; /* XXXX */
int ret;
@@ -149,6 +610,8 @@
case -EBUSY:
return ret;
case -EAGAIN:
+ if (!block)
+ return ret;
writeback_and_free_buffers(page);
unlock_page(page);
msleep(10);
@@ -268,6 +731,136 @@
return 0;
}
+
+/*
+ * Try to migrate one page. Returns non-zero on failure.
+ * - Lock for the page must be held when invoked.
+ * - The page must be attached to an address_space.
+ */
+int
+generic_migrate_page_nonblock(struct page *page, struct page *newpage,
+ int (*migrate_fn)(struct page *, struct page *, struct list_head *, int))
+{
+ int ret;
+
+ /*
+ * Make sure that the newpage must be locked and keep not up-to-date
+ * during the page migration, so that it's guaranteed that all
+ * accesses to the newpage will be blocked until everything has
+ * become ok.
+ */
+ if (TestSetPageLocked(newpage))
+ BUG();
+
+ if ((ret = replace_pages(page, newpage)))
+ goto out_removing;
+
+ /*
+ * With cleared PTEs, any accesses via the PTEs to the page
+ * can be caught and blocked in a pagefault handler.
+ */
+ if (page_mapped(page)) {
+ if ((ret = try_to_unmap(page, NULL)) != SWAP_SUCCESS) {
+ ret = -EBUSY;
+ goto out_busy;
+ }
+ }
+
+ if (PageSwapCache(page)) {
+ /*
+ * The page is not mapped from anywhere now.
+ * Detach it from the swapcache completely.
+ */
+ ClearPageSwapCache(page);
+ page->private = 0;
+ page->mapping = NULL;
+ }
+
+ ret = migrate_fn(page, newpage, NULL, 0);
+ switch (ret) {
+ default:
+ /* The page is busy. Try it later. */
+ goto out_busy;
+ case -ENOENT:
+ /* The file the page belongs to has been truncated. */
+ page_cache_get(page);
+ page_cache_release(newpage);
+ newpage->mapping = NULL;
+ /* fall thru */
+ case 0:
+ /* fall thru */
+ }
+
+ arch_migrate_page(page, newpage);
+
+ if (PageError(page))
+ SetPageError(newpage);
+ if (PageReferenced(page))
+ SetPageReferenced(newpage);
+ if (PageActive(page)) {
+ SetPageActive(newpage);
+ ClearPageActive(page);
+ }
+ if (PageMappedToDisk(page))
+ SetPageMappedToDisk(newpage);
+ if (PageChecked(page))
+ SetPageChecked(newpage);
+ if (PageUptodate(page))
+ SetPageUptodate(newpage);
+ if (PageDirty(page)) {
+ clear_page_dirty_for_io(page);
+ /* this will make a whole page dirty (if it has buffers) */
+ set_page_dirty(newpage);
+ }
+ if (PagePrivate(newpage)) {
+ BUG_ON(newpage->mapping == NULL);
+ unlock_page_buffer(newpage);
+ }
+
+ if (PageWriteback(newpage))
+ BUG();
+
+ unlock_page(newpage);
+
+ /* map the newpage where the old page have been mapped. */
+ if (PageMigration(newpage))
+ detach_from_migration_cache(newpage);
+ else if (PageSwapCache(newpage)) {
+ lock_page(newpage);
+ __remove_exclusive_swap_page(newpage, 1);
+ unlock_page(newpage);
+ }
+
+ page->mapping = NULL;
+ unlock_page(page);
+ page_cache_release(page);
+
+ return 0;
+
+out_busy:
+ /* Roll back all operations. */
+ unwind_page(page, newpage);
+/* touch_unmapped_address(&vlist);
+ if (PageMigration(page))
+ detach_from_migration_cache(page);
+ else if (PageSwapCache(page)) {
+ lock_page(page);
+ __remove_exclusive_swap_page(page, 1);
+ unlock_page(page);
+ } */
+
+ return ret;
+
+out_removing:
+ if (PagePrivate(newpage))
+ BUG();
+ unlock_page(page);
+ unlock_page(newpage);
+ if (PageMigration(page))
+ detach_from_migration_cache(page);
+ return ret;
+}
+
/*
* Try to migrate one page. Returns non-zero on failure.
* - Lock for the page must be held when invoked.
@@ -275,7 +868,7 @@
*/
int
generic_migrate_page(struct page *page, struct page *newpage,
- int (*migrate_fn)(struct page *, struct page *, struct list_head *))
+ int (*migrate_fn)(struct page *, struct page *, struct list_head *, int))
{
LIST_HEAD(vlist);
int ret;
@@ -317,7 +910,7 @@
}
/* Wait for all operations against the page to finish. */
- ret = migrate_fn(page, newpage, &vlist);
+ ret = migrate_fn(page, newpage, &vlist, 1);
switch (ret) {
default:
/* The page is busy. Try it later. */
@@ -367,7 +960,9 @@
/* map the newpage where the old page have been mapped. */
touch_unmapped_address(&vlist);
- if (PageSwapCache(newpage)) {
+ if (PageMigration(newpage))
+ detach_from_migration_cache(newpage);
+ else if (PageSwapCache(newpage)) {
lock_page(newpage);
__remove_exclusive_swap_page(newpage, 1);
unlock_page(newpage);
@@ -383,7 +978,9 @@
/* Roll back all operations. */
unwind_page(page, newpage);
touch_unmapped_address(&vlist);
- if (PageSwapCache(page)) {
+ if (PageMigration(page))
+ detach_from_migration_cache(page);
+ else if (PageSwapCache(page)) {
lock_page(page);
__remove_exclusive_swap_page(page, 1);
unlock_page(page);
@@ -396,6 +993,8 @@
BUG();
unlock_page(page);
unlock_page(newpage);
+ if (PageMigration(page))
+ detach_from_migration_cache(page);
return ret;
}
@@ -417,10 +1016,14 @@
*/
#ifdef CONFIG_SWAP
if (PageAnon(page) && !PageSwapCache(page))
- if (!add_to_swap(page, GFP_KERNEL)) {
+ if (add_to_migration_cache(page, GFP_KERNEL)) {
unlock_page(page);
return ERR_PTR(-ENOSPC);
}
+/* if (!add_to_swap(page, GFP_KERNEL)) {
+ unlock_page(page);
+ return ERR_PTR(-ENOSPC);
+ } */
#endif /* CONFIG_SWAP */
if ((mapping = page_mapping(page)) == NULL) {
/* truncation is in progress */
@@ -440,8 +1043,9 @@
return ERR_PTR(-ENOMEM);
}
- if (mapping->a_ops->migrate_page)
+ if (mapping->a_ops && mapping->a_ops->migrate_page) {
ret = mapping->a_ops->migrate_page(page, newpage);
+ }
else
ret = generic_migrate_page(page, newpage, migrate_page_common);
if (ret) {
@@ -454,6 +1058,59 @@
return newpage;
}
+/*
+ * migrate_onepage_nonblock() is equivalent to migrate_onepage() but fails
+ * if the page is busy.
+ */
+struct page *
+migrate_onepage_nonblock(struct page *page)
+{
+ struct page *newpage;
+ struct address_space *mapping;
+ int ret;
+
+ if (TestSetPageLocked(page))
+ return ERR_PTR(-EBUSY);
+
+ /*
+ * Put the page in a radix tree if it isn't in the tree yet.
+ */
+ if (PageAnon(page) && !PageSwapCache(page)) {
+ if (add_to_migration_cache(page, GFP_KERNEL)) {
+ unlock_page(page);
+ return ERR_PTR(-ENOSPC);
+ }
+ }
+
+ if ((mapping = page_mapping(page)) == NULL)
+ return ERR_PTR(-ENOENT);
+
+ /*
+ * Allocate a new page with the same gfp_mask
+ * as the target page has.
+ */
+ newpage = page_cache_alloc(mapping);
+ if (newpage == NULL) {
+ unlock_page(page);
+ return ERR_PTR(-ENOMEM);
+ }
+
+
+ if (mapping->a_ops && mapping->a_ops->migrate_page) {
+ ret = mapping->a_ops->migrate_page(page, newpage);
+ }
+ else
+ ret = generic_migrate_page_nonblock(page, newpage, migrate_page_common);
+ if (ret) {
+ BUG_ON(page_count(newpage) != 1);
+ page_cache_release(newpage);
+ return ERR_PTR(ret);
+ }
+ BUG_ON(page_count(page) != 1);
+// page_cache_release(page);
+ return newpage;
+}
+
static inline int
need_writeback(struct page *page)
{
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-04 2:15 ` Nick Piggin
@ 2005-06-05 19:52 ` David S. Miller, Nick Piggin
0 siblings, 0 replies; 42+ messages in thread
From: David S. Miller, Nick Piggin @ 2005-06-05 19:52 UTC (permalink / raw)
To: nickpiggin; +Cc: herbert, mbligh, jschopp, mel, linux-mm, linux-kernel, akpm
> Herbert Xu wrote:
> > With Dave's latest super-TSO patch, TCP over loopback will only be
> > doing order-0 allocations in the common case. UDP and others may
> > still do large allocations but that logic is all localised in
> > ip_append_data.
> >
> > So if we wanted we could easily remove most large allocations over
> > the loopback device.
>
> I would be very interested to look into that. I would be
> willing to do benchmarks on a range of machines too if
> that would be of any use to you.
Even without the super-TSO patch, we never do larger than
PAGE_SIZE allocations for sendmsg() when the device is
scatter-gather capable (as indicated in netdev->flags).
Loopback does set this bit.
This PAGE_SIZE limit comes from net/ipv4/tcp.c:select_size().
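For reference, the shape of that check is roughly the sketch below
(reconstructed from memory of 2.6-era code, so the exact field and macro
names should be treated as approximate rather than authoritative):

/*
 * Rough sketch of the select_size() idea: when the route is
 * scatter-gather capable, cap the skb's linear area at one page so
 * large sendmsg() payloads are built from page fragments instead of
 * a single high-order allocation.
 */
static inline int select_size_sketch(struct sock *sk, struct tcp_sock *tp)
{
        int tmp = tp->mss_cache;        /* default: one MSS in the head */

        if (sk->sk_route_caps & NETIF_F_SG) {
                int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);

                /* Clamp the head to one page if the remainder still
                 * fits into the page fragments. */
                if (tmp >= pgbreak &&
                    tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
                        tmp = pgbreak;
        }
        return tmp;
}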
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-03 14:00 ` Martin J. Bligh
@ 2005-06-08 17:03 ` Mel Gorman
2005-06-08 17:18 ` Martin J. Bligh
0 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2005-06-08 17:03 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Nick Piggin, jschopp, linux-mm, lkml, Andrew Morton
On Fri, 3 Jun 2005, Martin J. Bligh wrote:
> > Does it need more documentation? If so, I'll write up a detailed blurb on
> > how it works and drop it into Documentation/
> >
> >> Although I can't argue that a buddy allocator is no good without
> >> being able to satisfy higher order allocations.
> >
> > Unfortunately, it is a fundamental flaw of the buddy allocator that it
> > fragments badly. The thing is, other allocators that do not fragment are
> > also slower.
>
> Do we care? 99.9% of allocations are fronted by the hot/cold page cache
> now anyway ...
Very true, but only for order-0 allocations. As it is, higher order
allocations are a lot less important because Linux has always avoided them
unless absolutely necessary. I would like to reach the point where we can
reliably allocate large blocks of memory so we do not have to split large
amounts of data into page-sized chunks all the time.
> and yes, I realise that things popping in/out of that
> obviously aren't going into the "defrag" pool, but still, it should help.
> I suppose all we're slowing down is higher order allocs anyway, which
> is the uncommon case, but ... worth thinking about.
>
I did measure it and there is a slow-down on high order allocations which
is not very surprising. The following is the result of a micro-benchmark
comparing the standard and modified allocator for 1500 order-5
allocations.
Standard
Average Max Min Allocs
------- --- --- ------
0.73 1.09 0.53 1476
1.33 1.87 1.10 23
2.10 2.10 2.10 1
Modified
Average Max Min Allocs
------- --- --- ------
0.82 1.23 0.60 1440
1.36 1.96 1.23 57
2.42 2.92 2.09 3
The average, max and min are in 1000's of clock cycles for an allocation
so there is not a massive difference between the two allocators. Aim9
still shows that overall, the modified allocator is as fast as the normal
allocator.
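A per-allocation timing of this kind can be gathered along the lines of
the sketch below (an illustrative harness only, not the one used for the
numbers above):

/*
 * Illustrative only: time a single order-5 allocation in CPU cycles.
 */
#include <linux/gfp.h>
#include <asm/timex.h>

static void time_order5_alloc(void)
{
        cycles_t t0, t1;
        struct page *page;

        t0 = get_cycles();
        page = alloc_pages(GFP_KERNEL, 5);      /* order-5: 32 contiguous pages */
        t1 = get_cycles();

        if (page) {
                printk(KERN_INFO "order-5 allocation took %llu cycles\n",
                       (unsigned long long)(t1 - t0));
                __free_pages(page, 5);
        }
}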
High order allocations do slow down a lot when under memory pressure and
neither allocator performs very well although the modified allocator
probably performs worse as it has more lists to search. In the case of the
placement policy though, I can work on the linear scanning patch to avoid
using a blunderbuss on memory. With the standard allocator, linear scanning
will not help significantly because non-reclaimable memory is scattered
all over the place.
I have also found that the modified allocator can fairly reliably allocate
memory on a desktop system which has been running a full day where the
standard allocator cannot. However, that experience is subjective and
benchmarks based on loads like kernel compiles will not be anything like a
desktop system. At the very least, kernel compiles, while they load the
system, will not pin memory used for PTEs like a desktop running
long-lived applications would.
I'll work on reproducing scenarios where the standard allocator fails to
allocate large blocks of memory without paging everything out, but where
the placement policy succeeds.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-08 17:03 ` Mel Gorman
@ 2005-06-08 17:18 ` Martin J. Bligh
2005-06-10 16:20 ` Christoph Lameter
0 siblings, 1 reply; 42+ messages in thread
From: Martin J. Bligh @ 2005-06-08 17:18 UTC (permalink / raw)
To: Mel Gorman; +Cc: Nick Piggin, jschopp, linux-mm, lkml, Andrew Morton
>> > Unfortunately, it is a fundamental flaw of the buddy allocator that it
>> > fragments badly. The thing is, other allocators that do not fragment are
>> > also slower.
>>
>> Do we care? 99.9% of allocations are fronted by the hot/cold page cache
>> now anyway ...
>
> Very true, but only for order-0 allocations. As it is, higher order
> allocations are a lot less important because Linux has always avoided them
> unless absolutely necessary. I would like to reach the point where we can
> reliably allocate large blocks of memory so we do not have to split large
> amounts of data into page-sized chunks all the time.
Right. I agree that large allocs should be reliable. Whether we care so
much about if they're performant or not, I don't know ... is an interesting
question. I think the answer is maybe not, within reason. The cost of
fishing in the allocator might well be irrelevant compared to the cost
of freeing the necessary memory area?
> I did measure it and there is a slow-down on high order allocations which
> is not very surprising. The following is the result of a micro-benchmark
> comparing the standard and modified allocator for 1500 order-5
> allocations.
>
> Standard
> Average Max Min Allocs
> ------- --- --- ------
> 0.73 1.09 0.53 1476
> 1.33 1.87 1.10 23
> 2.10 2.10 2.10 1
>
> Modified
> Average Max Min Allocs
> ------- --- --- ------
> 0.82 1.23 0.60 1440
> 1.36 1.96 1.23 57
> 2.42 2.92 2.09 3
>
> The average, max and min are in 1000's of clock cycles for an allocation
> so there is not a massive difference between the two allocators. Aim9
> still shows that overall, the modified allocator is as fast as the normal
> allocator.
Mmmm. that doesn't look too bad at all to me.
> High order allocations do slow down a lot when under memory pressure and
> neither allocator performs very well although the modified allocator
> probably performs worse as it has more lists to search. In the case of the
> placement policy though, I can work on the linear scanning patch to avoid
> using a blunderbuss on memory. With the standard allocator, linear scanning
> will not help significantly because non-reclaimable memory is scattered
> all over the place.
>
> I have also found that the modified allocator can fairly reliably allocate
> memory on a desktop system which has been running a full day where the
> standard allocator cannot. However, that experience is subjective and
> benchmarks based on loads like kernel compiles will not be anything like a
> desktop system. At the very least, kernel compiles, while they load the
> system, will not pin memory used for PTEs like a desktop running
> long-lived applications would.
>
> I'll work on reproducing scenarios where the standard allocator fails to
> allocate large blocks of memory without paging everything out, but where
> the placement policy succeeds.
Sounds great ... would be really valuable to get those testcases.
M.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-08 17:18 ` Martin J. Bligh
@ 2005-06-10 16:20 ` Christoph Lameter
2005-06-10 17:53 ` Steve Lord
0 siblings, 1 reply; 42+ messages in thread
From: Christoph Lameter @ 2005-06-10 16:20 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Mel Gorman, Nick Piggin, jschopp, linux-mm, lkml, Andrew Morton
On Wed, 8 Jun 2005, Martin J. Bligh wrote:
> Right. I agree that large allocs should be reliable. Whether we care so
> much about if they're performant or not, I don't know ... is an interesting
> question. I think the answer is maybe not, within reason. The cost of
> fishing in the allocator might well be irrelevant compared to the cost
> of freeing the necessary memory area?
Large consecutive page allocation is important for I/O. Lots of drivers
are able to issue transfer requests spanning multiple pages which is only
possible if the pages are in sequence. If memory is fragmented then this
is no longer possible.
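As a toy illustration of the effect (not taken from any driver; the
helper name is made up), the number of scatter/gather elements a buffer
needs depends directly on how physically contiguous its pages are:

#include <linux/mm.h>
#include <asm/io.h>

/*
 * Illustration only: count the scatter/gather elements needed to cover
 * nr_pages pages.  Physically adjacent pages can be merged into one
 * element, so a fully contiguous buffer needs a single element while a
 * fully fragmented one needs nr_pages of them - and every device caps
 * the number of elements it accepts per request.
 */
static int sg_elements_needed(struct page **pages, int nr_pages)
{
        int i, segs = 1;                /* assumes nr_pages >= 1 */

        for (i = 1; i < nr_pages; i++)
                if (page_to_phys(pages[i]) !=
                    page_to_phys(pages[i - 1]) + PAGE_SIZE)
                        segs++;         /* a discontiguity starts a new element */

        return segs;
}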
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Avoiding external fragmentation with a placement policy Version 12
2005-06-10 16:20 ` Christoph Lameter
@ 2005-06-10 17:53 ` Steve Lord
0 siblings, 0 replies; 42+ messages in thread
From: Steve Lord @ 2005-06-10 17:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin J. Bligh, Mel Gorman, Nick Piggin, jschopp, linux-mm,
lkml, Andrew Morton
Christoph Lameter wrote:
> On Wed, 8 Jun 2005, Martin J. Bligh wrote:
>
>
>>Right. I agree that large allocs should be reliable. Whether we care so
>>much about if they're performant or not, I don't know ... is an interesting
>>question. I think the answer is maybe not, within reason. The cost of
>>fishing in the allocator might well be irrelevant compared to the cost
>>of freeing the necessary memory area?
>
>
> Large consecutive page allocation is important for I/O. Lots of drivers
> are able to issue transfer requests spanning multiple pages which is only
> possible if the pages are in sequence. If memory is fragmented then this
> is no longer possible.
Which I think is one of the reasons Mel set off down this path
in the first place. Scatter-gather only gets you so far, and
it makes the DMA engine work harder. We have seen cases where
Windows can get more bandwidth out of fiber channel raids than
Linux can; Windows was also using fewer, larger SCSI commands.
Keep a Linux box busy for a few days and its memory map gets
very fragmented; requests to the SCSI layer which could have been
larger tend to get limited by the maximum number of scatter-gather
elements a device can handle. Some less powerful raids (Apple Xraids
for example) become CPU bound rather than I/O bound when you do this.
In this case what tends to help is if processes are given their
address space in large, physically contiguous chunks of pages.
Steve
^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2005-06-10 17:53 UTC | newest]
Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-31 11:20 Avoiding external fragmentation with a placement policy Version 12 Mel Gorman
2005-06-01 20:55 ` Joel Schopp
2005-06-01 23:09 ` Nick Piggin
2005-06-01 23:23 ` David S. Miller, Nick Piggin
2005-06-01 23:28 ` Martin J. Bligh
2005-06-01 23:43 ` Nick Piggin
2005-06-02 0:02 ` Martin J. Bligh
2005-06-02 0:20 ` Nick Piggin
2005-06-02 13:55 ` Mel Gorman
2005-06-02 15:52 ` Joel Schopp
2005-06-02 19:50 ` Ray Bryant
2005-06-02 20:10 ` Joel Schopp
2005-06-04 16:09 ` Marcelo Tosatti
2005-06-03 3:48 ` Nick Piggin
2005-06-03 4:49 ` David S. Miller, Nick Piggin
2005-06-03 5:34 ` Martin J. Bligh
2005-06-03 5:37 ` David S. Miller, Martin J. Bligh
2005-06-03 5:42 ` Martin J. Bligh
2005-06-03 5:51 ` David S. Miller, Martin J. Bligh
2005-06-03 13:13 ` Mel Gorman
2005-06-03 6:43 ` Nick Piggin
2005-06-03 13:57 ` Martin J. Bligh
2005-06-03 16:43 ` Dave Hansen
2005-06-03 18:43 ` David S. Miller, Dave Hansen
2005-06-04 1:44 ` Herbert Xu
2005-06-04 2:15 ` Nick Piggin
2005-06-05 19:52 ` David S. Miller, Nick Piggin
2005-06-03 13:05 ` Mel Gorman
2005-06-03 14:00 ` Martin J. Bligh
2005-06-08 17:03 ` Mel Gorman
2005-06-08 17:18 ` Martin J. Bligh
2005-06-10 16:20 ` Christoph Lameter
2005-06-10 17:53 ` Steve Lord
2005-06-02 18:28 ` Andi Kleen
2005-06-02 18:42 ` Martin J. Bligh
2005-06-02 13:15 ` Mel Gorman
2005-06-02 14:01 ` Martin J. Bligh
[not found] ` <20050603174706.GA25663@localhost.localdomain>
2005-06-03 17:56 ` Martin J. Bligh
2005-06-01 23:47 ` Mike Kravetz
2005-06-01 23:56 ` Nick Piggin
2005-06-02 0:07 ` Mike Kravetz
2005-06-02 9:49 ` Mel Gorman