linux-mm.kvack.org archive mirror
* [PATCH 0/5] Light Fragmentation Avoidance V20
@ 2005-11-15 16:49 Mel Gorman
  2005-11-15 16:49 ` [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags Mel Gorman
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-15 16:49 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, mingo, lhms-devel, linux-kernel, nickpiggin

This is a much simplified anti-defragmentation approach that tries to keep
kernel allocations in groups of 2^(MAX_ORDER-1) pages and easily reclaimed
allocations in groups of 2^(MAX_ORDER-1) pages. It uses no balancing, no
tunables and no special reserves, and it introduces no new branches in the
main path. For small memory systems, it can be disabled via a config option.
In total, it adds 275 new lines of code with minimal changes to the main path.

It is fast and reduces fragmentation considerably, giving a best effort at low
fragmentation rather than hard guarantees. High order allocation stress tests
show that this mechanism is pretty good in practice. A zone-based system
approach based on this would give hard guarantees on the ability to reclaim
a region for hotplug or hugetlb pages with best-effort everywhere else. This
set of patches is aimed at giving best effort low fragmentation everywhere,
including zones the kernel is using.

Full benchmarks are below the Changelog, but here is the summary

			clean		anti-defrag
Kernel extract		15		16		-1 second
Kernel build		736		735		+1 second
aim9-page_test		131847.72	132832.86	+0.75%
aim9-brk_test		621709.43	623125.62	+0.23%
HugePages under load	73		113		+54%
HugePages after tests	78		118		+51%

In general, AIM9 shows the allocator is comparable.  The difference between
the kernel compiles is negligible, indicating that this approach does not
impact the common code paths. The lack of even higher success rates on
HugePage allocations was due to the presence of per-cpu pages.

When anti-defrag is disabled via the config option, it behaves and performs
just like the standard allocator with no performance impact. Comments on its
complexity (still too complex?) and alternative benchmarks are appreciated.

Diffstat for full set of patches
 fs/buffer.c             |    3 
 fs/compat.c             |    2 
 fs/exec.c               |    2 
 fs/inode.c              |    2 
 include/asm-i386/page.h |    3 
 include/linux/gfp.h     |   13 +-
 include/linux/highmem.h |    3 
 include/linux/mmzone.h  |   67 ++++++++++
 init/Kconfig            |   12 +
 mm/memory.c             |    6 
 mm/page_alloc.c         |  305 ++++++++++++++++++++++++++++++++++++++----------
 mm/shmem.c              |    4 
 mm/swap_state.c         |    3 
 13 files changed, 350 insertions(+), 75 deletions(-)

Changelog since complex anti-defragmentation v19
o Updated to 2.6.14-mm2
o Removed the fallback area and balancing code
o Only differentiate between kernel and easy reclaimable allocations
  - Removes almost all the code that deals with usemaps
  - Made a number of simplifications based on two allocation lists
  - Fallback code is drastically simpler
o Do not change behaviour for high-order allocations
o Drop stats patch - unnecessary complications

Changelog since v18
o Resync against 2.6.14-rc5-mm1
o 004_markfree dropped
o Documentation note added on the behavior of free_area.nr_free

Changelog since v17
o Update to 2.6.14-rc4-mm1
o Remove explicit casts where implicit casts were in place
o Change __GFP_USER to __GFP_EASYRCLM, RCLM_USER to RCLM_EASY and PCPU_USER to
  PCPU_EASY
o Print a warning and return NULL if both RCLM flags are set in the GFP flags
o Reduce size of fallback_allocs
o Change magic number 64 to FREE_AREA_USEMAP_SIZE
o CodingStyle regressions cleanup
o Move sparsemem setup_usemap() out of header
o Changed fallback_balance to a mechanism that depended on zone->present_pages
  to avoid hotplug problems later
o Many superfluous parenthesis removed

Changelog since v16
o Variables using bit operations now are unsigned long. Note that when used
  as indices, they are integers and cast to unsigned long when necessary.
  This is because aim9 shows regressions when used as unsigned longs 
  throughout (~10% slowdown)
o 004_showfree added to provide more debugging information
o 008_stats dropped. Even with CONFIG_ALLOCSTATS disabled, it is causing 
  severe performance regressions. No explanation as to why
o for_each_rclmtype_order moved to header
o More coding style cleanups

Changelog since V14 (V15 not released)
o Update against 2.6.14-rc3
o Resync with Joel's work. All suggestions made on fix-ups to his last
  set of patches should also be in here. e.g. __GFP_USER is still __GFP_USER
  but is better commented.
o Large amount of CodingStyle, readability cleanups and corrections pointed
  out by Dave Hansen.
o Fix CONFIG_NUMA error that corrupted per-cpu lists
o Patches broken out to have one-feature-per-patch rather than
  more-code-per-patch
o Fix fallback bug where pages for RCLM_NORCLM end up on random other
  free lists.

Changelog since V13
o Patches are now broken out
o Added per-cpu draining of userrclm pages
o Brought the patch more in line with memory hotplug work
o Fine-grained use of the __GFP_USER and __GFP_KERNRCLM flags
o Many coding-style corrections
o Many whitespace-damage corrections

Changelog since V12
o Minor whitespace damage fixed as pointed by Joel Schopp

Changelog since V11
o Mainly a rediff against 2.6.12-rc5
o Use #defines for indexing into pcpu lists
o Fix rounding error in the size of usemap

Changelog since V10
o All allocation types now use per-cpu caches like the standard allocator
o Removed all the additional buddy allocator statistic code
o Eliminated three zone fields that could be done without
o Simplified some loops
o Removed many unnecessary calculations

Changelog since V9
o Tightened what pools are used for fallbacks, less likely to fragment
o Many micro-optimisations to have the same performance as the standard 
  allocator. Modified allocator now faster than standard allocator using
  gcc 3.3.5
o Add counter for splits/coalescing

Changelog since V8
o rmqueue_bulk() allocates pages in large blocks and breaks them up into the
  requested size. Reduces the number of calls to __rmqueue()
o Beancounters are now a configurable option under "Kernel Hacking"
o Broke out some code into inline functions to be more Hotplug-friendly
o Increased the size of reserve for fallbacks from 10% to 12.5%. 

Changelog since V7
o Updated to 2.6.11-rc4
o Lots of cleanups, mainly related to beancounters
o Fixed up a miscalculation in the bitmap size as pointed out by Mike Kravetz
  (thanks Mike)
o Introduced a 10% reserve for fallbacks. Drastically reduces the number of
  kernnorclm allocations that go to the wrong places
o Don't trigger OOM when large allocations are involved

Changelog since V6
o Updated to 2.6.11-rc2
o Minor change to allow prezeroing to be a cleaner looking patch

Changelog since V5
o Fixed up gcc-2.95 errors
o Fixed up whitespace damage

Changelog since V4
o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
  with offsets to 2.6.11-rc1-mm1

Changelog since V3
o inlined get_pageblock_type() and set_pageblock_type()
o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone()
o When taking from the global pool, do not scan all the low-order lists

Changelog since V2
o Do not interfere with the "min" decay
o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
  anything to do with asynchronous IO
  
Changelog since V1
o Update patch to 2.6.11-rc1
o Cleaned up bug where memory was wasted on a large bitmap
o Remove code that needed the binary buddy bitmaps
o Update flags to avoid colliding with __GFP_ZERO changes
o Extended fallback_count bean counters to show the fallback count for each
  allocation type
o In-code documentation

Version 1
o Initial release against 2.6.9

This patch is designed to reduce fragmentation in the standard buddy allocator
without impairing the performance of the allocator. High fragmentation
in the standard binary buddy allocator means that high-order allocations
can rarely be serviced. This patch works by dividing allocations into two
different types:

EasyReclaimable - These are userspace pages that are easily reclaimable. This
	flag is set when it is known that the pages will be trivially reclaimed
	by writing the page out to swap or syncing with backing storage

KernelNonReclaimable - These are pages that are allocated by the kernel that
	are not trivially reclaimed. For example, the memory allocated for a
	loaded module would be in this category. By default, allocations are
	considered to be of this type

Instead of having one global MAX_ORDER-sized array of free lists, there are
two, one for each type of allocation. Once a 2^MAX_ORDER block of pages is
split for a type of allocation, it is added to the free-lists for that type,
in effect reserving it. Hence, over time, pages of the different types can
be clustered together.

When the preferred freelists are exhausted, the largest possible block is
taken from the alternative list. Buddies that are split from that large block
are placed on the preferred allocation-type freelists to mitigate fragmentation.
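
To make the placement policy concrete, here is a minimal userspace sketch.
The names are hypothetical and it is not the kernel code; the real
implementation is in the patches that follow:

	/* Minimal sketch of the placement policy described above */
	#include <stdio.h>

	#define MAX_ORDER   11
	#define RCLM_NORCLM 0	/* kernel, not easily reclaimed */
	#define RCLM_EASY   1	/* userspace/buffers, easily reclaimed */
	#define RCLM_TYPES  2

	/* one array of free lists per allocation type */
	static int nr_free[RCLM_TYPES][MAX_ORDER];

	/* __GFP_EASYRCLM set -> RCLM_EASY, otherwise RCLM_NORCLM */
	static int alloctype(int easyrclm)
	{
		return easyrclm ? RCLM_EASY : RCLM_NORCLM;
	}

	/*
	 * Search the preferred lists first. On failure, fall back to the
	 * other type, starting from the largest block so that the split
	 * remainder stays clustered with the preferred type.
	 */
	static int find_block(int order, int type)
	{
		int o;

		for (o = order; o < MAX_ORDER; o++)
			if (nr_free[type][o])
				return o;
		for (o = MAX_ORDER - 1; o >= order; o--)
			if (nr_free[!type][o])
				return o;
		return -1;
	}

	int main(void)
	{
		/* one large block currently reserved for kernel allocations */
		nr_free[RCLM_NORCLM][MAX_ORDER - 1] = 1;
		printf("easyrclm order-0 falls back to order %d\n",
		       find_block(0, alloctype(1)));
		return 0;
	}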

Four benchmark results are included, all based on a 2.6.14-mm2 kernel compiled
with gcc 3.4. These benchmarks were run in the order you see them *without*
rebooting. This means that by the time the final high-order stress test runs,
the system is already carrying any fragmentation introduced by the earlier
benchmarks.

The first test, called bench-kbuild.sh, times a kernel build. Time is in seconds.

			clean		anti-defrag
Kernel extract		15		16				
Kernel build		736		735

The second test is the output of portions of AIM9 for the vanilla
allocator and the modified one:

(Tests run with bench-aim9.sh from VMRegress 0.18)
2.6.14-mm2-clean
------------------------------------------------------------------------------------------------------------
 Test        Test        Elapsed  Iteration    Iteration          Operation
Number       Name      Time (sec)   Count   Rate (loops/sec)    Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
     1 creat-clo           60.03        959   15.97535        15975.35 File Creations and Closes/second
     2 page_test           60.02       4655   77.55748       131847.72 System Allocations & Pages/second
     3 brk_test            60.02       2195   36.57114       621709.43 System Memory Allocations/second
     4 jmp_test            60.00     264177 4402.95000      4402950.00 Non-local gotos/second
     5 signal_test         60.00       4982   83.03333        83033.33 Signal Traps/second
     6 exec_test           60.08        763   12.69973           63.50 Program Loads/second
     7 fork_test           60.06        977   16.26707         1626.71 Task Creations/second
     8 link_test           60.01       5315   88.56857         5579.82 Link/Unlink Pairs/second

2.6.14-mm2-mbuddy-v20
------------------------------------------------------------------------------------------------------------
 Test        Test        Elapsed  Iteration    Iteration          Operation
Number       Name      Time (sec)   Count   Rate (loops/sec)    Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
     1 creat-clo           60.04        961   16.00600        16006.00 File Creations and Closes/second
     2 page_test           60.01       4689   78.13698       132832.86 System Allocations & Pages/second
     3 brk_test            60.02       2200   36.65445       623125.62 System Memory Allocations/second
     4 jmp_test            60.00     264321 4405.35000      4405350.00 Non-local gotos/second
     5 signal_test         60.01       4943   82.36961        82369.61 Signal Traps/second
     6 exec_test           60.05        757   12.60616           63.03 Program Loads/second
     7 fork_test           60.04        975   16.23917         1623.92 Task Creations/second
     8 link_test           60.02       5317   88.58714         5580.99 Link/Unlink Pairs/second
------------------------------------------------------------------------------------------------------------

Difference in performance operations report generated by diff-aim9.sh
                   Clean   mbuddy-v20
                ---------- ----------
 1 creat-clo      15975.35   16006.00      30.65  0.19% File Creations and Closes/second
 2 page_test     131847.72  132832.86     985.14  0.75% System Allocations & Pages/second
 3 brk_test      621709.43  623125.62    1416.19  0.23% System Memory Allocations/second
 4 jmp_test     4402950.00 4405350.00    2400.00  0.05% Non-local gotos/second
 5 signal_test    83033.33   82369.61    -663.72 -0.80% Signal Traps/second
 6 exec_test         63.50      63.03      -0.47 -0.74% Program Loads/second
 7 fork_test       1626.71    1623.92      -2.79 -0.17% Task Creations/second
 8 link_test       5579.82    5580.99       1.17  0.02% Link/Unlink Pairs/second

The third benchmark tested CPU cache usage to make sure it was not
getting clobbered. The test was to render a large postscript file 10 times
and take the average. The result is:

2.6.14-mm2-clean:      Average: 7.414 real, 7.08 user, 0.08 sys
2.6.14-mm2-mbuddy-v20: Average: 7.412 real, 7.236 user, 0.084 sys

So there are no adverse cache effects. The last test shows that the
allocator can satisfy more high-order allocations, especially under load,
than the standard allocator. The test performs the following steps (a rough
sketch of the allocation module follows the list):

1. Start updatedb running in the background
2. Load a kernel module that tries to allocate high-order blocks on demand
3. Clean a kernel tree
4. Make 4 copies of the tree. As each copy finishes, a compile starts at -j2
5. Start compiling the primary tree
6. Sleep 1 minute while the 7 trees are being compiled
7. Use the kernel module to attempt 160 times to allocate a 2^10 block of pages
    - note, it only attempts 160 times, no matter how often it succeeds
    - An allocation is attempted every 1/10th of a second
    - Performance will get badly shot as it forces considerable amounts of
      pageout
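
For reference, the module does something along the lines of the sketch below.
The real tool is VMRegress 0.18 and also records which zone each allocation
came from; the names here are illustrative only and, unlike the real test,
this version frees each block immediately:

	#include <linux/module.h>
	#include <linux/mm.h>
	#include <linux/gfp.h>
	#include <linux/delay.h>

	static int __init highorder_test_init(void)
	{
		int attempt, success = 0;

		for (attempt = 0; attempt < 160; attempt++) {
			/* order-10 == 2^10 contiguous pages, HighMem preferred */
			struct page *page = alloc_pages(GFP_HIGHUSER, 10);

			if (page) {
				success++;
				__free_pages(page, 10);
			}
			msleep(100);	/* one attempt every 1/10th of a second */
		}
		printk(KERN_INFO "high-order test: %d/160 succeeded\n", success);
		return 0;
	}

	static void __exit highorder_test_exit(void) { }

	module_init(highorder_test_init);
	module_exit(highorder_test_exit);
	MODULE_LICENSE("GPL");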

This is a relatively light load that consumes almost all of physical memory
without putting either allocator under serious pressure or triggering
OOM. The results of the allocations under load (load average around 12) were:

2.6.14-mm2 Clean
Order:                 10
Allocation type:       HighMem
Attempted allocations: 160
Success allocs:        73
Failed allocs:         87
DMA zone allocs:       0
Normal zone allocs:    22
HighMem zone allocs:   51
% Success:            45

2.6.14-mm2 MBuddy V20
Order:                 10
Allocation type:       HighMem
Attempted allocations: 160
Success allocs:        96
Failed allocs:         64
DMA zone allocs:       0
Normal zone allocs:    30
HighMem zone allocs:   66
% Success:            60

It shows the placement policy is significantly better than the standard
allocator at satisfying hugetlb-sized allocations.  After the tests completed,
the standard allocator was able to allocate 78 order-10 pages and the modified
allocator allocated 102.  It is known that the success of large allocations
also depends on the location of per-cpu pages, but fixing that problem
is a separate issue.

The results show that the modified allocator has comparable speed and no
adverse cache effects, but is far less fragmented and in a better position
to satisfy high-order allocations.
-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags
  2005-11-15 16:49 [PATCH 0/5] Light Fragmentation Avoidance V20 Mel Gorman
@ 2005-11-15 16:49 ` Mel Gorman
  2005-11-15 23:00   ` Paul Jackson
  2005-11-15 16:49 ` [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2005-11-15 16:49 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, mingo, linux-kernel, nickpiggin, lhms-devel

This patch adds a flag __GFP_EASYRCLM.  Allocations using the __GFP_EASYRCLM
flag are expected to be easily reclaimed by syncing with backing storage (be
it a file or swap) or cleaning the buffers and discarding.
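
The effect at a call site is simply an extra modifier on the existing GFP
mask. For example, from the fs/exec.c and fs/inode.c hunks below:

	page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
	mapping_set_gfp_mask(mapping, GFP_HIGHUSER|__GFP_EASYRCLM);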

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/fs/buffer.c linux-2.6.14-mm2-001_antidefrag_flags/fs/buffer.c
--- linux-2.6.14-mm2-clean/fs/buffer.c	2005-11-13 21:22:24.000000000 +0000
+++ linux-2.6.14-mm2-001_antidefrag_flags/fs/buffer.c	2005-11-15 12:40:42.000000000 +0000
@@ -1113,7 +1113,8 @@ grow_dev_page(struct block_device *bdev,
 	struct page *page;
 	struct buffer_head *bh;
 
-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, index,
+				   GFP_NOFS|__GFP_EASYRCLM);
 	if (!page)
 		return NULL;
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/fs/compat.c linux-2.6.14-mm2-001_antidefrag_flags/fs/compat.c
--- linux-2.6.14-mm2-clean/fs/compat.c	2005-11-13 21:22:24.000000000 +0000
+++ linux-2.6.14-mm2-001_antidefrag_flags/fs/compat.c	2005-11-15 12:40:42.000000000 +0000
@@ -1345,7 +1345,7 @@ static int compat_copy_strings(int argc,
 			page = bprm->page[i];
 			new = 0;
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
 				bprm->page[i] = page;
 				if (!page) {
 					ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/fs/exec.c linux-2.6.14-mm2-001_antidefrag_flags/fs/exec.c
--- linux-2.6.14-mm2-clean/fs/exec.c	2005-11-13 21:22:24.000000000 +0000
+++ linux-2.6.14-mm2-001_antidefrag_flags/fs/exec.c	2005-11-15 12:40:42.000000000 +0000
@@ -238,7 +238,7 @@ static int copy_strings(int argc, char _
 			page = bprm->page[i];
 			new = 0;
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
 				bprm->page[i] = page;
 				if (!page) {
 					ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/fs/inode.c linux-2.6.14-mm2-001_antidefrag_flags/fs/inode.c
--- linux-2.6.14-mm2-clean/fs/inode.c	2005-11-13 21:22:24.000000000 +0000
+++ linux-2.6.14-mm2-001_antidefrag_flags/fs/inode.c	2005-11-15 12:40:42.000000000 +0000
@@ -146,7 +146,7 @@ static struct inode *alloc_inode(struct 
 		mapping->a_ops = &empty_aops;
  		mapping->host = inode;
 		mapping->flags = 0;
-		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+		mapping_set_gfp_mask(mapping, GFP_HIGHUSER|__GFP_EASYRCLM);
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/include/asm-i386/page.h linux-2.6.14-mm2-001_antidefrag_flags/include/asm-i386/page.h
--- linux-2.6.14-mm2-clean/include/asm-i386/page.h	2005-10-28 01:02:08.000000000 +0100
+++ linux-2.6.14-mm2-001_antidefrag_flags/include/asm-i386/page.h	2005-11-15 12:40:42.000000000 +0000
@@ -36,7 +36,8 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | __GFP_EASYRCLM, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 /*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/include/linux/gfp.h linux-2.6.14-mm2-001_antidefrag_flags/include/linux/gfp.h
--- linux-2.6.14-mm2-clean/include/linux/gfp.h	2005-11-13 21:22:26.000000000 +0000
+++ linux-2.6.14-mm2-001_antidefrag_flags/include/linux/gfp.h	2005-11-15 12:40:42.000000000 +0000
@@ -50,6 +50,12 @@ struct vm_area_struct;
 #define __GFP_HARDWALL   ((__force gfp_t)0x40000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_VALID	((__force gfp_t)0x80000000u) /* valid GFP flags */
 
+/*
+ * Allocation type modifier
+ * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
+ */
+#define __GFP_EASYRCLM   0x80000u  /* User and other easily reclaimed pages */
+
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
@@ -57,7 +63,8 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)
+			__GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL| \
+			__GFP_EASYRCLM)
 
 #define GFP_ATOMIC	(__GFP_VALID | __GFP_HIGH)
 #define GFP_NOIO	(__GFP_VALID | __GFP_WAIT)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/include/linux/highmem.h linux-2.6.14-mm2-001_antidefrag_flags/include/linux/highmem.h
--- linux-2.6.14-mm2-clean/include/linux/highmem.h	2005-10-28 01:02:08.000000000 +0100
+++ linux-2.6.14-mm2-001_antidefrag_flags/include/linux/highmem.h	2005-11-15 12:40:42.000000000 +0000
@@ -47,7 +47,8 @@ static inline void clear_user_highpage(s
 static inline struct page *
 alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
 {
-	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+	struct page *page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+							vma, vaddr);
 
 	if (page)
 		clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/mm/memory.c linux-2.6.14-mm2-001_antidefrag_flags/mm/memory.c
--- linux-2.6.14-mm2-clean/mm/memory.c	2005-11-13 21:22:26.000000000 +0000
+++ linux-2.6.14-mm2-001_antidefrag_flags/mm/memory.c	2005-11-15 12:40:42.000000000 +0000
@@ -1346,7 +1346,8 @@ static int do_wp_page(struct mm_struct *
 		if (!new_page)
 			goto oom;
 	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+							vma, address);
 		if (!new_page)
 			goto oom;
 		copy_user_highpage(new_page, old_page, address);
@@ -1914,7 +1915,8 @@ retry:
 
 			if (unlikely(anon_vma_prepare(vma)))
 				goto oom;
-			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+								vma, address);
 			if (!page)
 				goto oom;
 			copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/mm/shmem.c linux-2.6.14-mm2-001_antidefrag_flags/mm/shmem.c
--- linux-2.6.14-mm2-clean/mm/shmem.c	2005-11-13 21:22:27.000000000 +0000
+++ linux-2.6.14-mm2-001_antidefrag_flags/mm/shmem.c	2005-11-15 12:40:42.000000000 +0000
@@ -906,7 +906,7 @@ shmem_alloc_page(gfp_t gfp, struct shmem
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO | __GFP_EASYRCLM, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -921,7 +921,7 @@ shmem_swapin(struct shmem_inode_info *in
 static inline struct page *
 shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
 {
-	return alloc_page(gfp | __GFP_ZERO);
+	return alloc_page(gfp | __GFP_ZERO | __GFP_EASYRCLM);
 }
 #endif
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-clean/mm/swap_state.c linux-2.6.14-mm2-001_antidefrag_flags/mm/swap_state.c
--- linux-2.6.14-mm2-clean/mm/swap_state.c	2005-11-13 21:22:27.000000000 +0000
+++ linux-2.6.14-mm2-001_antidefrag_flags/mm/swap_state.c	2005-11-15 12:40:42.000000000 +0000
@@ -341,7 +341,8 @@ struct page *read_swap_cache_async(swp_e
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
-			new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+			new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+							vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
 		}


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-15 16:49 [PATCH 0/5] Light Fragmentation Avoidance V20 Mel Gorman
  2005-11-15 16:49 ` [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags Mel Gorman
@ 2005-11-15 16:49 ` Mel Gorman
  2005-11-15 23:36   ` Andi Kleen
  2005-11-15 16:50 ` [PATCH 3/5] Light Fragmentation Avoidance V20: 003_fragcore Mel Gorman
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2005-11-15 16:49 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, mingo, lhms-devel, linux-kernel, nickpiggin

This patch adds a "usemap" to the allocator. Each bit in the usemap indicates
whether a block of 2^(MAX_ORDER-1) pages are being used for kernel or
easily-reclaimed allocations. This enumerates two types of allocations;

RCLM_NORLM:	These are kernel allocations that cannot be reclaimed
		on demand.
RCLM_EASY:	These are pages allocated with __GFP_EASYRCLM flag set. They are
		considered to be user and other easily reclaimed pages such
		as buffers

gfpflags_to_rclmtype() converts gfp_flags to their corresponding RCLM_TYPE.
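
As a rough illustration of the overhead, assuming the i386 defaults of
MAX_ORDER=11 and 4KiB pages, each usemap bit covers 2^10 pages (4MiB), so a
1GiB zone needs about 256 bits of usemap, or 32 bytes, before the rounding
done by usemap_size() below.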

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-001_antidefrag_flags/include/linux/mmzone.h linux-2.6.14-mm2-002_usemap/include/linux/mmzone.h
--- linux-2.6.14-mm2-001_antidefrag_flags/include/linux/mmzone.h	2005-11-13 21:22:26.000000000 +0000
+++ linux-2.6.14-mm2-002_usemap/include/linux/mmzone.h	2005-11-15 12:42:15.000000000 +0000
@@ -21,6 +21,11 @@
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
+#define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
+
+#define RCLM_NORCLM   0
+#define RCLM_EASY     1
+#define RCLM_TYPES    2
 
 struct free_area {
 	struct list_head	free_list;
@@ -147,6 +152,12 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+#ifndef CONFIG_SPARSEMEM
+	/*
+	 * The map tracks what each 2^MAX_ORDER-1 sized block is being used for.
+	 */
+	unsigned long		*free_area_usemap;
+#endif
 
 	ZONE_PADDING(_pad1_)
 
@@ -502,9 +513,14 @@ extern struct pglist_data contig_page_da
 #define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)
 #define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION-1))
 
+#define FREE_AREA_BITS		64
+
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #endif
+#if ((SECTION_SIZE_BITS - MAX_ORDER)) > FREE_AREA_BITS
+#error free_area_usemap is not big enough
+#endif
 
 struct page;
 struct mem_section {
@@ -517,6 +533,7 @@ struct mem_section {
 	 * before using it wrong.
 	 */
 	unsigned long section_mem_map;
+	DECLARE_BITMAP(free_area_usemap, FREE_AREA_BITS);
 };
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -585,6 +602,18 @@ static inline struct mem_section *__pfn_
 	return __nr_to_section(pfn_to_section_nr(pfn));
 }
 
+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+						unsigned long pfn)
+{
+	return &__pfn_to_section(pfn)->free_area_usemap[0];
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+	pfn &= (PAGES_PER_SECTION-1);
+	return (pfn >> (MAX_ORDER-1));
+}
+
 #define pfn_to_page(pfn) 						\
 ({ 									\
 	unsigned long __pfn = (pfn);					\
@@ -622,6 +651,17 @@ void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+						unsigned long pfn)
+{
+	return zone->free_area_usemap;
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+	pfn = pfn - zone->zone_start_pfn;
+	return (pfn >> (MAX_ORDER-1));
+}
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-001_antidefrag_flags/mm/page_alloc.c linux-2.6.14-mm2-002_usemap/mm/page_alloc.c
--- linux-2.6.14-mm2-001_antidefrag_flags/mm/page_alloc.c	2005-11-13 21:22:26.000000000 +0000
+++ linux-2.6.14-mm2-002_usemap/mm/page_alloc.c	2005-11-15 12:42:15.000000000 +0000
@@ -68,6 +68,19 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
 
 EXPORT_SYMBOL(totalram_pages);
 
+static inline int get_pageblock_type(struct zone *zone, struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	return !!test_bit(pfn_to_bitidx(zone, pfn),
+			pfn_to_usemap(zone, pfn));
+}
+
+static inline void change_pageblock_type(struct zone *zone, struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	__change_bit(pfn_to_bitidx(zone, pfn), pfn_to_usemap(zone, pfn));
+}
+
 /*
  * Used by page_zone() to look up the address of the struct zone whose
  * id is encoded in the upper bits of page->flags
@@ -1847,6 +1860,38 @@ inline void setup_pageset(struct per_cpu
 	INIT_LIST_HEAD(&pcp->list);
 }
 
+#ifndef CONFIG_SPARSEMEM
+#define roundup(x, y) ((((x)+((y)-1))/(y))*(y))
+/*
+ * Calculate the size of the zone->usemap in bytes rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
+ * Then figure 1 RCLM_TYPE worth of bits per MAX_ORDER-1, finally round up
+ * what is now in bits to nearest long in bits, then return it in bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+	unsigned long usemapsize;
+
+	usemapsize = roundup(zonesize, PAGES_PER_MAXORDER);
+	usemapsize = usemapsize >> (MAX_ORDER-1);
+	usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+	return usemapsize / 8;
+}
+
+static void __init setup_usemap(struct pglist_data *pgdat,
+				struct zone *zone, unsigned long zonesize)
+{
+	unsigned long usemapsize = usemap_size(zonesize);
+	zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+	memset(zone->free_area_usemap, ~RCLM_NORCLM, usemapsize);
+	memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize/2);
+}
+#else
+static void inline setup_usemap(struct pglist_data *pgdat,
+				struct zone *zone, unsigned long zonesize) {}
+#endif /* CONFIG_SPARSEMEM */
+
 #ifdef CONFIG_NUMA
 /*
  * Boot pageset table. One per cpu which is going to be used for all
@@ -2060,6 +2105,7 @@ static void __init free_area_init_core(s
 		zonetable_add(zone, nid, j, zone_start_pfn, size);
 		init_currently_empty_zone(zone, zone_start_pfn, size);
 		zone_start_pfn += size;
+		setup_usemap(pgdat, zone, size);
 	}
 }
 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 3/5] Light Fragmentation Avoidance V20: 003_fragcore
  2005-11-15 16:49 [PATCH 0/5] Light Fragmentation Avoidance V20 Mel Gorman
  2005-11-15 16:49 ` [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags Mel Gorman
  2005-11-15 16:49 ` [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap Mel Gorman
@ 2005-11-15 16:50 ` Mel Gorman
  2005-11-16  2:35   ` KAMEZAWA Hiroyuki
  2005-11-15 16:50 ` [PATCH 4/5] Light Fragmentation Avoidance V20: 004_percpu Mel Gorman
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2005-11-15 16:50 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, mingo, linux-kernel, nickpiggin, lhms-devel

This patch adds the core of the anti-fragmentation strategy. It works by
grouping related allocation types together so that large groups of pages that
may be reclaimed are placed near each other. The zone->free_area array is
split into RCLM_TYPES arrays of free lists.
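
For an allocation, the type derived from the GFP flags selects which of the
arrays is searched first; roughly, paraphrasing the __rmqueue() change below:

	alloctype = gfpflags_to_alloctype(gfp_flags);
	area = &zone->free_area_lists[alloctype][current_order];

If the preferred lists are empty, __rmqueue_fallback() takes the largest
available block from the other type's lists instead.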

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-002_usemap/include/linux/mmzone.h linux-2.6.14-mm2-003_fragcore/include/linux/mmzone.h
--- linux-2.6.14-mm2-002_usemap/include/linux/mmzone.h	2005-11-15 12:42:15.000000000 +0000
+++ linux-2.6.14-mm2-003_fragcore/include/linux/mmzone.h	2005-11-15 12:43:41.000000000 +0000
@@ -27,6 +27,12 @@
 #define RCLM_EASY     1
 #define RCLM_TYPES    2
 
+#define for_each_rclmtype(type) \
+	for (type = 0; type < RCLM_TYPES; type++)
+#define for_each_rclmtype_order(type, order) \
+	for (order = 0; order < MAX_ORDER; order++) \
+		for (type = 0; type < RCLM_TYPES; type++)
+
 struct free_area {
 	struct list_head	free_list;
 	unsigned long		nr_free;
@@ -150,7 +156,7 @@ struct zone {
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
 #endif
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area_lists[RCLM_TYPES][MAX_ORDER];
 
 #ifndef CONFIG_SPARSEMEM
 	/*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-002_usemap/mm/page_alloc.c linux-2.6.14-mm2-003_fragcore/mm/page_alloc.c
--- linux-2.6.14-mm2-002_usemap/mm/page_alloc.c	2005-11-15 12:42:15.000000000 +0000
+++ linux-2.6.14-mm2-003_fragcore/mm/page_alloc.c	2005-11-15 12:44:27.000000000 +0000
@@ -68,6 +68,11 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
 
 EXPORT_SYMBOL(totalram_pages);
 
+static inline int gfpflags_to_alloctype(unsigned long gfp_flags)
+{
+	return ((gfp_flags & __GFP_EASYRCLM) != 0);
+}
+
 static inline int get_pageblock_type(struct zone *zone, struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
@@ -272,6 +277,16 @@ __find_combined_index(unsigned long page
 }
 
 /*
+ * Return the free list for a given page within a zone
+ */
+static inline struct free_area *__page_find_freelist(struct zone *zone,
+							struct page *page,
+							int order)
+{
+	return &zone->free_area_lists[get_pageblock_type(zone, page)][order];
+}
+
+/*
  * This function checks whether a page is free && is the buddy
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
@@ -318,6 +333,7 @@ static inline void __free_pages_bulk (st
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
+	struct free_area *freelist;
 
 	if (unlikely(order))
 		destroy_compound_page(page, order);
@@ -327,10 +343,11 @@ static inline void __free_pages_bulk (st
 	BUG_ON(page_idx & (order_size - 1));
 	BUG_ON(bad_range(zone, page));
 
+	freelist = __page_find_freelist(zone, page, order);
+
 	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		unsigned long combined_idx;
-		struct free_area *area;
 		struct page *buddy;
 
 		combined_idx = __find_combined_index(page_idx, order);
@@ -341,16 +358,16 @@ static inline void __free_pages_bulk (st
 		if (!page_is_buddy(buddy, order))
 			break;		/* Move the buddy up one level. */
 		list_del(&buddy->lru);
-		area = zone->free_area + order;
-		area->nr_free--;
+		freelist->nr_free--;
 		rmv_page_order(buddy);
 		page = page + (combined_idx - page_idx);
 		page_idx = combined_idx;
 		order++;
+		freelist++;
 	}
 	set_page_order(page, order);
-	list_add(&page->lru, &zone->free_area[order].free_list);
-	zone->free_area[order].nr_free++;
+	list_add(&page->lru, &freelist->free_list);
+	freelist->nr_free++;
 }
 
 static inline void free_pages_check(const char *function, struct page *page)
@@ -507,18 +524,59 @@ static void prep_new_page(struct page *p
 	kernel_map_pages(page, 1 << order, 1);
 }
 
+/* Remove an element from the buddy allocator from the fallback list */
+static struct page *__rmqueue_fallback(struct zone *zone, int order,
+							int alloctype)
+{
+	struct free_area * area;
+	int current_order;
+	struct page *page;
+
+	/* Find the largest possible block of pages in the other list */
+	alloctype = !alloctype;
+	for (current_order = MAX_ORDER-1; current_order >= order;
+						--current_order) {
+		area = &(zone->free_area_lists[alloctype][current_order]);
+ 		if (list_empty(&area->free_list))
+ 			continue;
+
+		page = list_entry(area->free_list.next, struct page, lru);
+		area->nr_free--;
+
+		/*
+		 * If breaking a large block of pages, place the buddies
+		 * on the preferred allocation list
+		 */
+		if (unlikely(current_order >= MAX_ORDER / 2)) {
+			alloctype = !alloctype;
+			change_pageblock_type(zone, page);
+			area = &zone->free_area_lists[alloctype][current_order];
+		}
+
+		list_del(&page->lru);
+		rmv_page_order(page);
+		zone->free_pages -= 1UL << order;
+		return expand(zone, page, order, current_order, area);
+
+	}
+
+	return NULL;
+}
+
 /* 
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+					int alloctype)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
 
+	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = &(zone->free_area_lists[alloctype][current_order]);
 		if (list_empty(&area->free_list))
 			continue;
 
@@ -530,16 +588,18 @@ static struct page *__rmqueue(struct zon
 		return expand(zone, page, order, current_order, area);
 	}
 
-	return NULL;
+	return __rmqueue_fallback(zone, order, alloctype);
 }
 
+
 /* 
  * Obtain a specified number of elements from the buddy allocator, all under
  * a single hold of the lock, for efficiency.  Add them to the supplied list.
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order, 
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list,
+			int alloctype)
 {
 	unsigned long flags;
 	int i;
@@ -548,7 +608,7 @@ static int rmqueue_bulk(struct zone *zon
 	
 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, alloctype);
 		if (page == NULL)
 			break;
 		allocated++;
@@ -614,7 +674,7 @@ static void __drain_pages(unsigned int c
 void mark_free_pages(struct zone *zone)
 {
 	unsigned long zone_pfn, flags;
-	int order;
+	int order, t;
 	struct list_head *curr;
 
 	if (!zone->spanned_pages)
@@ -624,14 +684,13 @@ void mark_free_pages(struct zone *zone)
 	for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
 		ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));
 
-	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
+	for_each_rclmtype_order(t, order) {
+		list_for_each(curr,&zone->free_area_lists[t][order].free_list) {
 			unsigned long start_pfn, i;
-
 			start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
-
 			for (i=0; i < (1<<order); i++)
 				SetPageNosaveFree(pfn_to_page(start_pfn+i));
+		}
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -732,6 +791,7 @@ buffered_rmqueue(struct zone *zone, int 
 	unsigned long flags;
 	struct page *page = NULL;
 	int cold = !!(gfp_flags & __GFP_COLD);
+	int alloctype = gfpflags_to_alloctype(gfp_flags);
 
 	if (order == 0) {
 		struct per_cpu_pages *pcp;
@@ -740,7 +800,8 @@ buffered_rmqueue(struct zone *zone, int 
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list,
+						alloctype);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -750,7 +811,7 @@ buffered_rmqueue(struct zone *zone, int 
 		put_cpu();
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, alloctype);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
@@ -781,25 +842,32 @@ int zone_watermark_ok(struct zone *z, in
 		      int classzone_idx, int alloc_flags)
 {
 	/* free_pages my go negative - that's OK */
-	long min = mark, free_pages = z->free_pages - (1 << order) + 1;
-	int o;
+	long free_pages = z->free_pages - (1 << order) + 1;
+	long min;
+	int o,t;
 
 	if (alloc_flags & ALLOC_HIGH)
-		min -= min / 2;
+		mark -= mark / 2;
 	if (alloc_flags & ALLOC_HARDER)
-		min -= min / 4;
+		mark -= mark / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= mark + z->lowmem_reserve[classzone_idx])
 		goto out_failed;
-	for (o = 0; o < order; o++) {
-		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
+	for_each_rclmtype(t) {
+		min = mark;
+		for (o = 0; o < order; o++) {
+			/*
+			 * At the next order, this order's pages become
+			 * unavailable
+			 */
+			free_pages -= z->free_area_lists[t][o].nr_free << o;
 
-		/* Require fewer higher order pages to be free */
-		min >>= 1;
+			/* Require fewer higher order pages to be free */
+			min >>= 1;
 
-		if (free_pages <= min)
-			goto out_failed;
+			if (free_pages <= min)
+				goto out_failed;
+			}
 	}
 
 	return 1;
@@ -1380,6 +1448,7 @@ void show_free_areas(void)
 	unsigned long inactive;
 	unsigned long free;
 	struct zone *zone;
+	int type;
 
 	for_each_zone(zone) {
 		show_node(zone);
@@ -1459,7 +1528,9 @@ void show_free_areas(void)
 	}
 
 	for_each_zone(zone) {
- 		unsigned long nr, flags, order, total = 0;
+ 		unsigned long nr = 0;
+		unsigned long total = 0;
+		unsigned long flags,order;
 
 		show_node(zone);
 		printk("%s: ", zone->name);
@@ -1469,10 +1540,18 @@ void show_free_areas(void)
 		}
 
 		spin_lock_irqsave(&zone->lock, flags);
-		for (order = 0; order < MAX_ORDER; order++) {
-			nr = zone->free_area[order].nr_free;
+		for_each_rclmtype_order(type, order) {
+			nr += zone->free_area_lists[type][order].nr_free;
 			total += nr << order;
-			printk("%lu*%lukB ", nr, K(1UL) << order);
+
+			/*
+			 * If type had reached RCLM_TYPE, the free pages
+			 * for this order have been summed up
+			 */
+			if (type == RCLM_TYPES-1) {
+				printk("%lu*%lukB ", nr, K(1UL) << order);
+				nr = 0;
+			}
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
 		printk("= %lukB\n", K(total));
@@ -1782,9 +1861,14 @@ void zone_init_free_lists(struct pglist_
 				unsigned long size)
 {
 	int order;
-	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
-		zone->free_area[order].nr_free = 0;
+	int type;
+	struct free_area *area;
+
+	/* Initialse the three size ordered lists of free_areas */
+	for_each_rclmtype_order(type, order) {
+		area = &(zone->free_area_lists[type][order]);
+		INIT_LIST_HEAD(&area->free_list);
+		area->nr_free = 0;
 	}
 }
 
@@ -2199,16 +2283,26 @@ static int frag_show(struct seq_file *m,
 	struct zone *zone;
 	struct zone *node_zones = pgdat->node_zones;
 	unsigned long flags;
-	int order;
+	int order, t;
+	struct free_area *area;
+	unsigned long nr_bufs = 0;
 
 	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
 		if (!zone->present_pages)
 			continue;
 
 		spin_lock_irqsave(&zone->lock, flags);
-		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-		for (order = 0; order < MAX_ORDER; ++order)
-			seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+		seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+		for_each_rclmtype_order(t, order) {
+			area = &(zone->free_area_lists[t][order]);
+			nr_bufs += area->nr_free;
+
+			if (t == RCLM_TYPES-1) {
+				seq_printf(m, "%6lu ", nr_bufs);
+				nr_bufs = 0;
+			}
+		}
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 4/5] Light Fragmentation Avoidance V20: 004_percpu
  2005-11-15 16:49 [PATCH 0/5] Light Fragmentation Avoidance V20 Mel Gorman
                   ` (2 preceding siblings ...)
  2005-11-15 16:50 ` [PATCH 3/5] Light Fragmentation Avoidance V20: 003_fragcore Mel Gorman
@ 2005-11-15 16:50 ` Mel Gorman
  2005-11-15 23:24   ` Paul Jackson
  2005-11-15 16:50 ` [PATCH 5/5] Light Fragmentation Avoidance V20: 005_configurable Mel Gorman
  2005-11-15 22:54 ` [PATCH 0/5] Light Fragmentation Avoidance V20 Paul Jackson
  5 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2005-11-15 16:50 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, mingo, lhms-devel, linux-kernel, nickpiggin

The freelists for each allocation type can slowly become corrupted via the
per-cpu lists. Consider what happens in the following sequence:

1. A 2^(MAX_ORDER-1) block is reserved for __GFP_EASYRCLM pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation

This results in a kernel page sitting in the middle of an RCLM_EASY region,
which means that over long periods of time the anti-fragmentation scheme
slowly degrades to the standard allocator.

This patch divides each per-cpu list into RCLM_TYPES lists, one per
allocation type.
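
The effect on the hot/cold paths, paraphrasing the free_hot_cold_page() and
buffered_rmqueue() hunks below, is that the per-cpu list is always chosen
by type:

	/* free path: route the page to the list matching its block's type */
	pindex = get_pageblock_type(zone, page);
	list_add(&page->lru, &pcp->list[pindex]);
	pcp->count[pindex]++;

	/* alloc path: only take pages from the list matching the request */
	if (pcp->count[alloctype]) {
		page = list_entry(pcp->list[alloctype].next, struct page, lru);
		list_del(&page->lru);
		pcp->count[alloctype]--;
	}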

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-003_fragcore/include/linux/mmzone.h linux-2.6.14-mm2-004_percpu/include/linux/mmzone.h
--- linux-2.6.14-mm2-003_fragcore/include/linux/mmzone.h	2005-11-15 12:43:41.000000000 +0000
+++ linux-2.6.14-mm2-004_percpu/include/linux/mmzone.h	2005-11-15 12:44:23.000000000 +0000
@@ -56,11 +56,11 @@ struct zone_padding {
 #endif
 
 struct per_cpu_pages {
-	int count;		/* number of pages in the list */
+	int count[RCLM_TYPES];	/* Number of pages on the lists */
 	int low;		/* low watermark, refill needed */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	struct list_head list;	/* the list of pages */
+	struct list_head list[RCLM_TYPES]; /* the lists of pages */
 };
 
 struct per_cpu_pageset {
@@ -75,6 +75,9 @@ struct per_cpu_pageset {
 #endif
 } ____cacheline_aligned_in_smp;
 
+/* Helpers for per_cpu_pages */
+#define pcp_count(pcp) (pcp.count[RCLM_NORCLM] + pcp.count[RCLM_EASY])
+
 #ifdef CONFIG_NUMA
 #define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
 #else
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-003_fragcore/mm/page_alloc.c linux-2.6.14-mm2-004_percpu/mm/page_alloc.c
--- linux-2.6.14-mm2-003_fragcore/mm/page_alloc.c	2005-11-15 12:44:27.000000000 +0000
+++ linux-2.6.14-mm2-004_percpu/mm/page_alloc.c	2005-11-15 12:44:23.000000000 +0000
@@ -623,7 +623,7 @@ static int rmqueue_bulk(struct zone *zon
 void drain_remote_pages(void)
 {
 	struct zone *zone;
-	int i;
+	int i, pindex;
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -639,9 +639,16 @@ void drain_remote_pages(void)
 			struct per_cpu_pages *pcp;
 
 			pcp = &pset->pcp[i];
-			if (pcp->count)
-				pcp->count -= free_pages_bulk(zone, pcp->count,
-						&pcp->list, 0);
+			for_each_rclmtype(pindex) {
+				if (!pcp->count[pindex])
+					continue;
+
+				/* Try remove all pages from the pcpu list */
+				pcp->count[pindex] -=
+					free_pages_bulk(zone,
+						pcp->count[pindex],
+						&pcp->list[pindex], 0);
+			}
 		}
 	}
 	local_irq_restore(flags);
@@ -652,7 +659,7 @@ void drain_remote_pages(void)
 static void __drain_pages(unsigned int cpu)
 {
 	struct zone *zone;
-	int i;
+	int i, pindex;
 
 	for_each_zone(zone) {
 		struct per_cpu_pageset *pset;
@@ -662,8 +669,16 @@ static void __drain_pages(unsigned int c
 			struct per_cpu_pages *pcp;
 
 			pcp = &pset->pcp[i];
-			pcp->count -= free_pages_bulk(zone, pcp->count,
-						&pcp->list, 0);
+			for_each_rclmtype(pindex) {
+				if (!pcp->count[pindex])
+					continue;
+
+				/* Try remove all pages from the pcpu list */
+				pcp->count[pindex] -=
+					free_pages_bulk(zone,
+						pcp->count[pindex],
+						&pcp->list[pindex], 0);
+			}
 		}
 	}
 }
@@ -743,6 +758,7 @@ static void fastcall free_hot_cold_page(
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	int pindex;
 
 	arch_free_page(page, 0);
 
@@ -752,11 +768,14 @@ static void fastcall free_hot_cold_page(
 		page->mapping = NULL;
 	free_pages_check(__FUNCTION__, page);
 	pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
+
 	local_irq_save(flags);
-	list_add(&page->lru, &pcp->list);
-	pcp->count++;
-	if (pcp->count >= pcp->high)
-		pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
+	pindex = get_pageblock_type(zone, page);
+	list_add(&page->lru, &pcp->list[pindex]);
+	pcp->count[pindex]++;
+	if (pcp->count[pindex] >= pcp->high)
+		pcp->count[pindex] -= free_pages_bulk(zone, pcp->batch,
+				&pcp->list[pindex], 0);
 	local_irq_restore(flags);
 	put_cpu();
 }
@@ -798,14 +817,16 @@ buffered_rmqueue(struct zone *zone, int 
 
 		pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 		local_irq_save(flags);
-		if (pcp->count <= pcp->low)
-			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list,
+		if (pcp->count[alloctype] <= pcp->low)
+			pcp->count[alloctype] += rmqueue_bulk(zone, 0,
+						pcp->batch,
+						&pcp->list[alloctype],
 						alloctype);
-		if (pcp->count) {
-			page = list_entry(pcp->list.next, struct page, lru);
+		if (pcp->count[alloctype]) {
+			page = list_entry(pcp->list[alloctype].next,
+					struct page, lru);
 			list_del(&page->lru);
-			pcp->count--;
+			pcp->count[alloctype]--;
 		}
 		local_irq_restore(flags);
 		put_cpu();
@@ -847,9 +868,9 @@ int zone_watermark_ok(struct zone *z, in
 	int o,t;
 
 	if (alloc_flags & ALLOC_HIGH)
-		mark -= mark / 2;
+		mark /= 2;
 	if (alloc_flags & ALLOC_HARDER)
-		mark -= mark / 4;
+		mark /= 4;
 
 	if (free_pages <= mark + z->lowmem_reserve[classzone_idx])
 		goto out_failed;
@@ -861,13 +882,13 @@ int zone_watermark_ok(struct zone *z, in
 			 * unavailable
 			 */
 			free_pages -= z->free_area_lists[t][o].nr_free << o;
+		}
 
-			/* Require fewer higher order pages to be free */
-			min >>= 1;
+		/* Require fewer higher order pages to be free */
+		min >>= 1;
 
-			if (free_pages <= min)
-				goto out_failed;
-			}
+		if (free_pages <= min)
+			goto out_failed;
 	}
 
 	return 1;
@@ -1472,7 +1493,7 @@ void show_free_areas(void)
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch,
-					pageset->pcp[temperature].count);
+					pcp_count(pageset->pcp[temperature]));
 		}
 	}
 
@@ -1930,18 +1951,23 @@ inline void setup_pageset(struct per_cpu
 	memset(p, 0, sizeof(*p));
 
 	pcp = &p->pcp[0];		/* hot */
-	pcp->count = 0;
+	pcp->count[RCLM_NORCLM] = 0;
+	pcp->count[RCLM_EASY] = 0;
 	pcp->low = 0;
-	pcp->high = 6 * batch;
+	pcp->high = 3 * batch;
 	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
+	INIT_LIST_HEAD(&pcp->list[RCLM_NORCLM]);
+	INIT_LIST_HEAD(&pcp->list[RCLM_EASY]);
 
 	pcp = &p->pcp[1];		/* cold*/
-	pcp->count = 0;
+
+	pcp->count[RCLM_NORCLM] = 0;
+	pcp->count[RCLM_EASY] = 0;
 	pcp->low = 0;
-	pcp->high = 2 * batch;
+	pcp->high = batch;
 	pcp->batch = max(1UL, batch/2);
-	INIT_LIST_HEAD(&pcp->list);
+	INIT_LIST_HEAD(&pcp->list[RCLM_NORCLM]);
+	INIT_LIST_HEAD(&pcp->list[RCLM_EASY]);
 }
 
 #ifndef CONFIG_SPARSEMEM
@@ -2381,7 +2407,7 @@ static int zoneinfo_show(struct seq_file
 					   "\n              high:  %i"
 					   "\n              batch: %i",
 					   i, j,
-					   pageset->pcp[j].count,
+					   pcp_count(pageset->pcp[j]),
 					   pageset->pcp[j].low,
 					   pageset->pcp[j].high,
 					   pageset->pcp[j].batch);


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 5/5] Light Fragmentation Avoidance V20: 005_configurable
  2005-11-15 16:49 [PATCH 0/5] Light Fragmentation Avoidance V20 Mel Gorman
                   ` (3 preceding siblings ...)
  2005-11-15 16:50 ` [PATCH 4/5] Light Fragmentation Avoidance V20: 004_percpu Mel Gorman
@ 2005-11-15 16:50 ` Mel Gorman
  2005-11-15 23:39   ` Andi Kleen
  2005-11-15 22:54 ` [PATCH 0/5] Light Fragmentation Avoidance V20 Paul Jackson
  5 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2005-11-15 16:50 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, mingo, linux-kernel, nickpiggin, lhms-devel

The anti-defragmentation strategy has memory overhead. This patch allows
the strategy to be disabled for small memory systems, or when it is known that
the workload suffers because of it. It also serves to show where the
anti-defrag strategy interacts with the standard buddy allocator.
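
To try it, the option (added to init/Kconfig below) just needs to be enabled
in the kernel configuration:

	CONFIG_PAGEALLOC_ANTIDEFRAG=y

With it disabled, __GFP_EASYRCLM becomes 0, RCLM_TYPES collapses to 1 and the
usemap is not allocated, so the allocator reverts to its standard behaviour.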

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-004_percpu/include/linux/gfp.h linux-2.6.14-mm2-005_configurable/include/linux/gfp.h
--- linux-2.6.14-mm2-004_percpu/include/linux/gfp.h	2005-11-15 12:40:42.000000000 +0000
+++ linux-2.6.14-mm2-005_configurable/include/linux/gfp.h	2005-11-15 12:45:06.000000000 +0000
@@ -54,7 +54,11 @@ struct vm_area_struct;
  * Allocation type modifier
  * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
  */
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 #define __GFP_EASYRCLM   0x80000u  /* User and other easily reclaimed pages */
+#else
+#define __GFP_EASYRCLM   0x0u
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-004_percpu/include/linux/mmzone.h linux-2.6.14-mm2-005_configurable/include/linux/mmzone.h
--- linux-2.6.14-mm2-004_percpu/include/linux/mmzone.h	2005-11-15 12:44:23.000000000 +0000
+++ linux-2.6.14-mm2-005_configurable/include/linux/mmzone.h	2005-11-15 12:45:06.000000000 +0000
@@ -23,9 +23,17 @@
 #endif
 #define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 #define RCLM_NORCLM   0
 #define RCLM_EASY     1
 #define RCLM_TYPES    2
+#define BITS_PER_RCLM_TYPE 1
+#else
+#define RCLM_NORCLM   0
+#define RCLM_EASY     0
+#define RCLM_TYPES    1
+#define BITS_PER_RCLM_TYPE 0
+#endif
 
 #define for_each_rclmtype(type) \
 	for (type = 0; type < RCLM_TYPES; type++)
@@ -76,7 +84,11 @@ struct per_cpu_pageset {
 } ____cacheline_aligned_in_smp;
 
 /* Helpers for per_cpu_pages */
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 #define pcp_count(pcp) (pcp.count[RCLM_NORCLM] + pcp.count[RCLM_EASY])
+#else
+#define pcp_count(pcp) (pcp.count[RCLM_NORCLM])
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 #ifdef CONFIG_NUMA
 #define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-004_percpu/init/Kconfig linux-2.6.14-mm2-005_configurable/init/Kconfig
--- linux-2.6.14-mm2-004_percpu/init/Kconfig	2005-11-13 21:22:26.000000000 +0000
+++ linux-2.6.14-mm2-005_configurable/init/Kconfig	2005-11-15 12:45:06.000000000 +0000
@@ -392,6 +392,18 @@ config CC_ALIGN_FUNCTIONS
 	  32-byte boundary only if this can be done by skipping 23 bytes or less.
 	  Zero means use compiler's default.
 
+config PAGEALLOC_ANTIDEFRAG
+	bool "Avoid fragmentation in the page allocator"
+	def_bool n
+	help
+	  The standard allocator will fragment memory over time which means that
+	  high order allocations will fail even if kswapd is running. If this
+	  option is set, the allocator will try and group page types into
+	  two groups, kernel and easy reclaimable. The gain is a best effort
+	  attempt at lowering fragmentation which a few workloads care about.
+	  The loss is a more complex allocactor that performs slower.
+	  If unsure, say N
+
 config CC_ALIGN_LABELS
 	int "Label alignment" if EMBEDDED
 	default 0
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-mm2-004_percpu/mm/page_alloc.c linux-2.6.14-mm2-005_configurable/mm/page_alloc.c
--- linux-2.6.14-mm2-004_percpu/mm/page_alloc.c	2005-11-15 12:44:23.000000000 +0000
+++ linux-2.6.14-mm2-005_configurable/mm/page_alloc.c	2005-11-15 12:45:06.000000000 +0000
@@ -73,6 +73,7 @@ static inline int gfpflags_to_alloctype(
 	return ((gfp_flags & __GFP_EASYRCLM) != 0);
 }
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 static inline int get_pageblock_type(struct zone *zone, struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
@@ -85,6 +86,9 @@ static inline void change_pageblock_type
 	unsigned long pfn = page_to_pfn(page);
 	__change_bit(pfn_to_bitidx(zone, pfn), pfn_to_usemap(zone, pfn));
 }
+#else
+#define get_pageblock_type(zone, page) RCLM_NORCLM
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 /*
  * Used by page_zone() to look up the address of the struct zone whose
@@ -279,12 +283,16 @@ __find_combined_index(unsigned long page
 /*
  * Return the free list for a given page within a zone
  */
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 static inline struct free_area *__page_find_freelist(struct zone *zone,
 							struct page *page,
 							int order)
 {
 	return &zone->free_area_lists[get_pageblock_type(zone, page)][order];
 }
+#else
+#define __page_find_freelist(z, p, o) (&(z)->free_area_lists[RCLM_NORCLM][(o)])
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 /*
  * This function checks whether a page is free && is the buddy
@@ -524,6 +532,7 @@ static void prep_new_page(struct page *p
 	kernel_map_pages(page, 1 << order, 1);
 }
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 							int alloctype)
@@ -562,6 +571,13 @@ static struct page *__rmqueue_fallback(s
 
 	return NULL;
 }
+#else
+static struct page *__rmqueue_fallback(struct zone *zone, int order,
+							int alloctype)
+{
+	return NULL;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 /* 
  * Do the hard work of removing an element from the buddy allocator.
@@ -1985,6 +2001,7 @@ static unsigned long __init usemap_size(
 	usemapsize = roundup(zonesize, PAGES_PER_MAXORDER);
 	usemapsize = usemapsize >> (MAX_ORDER-1);
 	usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+	usemapsize *= BITS_PER_RCLM_TYPE;
 
 	return usemapsize / 8;
 }
@@ -1993,9 +2010,11 @@ static void __init setup_usemap(struct p
 				struct zone *zone, unsigned long zonesize)
 {
 	unsigned long usemapsize = usemap_size(zonesize);
-	zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
-	memset(zone->free_area_usemap, ~RCLM_NORCLM, usemapsize);
-	memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize/2);
+	if (usemapsize != 0) {
+		zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+		memset(zone->free_area_usemap, ~RCLM_NORCLM, usemapsize);
+		memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize/2);
+	}
 }
 #else
 static void inline setup_usemap(struct pglist_data *pgdat,

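For illustration, assuming MAX_ORDER = 11 (so PAGES_PER_MAXORDER = 1024)
and a zone of 262144 4 KiB pages (1 GiB), the usemap_size()/setup_usemap()
arithmetic above works out roughly as follows:

	usemapsize = roundup(262144, 1024);           /* 262144 pages            */
	usemapsize = usemapsize >> (MAX_ORDER-1);     /* 256 bits, one per block */
	usemapsize = roundup(256, 8 * sizeof(unsigned long)); /* still 256       */
	usemapsize *= BITS_PER_RCLM_TYPE;             /* 256 with antidefrag on  */
	return usemapsize / 8;                        /* 32 bytes                */

With CONFIG_PAGEALLOC_ANTIDEFRAG disabled, BITS_PER_RCLM_TYPE is 0, so
usemap_size() returns 0 and setup_usemap() allocates nothing.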

* Re: [PATCH 0/5] Light Fragmentation Avoidance V20
  2005-11-15 16:49 [PATCH 0/5] Light Fragmentation Avoidance V20 Mel Gorman
                   ` (4 preceding siblings ...)
  2005-11-15 16:50 ` [PATCH 5/5] Light Fragmentation Avoidance V20: 005_configurable Mel Gorman
@ 2005-11-15 22:54 ` Paul Jackson
  2005-11-16  1:34   ` Mel Gorman
  5 siblings, 1 reply; 31+ messages in thread
From: Paul Jackson @ 2005-11-15 22:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

I'm sure you've stated this before, but could you repeat it?

What's the driving motivation for this, and what's the essential
capability required?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags
  2005-11-15 16:49 ` [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags Mel Gorman
@ 2005-11-15 23:00   ` Paul Jackson
  2005-11-15 23:04     ` Randy.Dunlap
  2005-11-16  1:36     ` Mel Gorman
  0 siblings, 2 replies; 31+ messages in thread
From: Paul Jackson @ 2005-11-15 23:00 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, mingo, linux-kernel, nickpiggin, lhms-devel

Mel wrote:
>  #define __GFP_VALID	((__force gfp_t)0x80000000u) /* valid GFP flags */
>  
> +/*
> + * Allocation type modifier
> + * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
> + */
> +#define __GFP_EASYRCLM   0x80000u  /* User and other easily reclaimed pages */
> +

How about fitting the style (casts, just one line) of the other flags,
so that these added six lines become instead just the one line:

   #define __GFP_EASYRCLM   ((__force gfp_t)0x80000u)  /* easily reclaimed pages */

(Yeah - it was probably me that asked for -more- comments sometime in
the past - consistency is not my strong suit ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags
  2005-11-15 23:00   ` Paul Jackson
@ 2005-11-15 23:04     ` Randy.Dunlap
  2005-11-16  1:36     ` Mel Gorman
  1 sibling, 0 replies; 31+ messages in thread
From: Randy.Dunlap @ 2005-11-15 23:04 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Mel Gorman, linux-mm, mingo, linux-kernel, nickpiggin, lhms-devel

On Tue, 15 Nov 2005, Paul Jackson wrote:

> Mel wrote:
> >  #define __GFP_VALID	((__force gfp_t)0x80000000u) /* valid GFP flags */
> >
> > +/*
> > + * Allocation type modifier
> > + * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
> > + */
> > +#define __GFP_EASYRCLM   0x80000u  /* User and other easily reclaimed pages */
> > +
>
> How about fitting the style (casts, just one line) of the other flags,
> so that these added six lines become instead just the one line:
>
>    #define __GFP_EASYRCLM   ((__force gfp_t)0x80000u)  /* easily reclaimed pages */
>
> (Yeah - it was probably me that asked for -more- comments sometime in
> the past - consistency is not my strong suit ;).

Conversely, if you are going to go to the effort of lots of docs,
please do it in kernel-doc format.
  Documentation/kernel-doc-nano-HOWTO.txt

-- 
~Randy


* Re: [PATCH 4/5] Light Fragmentation Avoidance V20: 004_percpu
  2005-11-15 16:50 ` [PATCH 4/5] Light Fragmentation Avoidance V20: 004_percpu Mel Gorman
@ 2005-11-15 23:24   ` Paul Jackson
  2005-11-16  1:37     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Paul Jackson @ 2005-11-15 23:24 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

Mel wrote:
> -		mark -= mark / 2;			[A]
> +		mark /= 2;				[B]
>  	if (alloc_flags & ALLOC_HARDER)
> -		mark -= mark / 4;			[C]
> +		mark /= 4;				[D]

Why these changes?  For each of [A] - [D] above, if I start with a
value of mark == 33 and recycle that same mark through the above
transformation 16 times, I get the following sequence of values:

 A:  33  17   9   5   3   2   1   1   1   1   1   1   1   1   1   1
 B:  33  16   8   4   2   1   0   0   0   0   0   0   0   0   0   0
 C:  33  25  19  15  12   9   7   6   5   4   3   3   3   3   3   3
 D:  33   8   2   0   0   0   0   0   0   0   0   0   0   0   0   0

Comparing [A] to [B], observe that [A] converges to 1, but [B] to 0,
due to handling the underflow differently.

Comparing [C] to [D], observe that [D] converges to 0, due to the
different underflow, and converges much faster, since it is taking off
3/4's instead of 1/4 each iteration.

I doubt you want this change.
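If you want to reproduce this, here is a quick user-space sketch (plain C
integer arithmetic, nothing kernel-specific) that prints the four
sequences:

#include <stdio.h>

int main(void)
{
	const char *name[4] = { "A", "B", "C", "D" };
	int rule, i;

	for (rule = 0; rule < 4; rule++) {
		unsigned long mark = 33;

		printf(" %s:", name[rule]);
		for (i = 0; i < 16; i++) {
			printf(" %3lu", mark);
			switch (rule) {
			case 0: mark -= mark / 2; break;	/* [A] */
			case 1: mark /= 2;        break;	/* [B] */
			case 2: mark -= mark / 4; break;	/* [C] */
			case 3: mark /= 4;        break;	/* [D] */
			}
		}
		printf("\n");
	}
	return 0;
}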

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-15 16:49 ` [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap Mel Gorman
@ 2005-11-15 23:36   ` Andi Kleen
  2005-11-16  1:43     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2005-11-15 23:36 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Tuesday 15 November 2005 17:49, Mel Gorman wrote:
> This patch adds a "usemap" to the allocator. Each bit in the usemap indicates
> whether a block of 2^(MAX_ORDER-1) pages are being used for kernel or
> easily-reclaimed allocations. This enumerates two types of allocations;

This will increase cache line footprint, which is costly.
Why can't this be done in the page flags?

-Andi


* Re: [PATCH 5/5] Light Fragmentation Avoidance V20: 005_configurable
  2005-11-15 16:50 ` [PATCH 5/5] Light Fragmentation Avoidance V20: 005_configurable Mel Gorman
@ 2005-11-15 23:39   ` Andi Kleen
  2005-11-16  1:47     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2005-11-15 23:39 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, mingo, linux-kernel, nickpiggin, lhms-devel

On Tuesday 15 November 2005 17:50, Mel Gorman wrote:
> The anti-defragmentation strategy has memory overhead. This patch allows
> the strategy to be disabled for small memory systems or if it is known the
> workload is suffering because of the strategy. It also acts to show where
> the anti-defrag strategy interacts with the standard buddy allocator.

If anything this should be a boot time option or perhaps sysctl, not a config.
In general CONFIGs that change runtime behaviour are evil - just makes
changing the option more painful, causes problems for distribution
users, doesn't make much sense, etc.etc.

Also #ifdef as a documentation device is a really really scary concept.
Yuck.

-Andi


* Re: [PATCH 0/5] Light Fragmentation Avoidance V20
  2005-11-15 22:54 ` [PATCH 0/5] Light Fragmentation Avoidance V20 Paul Jackson
@ 2005-11-16  1:34   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-16  1:34 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Tue, 15 Nov 2005, Paul Jackson wrote:

> I'm sure you've stated this before, but could you repeat it?
>
> What's the driving motivation for this, and what's the essential
> capability required?
>

For me, the ultimate aim is the transparent support of huge pages, which
would be a general benefit to any application that uses large amounts of
address space or uses its address space sparsely.  Low fragmentation is
a prerequisite before you even start trying.  Patches have been submitted
for the demand paging of huge pages but obviously more is needed. This
patchset should help the demand allocation of huge pages.

Other benefits are;

1. Benefits hotplug on some architectures, particularly ppc64 (fringe benefit)
2. HPC jobs that need to reset a system to a state with large pages
   available, without rebooting, benefit from this: they are likely to
   get their huge pages if they stop all running processes, dd a large
   file from /dev/zero and delete it again (fringe benefit)
3. Lower fragmentation means the per-cpu allocator is likely to be able
   to allocate pages in large batches, avoiding multiple calls to the
   allocator. Jobs that are cache sensitive may benefit if they tend to
   fault their address space in chunks, as they get pages that are
   contiguous in physical and virtual memory (general benefit, patch
   available)
4. Prezeroing pages in large batches becomes a lot more feasible (general
   benefit, needs patch that does not regress performance)
5. Potentially reduces the blocks used for scatter/gather IO. In an
   earlier thread, it was noted that Windows is much better at providing
   large pages for DMA than Linux is (potential benefit, haven't measured it)

Think that covers the main points. Someone will chime in if I missed
something important.

Lastly, benchmarks on my testbed show the patches to be as fast as, or
faster than, the standard allocator. So far, benchmark results showing
the contrary have not been posted.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags
  2005-11-15 23:00   ` Paul Jackson
  2005-11-15 23:04     ` Randy.Dunlap
@ 2005-11-16  1:36     ` Mel Gorman
  2005-11-20 14:45       ` [Lhms-devel] " Paul Jackson
  1 sibling, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2005-11-16  1:36 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-mm, mingo, linux-kernel, nickpiggin, lhms-devel

On Tue, 15 Nov 2005, Paul Jackson wrote:

> Mel wrote:
> >  #define __GFP_VALID	((__force gfp_t)0x80000000u) /* valid GFP flags */
> >
> > +/*
> > + * Allocation type modifier
> > + * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
> > + */
> > +#define __GFP_EASYRCLM   0x80000u  /* User and other easily reclaimed pages */
> > +
>
> How about fitting the style (casts, just one line) of the other flags,
> so that these added six lines become instead just the one line:
>
>    #define __GFP_EASYRCLM   ((__force gfp_t)0x80000u)  /* easily reclaimed pages */
>
> (Yeah - it was probably me that asked for -more- comments sometime in
> the past - consistency is not my strong suit ;).
>

No, you're right, my declaration is wrong. Changed to

+#define __GFP_EASYRCLM   ((__force gfp_t)0x80000u)

The comment to the right was removed because the comment above the
declaration covers everything.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 4/5] Light Fragmentation Avoidance V20: 004_percpu
  2005-11-15 23:24   ` Paul Jackson
@ 2005-11-16  1:37     ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-16  1:37 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Tue, 15 Nov 2005, Paul Jackson wrote:

> Mel wrote:
> > -		mark -= mark / 2;			[A]
> > +		mark /= 2;				[B]
> >  	if (alloc_flags & ALLOC_HARDER)
> > -		mark -= mark / 4;			[C]
> > +		mark /= 4;				[D]
>
> Why these changes?  For each of [A] - [D] above, if I start with a
> value of mark == 33 and recycle that same mark through the above
> transformation 16 times, I get the following sequence of values:


This change by me is totally totally wrong. I shouldn't have modified how
the calculation is made at all. Fix made.

>  A:  33  17   9   5   3   2   1   1   1   1   1   1   1   1   1   1
>  B:  33  16   8   4   2   1   0   0   0   0   0   0   0   0   0   0
>  C:  33  25  19  15  12   9   7   6   5   4   3   3   3   3   3   3
>  D:  33   8   2   0   0   0   0   0   0   0   0   0   0   0   0   0
>
> Comparing [A] to [B], observe that [A] converges to 1, but [B] to 0,
> due to handling the underflow differently.
>
> Comparing [C] to [D], observe that [D] converges to 0, due to the
> different underflow, and converges much faster, since it is taking off
> 3/4's instead of 1/4 each iteration.
>
> I doubt you want this change.
>

And you'd be right.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-15 23:36   ` Andi Kleen
@ 2005-11-16  1:43     ` Mel Gorman
  2005-11-16  1:52       ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2005-11-16  1:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Wed, 16 Nov 2005, Andi Kleen wrote:

> On Tuesday 15 November 2005 17:49, Mel Gorman wrote:
> > This patch adds a "usemap" to the allocator. Each bit in the usemap indicates
> > whether a block of 2^(MAX_ORDER-1) pages are being used for kernel or
> > easily-reclaimed allocations. This enumerates two types of allocations;
>
> This will increase cache line footprint, which is costly.
> Why can't this be done in the page flags?
>

I actually did a version of these patches using page flags; it is sitting
in a temporary directory. For allocation, it derived the type a page was
reserved for from the list it was on and, on free, it used the flags to
determine what free list it should go back to. There were a few reasons
why I didn't submit it:

1. I was using a page flag, a valuable commodity, and thought I would get
   kicked for it. The usemap uses 1 bit per 2^(MAX_ORDER-1) pages. Page
   flags use 2^(MAX_ORDER-1) bits in the worst case (worked numbers below
   the list).
2. Fragmentation avoidance tended to break down, very fast.
3. When changing a block of pages from one type to another, there was no
   fast way to make sure all pages currently allocated would end up on
   the correct free list
4. Using page flags performed slower than using a usemap, at least with
   aim9. As using the usemap did not regress loads like kernel compiles,
   aim9 or anything else I thought to test, I figured it was not a
   problem.
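To put numbers on point 1, assuming MAX_ORDER = 11 and 4 KiB pages, so a
block is 1024 pages (4 MiB):

	/*
	 * usemap:      1 bit per 1024-page block, i.e. 256 bits (32 bytes)
	 *              of bitmap per 1 GiB of memory
	 * page flags:  1 bit in each page's flags word, i.e. 1024 flag bits
	 *              per block in the worst case
	 */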

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 5/5] Light Fragmentation Avoidance V20: 005_configurable
  2005-11-15 23:39   ` Andi Kleen
@ 2005-11-16  1:47     ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-16  1:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, mingo, linux-kernel, nickpiggin, lhms-devel

On Wed, 16 Nov 2005, Andi Kleen wrote:

> On Tuesday 15 November 2005 17:50, Mel Gorman wrote:
> > The anti-defragmentation strategy has memory overhead. This patch allows
> > the strategy to be disabled for small memory systems or if it is known the
> > workload is suffering because of the strategy. It also acts to show where
> > the anti-defrag strategy interacts with the standard buddy allocator.
>
> If anything this should be a boot time option or perhaps sysctl, not a config.

I'll take a look at what's involved in doing this. Using a compile time
option, I was depending on the compiler to see that

for (i = 0; i < RCLM_TYPES; i++) {}

would only ever iterate once and get rid of the loop. If I think there is
any chance of these patches getting merged, I'll work on making this a
sysctl or boot-time option rather than a compile option.
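As a rough sketch of what the compile-time option relies on (constants as
in the 005_configurable patch, wrapped in a user-space program just to
show the idea):

#include <stdio.h>

/*
 * With the config option off, RCLM_TYPES is the compile-time constant 1,
 * so the compiler can see the loop body runs exactly once and can drop
 * the loop entirely. A boot-time flag or sysctl would turn the bound into
 * a variable and keep the loop.
 */
#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
#define RCLM_TYPES 2
#else
#define RCLM_TYPES 1
#endif

#define for_each_rclmtype(type) \
	for (type = 0; type < RCLM_TYPES; type++)

int main(void)
{
	unsigned long count[RCLM_TYPES] = { 42 };
	unsigned long total = 0;
	int type;

	for_each_rclmtype(type)
		total += count[type];

	printf("%d type(s), %lu free\n", RCLM_TYPES, total);
	return 0;
}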

> In general CONFIGs that change runtime behaviour are evil - just makes
> changing the option more painful, causes problems for distribution
> users, doesn't make much sense, etc.etc.
>

Agreed, but I felt that some mechanism for disabling this for small
systems was desirable. As it is right now, I see this as an option for
systems with very little memory available.

> Also #ifdef as a documentation device is a really really scary concept.
> Yuck.
>

Can't argue with you there. However, for the purposes of discussion here,
it shows exactly where anti-defrag affects the current allocator.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-16  1:43     ` Mel Gorman
@ 2005-11-16  1:52       ` Andi Kleen
  2005-11-16  2:07         ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2005-11-16  1:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Wednesday 16 November 2005 02:43, Mel Gorman wrote:

> 1. I was using a page flag, valuable commodity, thought I would get kicked
>    for it. Usemap uses 1 bit per 2^(MAX_ORDER-1) pages. Page flags uses
>    2^(MAX_ORDER-1) bits at worse case.

Why does it need multiple bits? A page can only be in one order at a
time, can't it?

> 2. Fragmentation avoidance tended to break down, very fast.

Why? The algorithm should be the same, no?

> 3. When changing a block of pages from one type to another, there was no
>    fast way to make sure all pages currently allocation would end up on
>    the correct free list

If you can change the bitmap, you can change mem_map as well.

-Andi


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-16  1:52       ` Andi Kleen
@ 2005-11-16  2:07         ` Mel Gorman
  2005-11-22 10:13           ` Andy Whitcroft
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2005-11-16  2:07 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Wed, 16 Nov 2005, Andi Kleen wrote:

> On Wednesday 16 November 2005 02:43, Mel Gorman wrote:
>
> > 1. I was using a page flag, valuable commodity, thought I would get kicked
> >    for it. Usemap uses 1 bit per 2^(MAX_ORDER-1) pages. Page flags uses
> >    2^(MAX_ORDER-1) bits at worse case.
>
> Why does it need multiple bits? A page can only be in one order at a
> time, can't it?
>

Yes, but with 1024 pages in one block that is one bit per page. The
usemap uses 1 bit for all 1024.

> > 2. Fragmentation avoidance tended to break down, very fast.
>
> Why? The algorithm should the same, no?
>

That's what I thought when I wrote it first but it broke down fast
according to bench-stresshighalloc. I'll need to re-examine the patches
and see where I went wrong.

> > 3. When changing a block of pages from one type to another, there was no
> >    fast way to make sure all pages currently allocation would end up on
> >    the correct free list
>
> If you can change the bitmap you can change as well mem_map
>

That's iterating through, potentially, 1024 pages, which I considered too
expensive. In terms of code complexity, the page-flags patch adds 237
lines, which is not much of a saving in comparison to the 275 that the
usemap approach uses.

Again, I could revisit the page-flag approach if I thought that something
like this would get merged and people would not choke on another page flag
being consumed.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 3/5] Light Fragmentation Avoidance V20: 003_fragcore
  2005-11-15 16:50 ` [PATCH 3/5] Light Fragmentation Avoidance V20: 003_fragcore Mel Gorman
@ 2005-11-16  2:35   ` KAMEZAWA Hiroyuki
  2005-11-16 10:42     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2005-11-16  2:35 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, mingo, linux-kernel, nickpiggin, lhms-devel

Hi,

> +/* Remove an element from the buddy allocator from the fallback list */
> +static struct page *__rmqueue_fallback(struct zone *zone, int order,
> +							int alloctype)

Should we avoid this fallback as much as possible ?
I think this is a weak point of this approach.


> +		/*
> +		 * If breaking a large block of pages, place the buddies
> +		 * on the preferred allocation list
> +		 */
> +		if (unlikely(current_order >= MAX_ORDER / 2)) {
> +			alloctype = !alloctype;
> +			change_pageblock_type(zone, page);
> +			area = &zone->free_area_lists[alloctype][current_order];
> +		}
Changing RCLM_NORCLM to RCLM_EASY is okay?
If so, I think adding similar code to free_pages_bulk() is better.

-- Kame


* Re: [PATCH 3/5] Light Fragmentation Avoidance V20: 003_fragcore
  2005-11-16  2:35   ` KAMEZAWA Hiroyuki
@ 2005-11-16 10:42     ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-16 10:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, mingo, linux-kernel, nickpiggin, lhms-devel

On Wed, 16 Nov 2005, KAMEZAWA Hiroyuki wrote:

> > +/* Remove an element from the buddy allocator from the fallback list */
> > +static struct page *__rmqueue_fallback(struct zone *zone, int order,
> > +    int alloctype)
>
> Should we avoid this fallback as much as possible ?

Avoiding fallback as much as possible is something I would push into a
zone approach that can be developed separately from this. With this set, I
want to give hard guarantees about fallbacks in special zones and best
effort everywhere else. Taking complex steps to avoid tough fallbacks here
hurts the general path on a typical machine.

> I think this is a weak point of this approach.
>
> > +    /*
> > +     * If breaking a large block of pages, place the buddies
> > +     * on the preferred allocation list
> > +     */
> > +    if (unlikely(current_order >= MAX_ORDER / 2)) {
> > +    alloctype = !alloctype;
> > +    change_pageblock_type(zone, page);
> > +    area = &zone->free_area_lists[alloctype][current_order];
> > +    }
> Changing RCLM_NORCLM to RLCM_EASY is okay ??

Yes. If anything, it's the other way around that one would be concerned
about.
The anti-defrag approach just groups related allocations together as much
as possible. If the grouping is not possible without taking expensive
steps like balancing or reclaiming, it tries to steal the largest
possible block from the other list to reduce the chances that fallbacks
will occur in the near future.
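In pseudo-form, the search does something like this (a simplified sketch,
not the actual 003_fragcore code; dequeueing and splitting the block with
expand() are left out, and find_fallback_block() is just an illustrative
name):

static struct page *find_fallback_block(struct zone *zone, int order,
					int alloctype)
{
	int current_order;

	/* Search the other type's lists from the largest order down so
	 * the largest possible block is stolen */
	for (current_order = MAX_ORDER - 1; current_order >= order;
						current_order--) {
		struct free_area *area;
		struct page *page;

		area = &zone->free_area_lists[!alloctype][current_order];
		if (list_empty(&area->free_list))
			continue;

		page = list_entry(area->free_list.next, struct page, lru);

		/* Stealing a large block: retag it for the preferred type
		 * so later frees and allocations of this type land here
		 * instead of falling back again */
		if (current_order >= MAX_ORDER / 2)
			change_pageblock_type(zone, page);

		return page;
	}

	return NULL;
}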

> If so, I think adding similar code to free_pages_bulk() is better.
>

It's at allocation time that you know whether fallbacks are needed or not.
To do something similar at free, you are entering the realm of watermarks,
balances and tunables. As it is, the usemap tells __free_pages_bulk() what
free list pages should be going back to.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [Lhms-devel] Re: [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags
  2005-11-16  1:36     ` Mel Gorman
@ 2005-11-20 14:45       ` Paul Jackson
  0 siblings, 0 replies; 31+ messages in thread
From: Paul Jackson @ 2005-11-20 14:45 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, mingo, linux-kernel, nickpiggin, lhms-devel

Mel wrote:
> +#define __GFP_EASYRCLM   ((__force gfp_t)0x80000u)
> 
> Comment to right removed because the comment above the declaration covers
> everything.

(repeating myself) 
> How about fitting the style (casts, just one line) of the other flags,
> so that these added six lines become instead just the one line:

There is a consistent layout, one-per-line, to the other __GFP_*
flags.  The information content of the extra five lines you use for
the __GFP_EASYRCLM flag does not warrant upsetting that layout, in
my view.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-16  2:07         ` Mel Gorman
@ 2005-11-22 10:13           ` Andy Whitcroft
  2005-11-22 10:19             ` Mel Gorman
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Andy Whitcroft @ 2005-11-22 10:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

Mel Gorman wrote:

> That's iterating through, potentially, 1024 pages which I considered too
> expensive. In terms of code complexity, the page-flags patch adds 237
> which is not much of a saving in comparison to 275 that the usemap
> approach uses.

Surely you would just use a single bit in the first page of a MAX_ORDER
block.  We guarantee that the mem_map is contiguous out to MAX_ORDER
pages, so you can simply calculate the offset.  The page free path does
the same thing to find the buddy pages when coalescing.
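Something along these lines, perhaps (an untested sketch;
maxorder_block_head() is just an illustrative name, and PAGES_PER_MAXORDER
is 1 << (MAX_ORDER-1) as in the patch set's mmzone.h):

static inline struct page *maxorder_block_head(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long head_pfn = pfn & ~((unsigned long)PAGES_PER_MAXORDER - 1);

	/* mem_map is contiguous within a MAX_ORDER block, so plain pointer
	 * arithmetic reaches the first page of the block */
	return page - (pfn - head_pfn);
}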

> Again, I can revisit the page-flag approach if I thought that something
> like this would get merged and people would not choke on another page flag
> being consumed.

All of that said, I am not even sure we have a bit left in the page
flags on smaller architectures :/.

-apw


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-22 10:13           ` Andy Whitcroft
@ 2005-11-22 10:19             ` Mel Gorman
  2005-11-22 10:22             ` Andi Kleen
  2005-11-22 11:35             ` Mel Gorman
  2 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-22 10:19 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andi Kleen, linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Tue, 22 Nov 2005, Andy Whitcroft wrote:

> Mel Gorman wrote:
>
> > That's iterating through, potentially, 1024 pages which I considered too
> > expensive. In terms of code complexity, the page-flags patch adds 237
> > which is not much of a saving in comparison to 275 that the usemap
> > approach uses.
>
> Surley you would just use a single bit in the first page of a MAX_ORDER
> block.

No, because you need the flag at free time to determine what list it
should be going to. There is no guarantee that the first page remains
allocated or that it has not been used for fallback.

> We guarentee that the mem_map is contigious out to MAX_ORDER
> pages so you can simply calculate the offset.  The page free path does
> the same thing to find the buddy pages when coallescing.
>

That's finding buddies, not finding the first page in the MAX_ORDER block.

I revisited the page-flag approach anyway and got it working properly. It
currently stands at 160 code insertions. It's being run through benchmarks
at the moment.

> > Again, I can revisit the page-flag approach if I thought that something
> > like this would get merged and people would not choke on another page flag
> > being consumed.
>
> All of that said, I am not even sure we have a bit left in the page
> flags on smaller architectures :/.
>




-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-22 10:13           ` Andy Whitcroft
  2005-11-22 10:19             ` Mel Gorman
@ 2005-11-22 10:22             ` Andi Kleen
  2005-11-22 10:35               ` Mel Gorman
  2005-11-22 11:35             ` Mel Gorman
  2 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2005-11-22 10:22 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Mel Gorman, Andi Kleen, linux-mm, mingo, lhms-devel,
	linux-kernel, nickpiggin

> All of that said, I am not even sure we have a bit left in the page
> flags on smaller architectures :/.

How about

#define PG_checked               8      /* kill me in 2.5.<early>. */

?

At least PG_uncached isn't used on many architectures either, so it could
be reused. I don't know why those that use it don't check VMAs instead.

-Andi


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-22 10:22             ` Andi Kleen
@ 2005-11-22 10:35               ` Mel Gorman
  2005-11-22 10:48                 ` KAMEZAWA Hiroyuki
  2005-11-22 10:54                 ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-22 10:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Whitcroft, linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Tue, 22 Nov 2005, Andi Kleen wrote:

> > All of that said, I am not even sure we have a bit left in the page
> > flags on smaller architectures :/.
>
> How about
>
> #define PG_checked               8      /* kill me in 2.5.<early>. */
>
> ?
>
> At least PG_uncached isn't used on many architectures too, so could
> be reused. I don't know why those that use it don't check VMAs instead.
>

PG_unchecked appears to be totally unused. Its only users are the macros
that manipulate the bit and mm/page_alloc.c. It appears it has been a
long time since it was used, so it is a candidate for reuse.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-22 10:35               ` Mel Gorman
@ 2005-11-22 10:48                 ` KAMEZAWA Hiroyuki
  2005-11-22 19:40                   ` Mel Gorman
  2005-11-22 10:54                 ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2005-11-22 10:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andy Whitcroft, linux-mm, mingo, lhms-devel,
	linux-kernel, nickpiggin

Mel Gorman wrote:
> On Tue, 22 Nov 2005, Andi Kleen wrote:
> 
> 
>>>All of that said, I am not even sure we have a bit left in the page
>>>flags on smaller architectures :/.
>>
>>How about
>>
>>#define PG_checked               8      /* kill me in 2.5.<early>. */
>>
>>?
>>
>>At least PG_uncached isn't used on many architectures too, so could
>>be reused. I don't know why those that use it don't check VMAs instead.
>>
> 
> 
> PG_unchecked appears to be totally unused. It's only users are the macros
> that manipulate the bit and mm/page_alloc.c . It appears it has been a
> long time since it was used to it is a canditate for reuse.
> 
Considering memory hotplug, I don't want to resize bitmaps at hot-add/remove.
No bitmap is welcome :)

-- Kame


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-22 10:35               ` Mel Gorman
  2005-11-22 10:48                 ` KAMEZAWA Hiroyuki
@ 2005-11-22 10:54                 ` KAMEZAWA Hiroyuki
  2005-11-22 11:10                   ` Mel Gorman
  1 sibling, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2005-11-22 10:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andy Whitcroft, linux-mm, mingo, lhms-devel,
	linux-kernel, nickpiggin

Mel Gorman wrote:
>>#define PG_checked               8      /* kill me in 2.5.<early>. */
>>
>>?
>>
>>At least PG_uncached isn't used on many architectures too, so could
>>be reused. I don't know why those that use it don't check VMAs instead.
>>
> 
> 
> PG_unchecked appears to be totally unused. It's only users are the macros
> that manipulate the bit and mm/page_alloc.c . It appears it has been a
> long time since it was used to it is a canditate for reuse.
> 

Just a notification..
from 2.6.14

PageUncached      375 include/asm-ia64/uaccess.h 	if (PageUncached(page))
PageUncached      393 include/asm-ia64/uaccess.h 	if (PageUncached(page))

This is used by /dev/mem


PageChecked       196 fs/afs/dir.c   		if (!PageChecked(page))
PageChecked       169 fs/ext2/dir.c  		if (!PageChecked(page))
PageChecked      1372 fs/ext3/inode.c 	if (!page_has_buffers(page) || PageChecked(page)) {
PageChecked      1441 fs/ext3/inode.c 	WARN_ON(PageChecked(page));
PageChecked      2350 fs/reiserfs/inode.c 	int checked = PageChecked(page);
PageChecked      2853 fs/reiserfs/inode.c 	WARN_ON(PageChecked(page));

This is used by fs, now.

-- kame


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-22 10:54                 ` KAMEZAWA Hiroyuki
@ 2005-11-22 11:10                   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-22 11:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andi Kleen, Andy Whitcroft, linux-mm, mingo, lhms-devel,
	linux-kernel, nickpiggin

On Tue, 22 Nov 2005, KAMEZAWA Hiroyuki wrote:

> Mel Gorman wrote:
> > > #define PG_checked               8      /* kill me in 2.5.<early>. */
> > >
> > > ?
> > >
> > > At least PG_uncached isn't used on many architectures too, so could
> > > be reused. I don't know why those that use it don't check VMAs instead.
> > >
> >
> >
> > PG_unchecked appears to be totally unused. It's only users are the macros
> > that manipulate the bit and mm/page_alloc.c . It appears it has been a
> > long time since it was used to it is a canditate for reuse.
> >
>
> Just a notification..
> from 2.6.14
>
> PageUncached      375 include/asm-ia64/uaccess.h       if (PageUncached(page))
> PageUncached      393 include/asm-ia64/uaccess.h       if (PageUncached(page))
>
> This is used by /dev/mem
>
>
> PageChecked       196 fs/afs/dir.c     if (!PageChecked(page))
> PageChecked       169 fs/ext2/dir.c    if (!PageChecked(page))
> PageChecked      1372 fs/ext3/inode.c  if (!page_has_buffers(page) ||
> PageChecked(page)) {
> PageChecked      1441 fs/ext3/inode.c  WARN_ON(PageChecked(page));
> PageChecked      2350 fs/reiserfs/inode.c      int checked =
> PageChecked(page);
> PageChecked      2853 fs/reiserfs/inode.c      WARN_ON(PageChecked(page));
>
> This is used by fs, now.
>

d'oh. I was looking for the flags, not the macros :(

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-22 10:13           ` Andy Whitcroft
  2005-11-22 10:19             ` Mel Gorman
  2005-11-22 10:22             ` Andi Kleen
@ 2005-11-22 11:35             ` Mel Gorman
  2 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-22 11:35 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andi Kleen, linux-mm, mingo, lhms-devel, linux-kernel, nickpiggin

On Tue, 22 Nov 2005, Andy Whitcroft wrote:

> Mel Gorman wrote:
>
> > That's iterating through, potentially, 1024 pages which I considered too
> > expensive. In terms of code complexity, the page-flags patch adds 237
> > which is not much of a saving in comparison to 275 that the usemap
> > approach uses.
>
> Surley you would just use a single bit in the first page of a MAX_ORDER
> block.   We guarentee that the mem_map is contigious out to MAX_ORDER
> pages so you can simply calculate the offset.  The page free path does
> the same thing to find the buddy pages when coallescing.
>
> > Again, I can revisit the page-flag approach if I thought that something
> > like this would get merged and people would not choke on another page flag
> > being consumed.
>
> All of that said, I am not even sure we have a bit left in the page
> flags on smaller architectures :/.
>

Based on 2.6.15-rc1-mm2, there is a macro FLAGS_RESERVED defined in
include/linux/mmzone.h which says how many bits are reserved for the
node+zone bits in the page flags, with the remainder for normal page
flags. It's currently 9, leaving 21 bits for page flags, of which 19 are
used. I don't think using another bit will cause breakage, but I imagine
it will make eyebrows furrow.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap
  2005-11-22 10:48                 ` KAMEZAWA Hiroyuki
@ 2005-11-22 19:40                   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2005-11-22 19:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andi Kleen, Andy Whitcroft, linux-mm, mingo, lhms-devel,
	linux-kernel, nickpiggin

On Tue, 22 Nov 2005, KAMEZAWA Hiroyuki wrote:

> Mel Gorman wrote:
> > On Tue, 22 Nov 2005, Andi Kleen wrote:
> >
> >
> > > > All of that said, I am not even sure we have a bit left in the page
> > > > flags on smaller architectures :/.
> > >
> > > How about
> > >
> > > #define PG_checked               8      /* kill me in 2.5.<early>. */
> > >
> > > ?
> > >
> > > At least PG_uncached isn't used on many architectures too, so could
> > > be reused. I don't know why those that use it don't check VMAs instead.
> > >
> >
> >
> > PG_unchecked appears to be totally unused. It's only users are the macros
> > that manipulate the bit and mm/page_alloc.c . It appears it has been a
> > long time since it was used to it is a canditate for reuse.
> >
> Considering memory hotplug, I don't want to resize bitmaps at hot-add/remove.
> no bitmap is welcome :)
>

A version has now been posted that has no usemap, with the subject "Light
fragmentation avoidance without usemap". Details on the implementation and
benchmarks are included.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


Thread overview: 31+ messages
2005-11-15 16:49 [PATCH 0/5] Light Fragmentation Avoidance V20 Mel Gorman
2005-11-15 16:49 ` [PATCH 1/5] Light Fragmentation Avoidance V20: 001_antidefrag_flags Mel Gorman
2005-11-15 23:00   ` Paul Jackson
2005-11-15 23:04     ` Randy.Dunlap
2005-11-16  1:36     ` Mel Gorman
2005-11-20 14:45       ` [Lhms-devel] " Paul Jackson
2005-11-15 16:49 ` [PATCH 2/5] Light Fragmentation Avoidance V20: 002_usemap Mel Gorman
2005-11-15 23:36   ` Andi Kleen
2005-11-16  1:43     ` Mel Gorman
2005-11-16  1:52       ` Andi Kleen
2005-11-16  2:07         ` Mel Gorman
2005-11-22 10:13           ` Andy Whitcroft
2005-11-22 10:19             ` Mel Gorman
2005-11-22 10:22             ` Andi Kleen
2005-11-22 10:35               ` Mel Gorman
2005-11-22 10:48                 ` KAMEZAWA Hiroyuki
2005-11-22 19:40                   ` Mel Gorman
2005-11-22 10:54                 ` KAMEZAWA Hiroyuki
2005-11-22 11:10                   ` Mel Gorman
2005-11-22 11:35             ` Mel Gorman
2005-11-15 16:50 ` [PATCH 3/5] Light Fragmentation Avoidance V20: 003_fragcore Mel Gorman
2005-11-16  2:35   ` KAMEZAWA Hiroyuki
2005-11-16 10:42     ` Mel Gorman
2005-11-15 16:50 ` [PATCH 4/5] Light Fragmentation Avoidance V20: 004_percpu Mel Gorman
2005-11-15 23:24   ` Paul Jackson
2005-11-16  1:37     ` Mel Gorman
2005-11-15 16:50 ` [PATCH 5/5] Light Fragmentation Avoidance V20: 005_configurable Mel Gorman
2005-11-15 23:39   ` Andi Kleen
2005-11-16  1:47     ` Mel Gorman
2005-11-15 22:54 ` [PATCH 0/5] Light Fragmentation Avoidance V20 Paul Jackson
2005-11-16  1:34   ` Mel Gorman
