linux-mm.kvack.org archive mirror
* [PATCH 0/9] fragmentation avoidance
@ 2005-09-26 20:01 Joel Schopp
  2005-09-26 20:03 ` [PATCH 1/9] add defrag flags Joel Schopp
                   ` (10 more replies)
  0 siblings, 11 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: lhms, Linux Memory Management List, linux-kernel, Mel Gorman,
	Mike Kravetz, jschopp

The buddy system provides an efficient algorithm for managing a set of pages
within each zone. Despite the proven effectiveness of the algorithm in its
current form as used in the kernel, it is not possible to aggregate a subset
of pages within a zone according to specific allocation types. As a result,
two physically contiguous page frames (or sets of page frames) may satisfy
allocation requests that are drastically different. For example, one page
frame may contain data that is only temporarily used by an application while
the other is in use for a kernel device driver.  This can result in heavy
system fragmentation.

This series of patches is designed to reduce fragmentation in the standard
buddy allocator without impairing the performance of the allocator. High
fragmentation in the standard binary buddy allocator means that high-order
allocations can rarely be serviced. These patches work by dividing allocations
into three different types:

UserReclaimable - These are userspace pages that are easily reclaimable. Right
	now, all allocations of GFP_USER, GFP_HIGHUSER and disk buffers are
	in this category. These pages are trivially reclaimed by writing
	the page out to swap or syncing with backing storage.

KernelReclaimable - These are pages allocated by the kernel that are easily
	reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
	These types of pages could potentially be reclaimed by dumping the
	caches and reaping the slabs.

KernelNonReclaimable - These are pages that are allocated by the kernel that
	are not trivially reclaimed. For example, the memory allocated for a
	loaded module would be in this category. By default, allocations are
	considered to be of this type.
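
For illustration, here is roughly where typical allocations land under this
scheme (the flag usage matches patch 1 of this series; the kmalloc() line is
just a hypothetical example of the untagged default):

	/* Illustrative call sites only, not a patch hunk */
	page = alloc_page(GFP_HIGHUSER);		/* UserReclaimable */
	dentry = kmem_cache_alloc(dentry_cache,
				  GFP_KERNEL | __GFP_KERNRCLM);	/* KernelReclaimable */
	buf = kmalloc(size, GFP_KERNEL);		/* KernelNonReclaimable (default) */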

Instead of having one global MAX_ORDER-sized array of free lists, there are
four: one for each of the three allocation types, plus one holding a roughly
12.5% reserve used for fallbacks. Finally, there is a list of 2^MAX_ORDER-sized
blocks of pages which acts as a global pool of the largest pages the kernel
deals with.

Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free-lists for that type, in effect reserving it. Hence, over
time, pages of the different types can be clustered together. This means that
if we wanted 2^MAX_ORDER pages, we could linearly scan a block of pages
allocated for UserReclaimable and page each of them out.

Fallback is used when there are no 2^MAX_ORDER pages available and there
are no free pages of the desired type. The fallback lists were chosen in a
way that keeps the most easily reclaimable pages together.
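
In rough C, the per-zone allocation path that the later patches implement looks
like this (a simplified rendering of __rmqueue() from patch 6, using
steal_largepage(), remove_page() and fallback_alloc() from patches 4, 6 and 8;
it is a sketch, not a hunk from the series):

static struct page *rmqueue_sketch(struct zone *zone, unsigned int order,
				   int alloctype)
{
	struct free_area *area;
	unsigned int current_order;
	struct page *page;

	/* 1. Search only the free lists reserved for this allocation type */
	for (current_order = order; current_order < MAX_ORDER; current_order++) {
		area = &zone->free_area_lists[alloctype][current_order];
		if (list_empty(&area->free_list))
			continue;
		page = list_entry(area->free_list.next, struct page, lru);
		area->nr_free--;
		return remove_page(zone, page, order, current_order, area);
	}

	/* 2. Steal a whole large block from another type; the usemap is
	 *    updated so the block is now reserved for this type */
	page = steal_largepage(zone, alloctype);
	if (page)
		return remove_page(zone, page, order, MAX_ORDER - 1,
				   &zone->free_area_lists[alloctype][MAX_ORDER - 1]);

	/* 3. Last resort: take pages from the other types' lists, largest
	 *    blocks first, in an order that keeps fragmentation clustered */
	return fallback_alloc(alloctype, zone, order);
}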

These patches originally were discussed as "Avoiding external fragmentation
with a placement policy" as authored by Mel Gorman and went through about 13
revisions on lkml and linux-mm.  Then with Mel's permission I have been
reworking these patches for easier mergeability, readability, maintainability,
etc.  Several revisions have been posted on lhms-devel, as the Linux memory
hotplug community will be a major beneficiary of these patches.  All of the
various revisions have been tested on various platforms and shown to perform
well.  I believe the patches are now ready for inclusion in -mm, and after
wider testing inclusion in the mainline kernel.

The patch set consists of 9 patches that can be merged in 4 separate blocks,
with the only dependency being that the lower numbered patches are merged
first.  All are against 2.6.13.
Patch 1 defines the allocation flags and adds them to the allocator calls.
Patch 2 defines some new structures and the macros used to access them.
Patches 3-8 implement the fully functional fragmentation avoidance.
Patch 9 is trivial but useful for memory hotplug remove.
---
Patch 10 -- not ready for merging -- extends fragmentation avoidance to the
percpu allocator.  It works on 2.6.13-rc1, but on 2.6.13 only with NUMA off;
I am having a great deal of trouble tracking down why, and help would be
appreciated.  I include the patch for review and test purposes, as I plan to
submit it for merging after resolving the NUMA issues.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/9] add defrag flags
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
@ 2005-09-26 20:03 ` Joel Schopp
  2005-09-27  0:16   ` Kyle Moffett
  2005-09-26 20:05 ` [PATCH 2/9] declare defrag structs Joel Schopp
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 623 bytes --]

This patch adds 2 new GFP flags corresponding to the 3 allocation types; the
third type is indicated by neither of the two flags being set. It then modifies
the appropriate allocator calls to use these new flags.
The flags are:
__GFP_USER, which marks easily reclaimable userspace (and buffer) pages
__GFP_KERNRCLM, which marks reclaimable kernel pages

Also note that the __GFP_USER flag should be reusable by the HARDWALL folks as
well.
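
For illustration only, this is roughly how the allocator side decodes the bits
(the RCLM_* constants and RCLM_SHIFT are introduced in patch 2, and the decode
itself appears in patches 5 and 6):

	/* The two flag bits select one of three allocation types */
	int alloctype = (gfp_mask & __GFP_RCLM_BITS) >> RCLM_SHIFT;
	/* 0 = RCLM_NORCLM (neither flag set), 1 = RCLM_USER, 2 = RCLM_KERN */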

This patch was originally authored by Mel Gorman, and heavily modified by me.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

[-- Attachment #2: 1_add_defrag_flags --]
[-- Type: text/plain, Size: 5255 bytes --]

Index: 2.6.13-joel2/fs/buffer.c
===================================================================
--- 2.6.13-joel2.orig/fs/buffer.c	2005-09-13 14:54:13.%N -0500
+++ 2.6.13-joel2/fs/buffer.c	2005-09-13 15:02:01.%N -0500
@@ -1119,7 +1119,8 @@ grow_dev_page(struct block_device *bdev,
 	struct page *page;
 	struct buffer_head *bh;
 
-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, index,
+				   GFP_NOFS | __GFP_USER);
 	if (!page)
 		return NULL;
 
@@ -3044,7 +3045,8 @@ static void recalc_bh_state(void)
 	
 struct buffer_head *alloc_buffer_head(unsigned int __nocast gfp_flags)
 {
-	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+	struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+						   gfp_flags|__GFP_KERNRCLM);
 	if (ret) {
 		preempt_disable();
 		__get_cpu_var(bh_accounting).nr++;
Index: 2.6.13-joel2/fs/dcache.c
===================================================================
--- 2.6.13-joel2.orig/fs/dcache.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/dcache.c	2005-09-13 15:02:01.%N -0500
@@ -721,7 +721,7 @@ struct dentry *d_alloc(struct dentry * p
 	struct dentry *dentry;
 	char *dname;
 
-	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); 
+	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);
 	if (!dentry)
 		return NULL;
 
Index: 2.6.13-joel2/fs/ext2/super.c
===================================================================
--- 2.6.13-joel2.orig/fs/ext2/super.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ext2/super.c	2005-09-13 15:02:01.%N -0500
@@ -138,7 +138,8 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
 	struct ext2_inode_info *ei;
-	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+						SLAB_KERNEL|__GFP_KERNRCLM);
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
Index: 2.6.13-joel2/fs/ext3/super.c
===================================================================
--- 2.6.13-joel2.orig/fs/ext3/super.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ext3/super.c	2005-09-13 15:02:01.%N -0500
@@ -440,7 +440,7 @@ static struct inode *ext3_alloc_inode(st
 {
 	struct ext3_inode_info *ei;
 
-	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
Index: 2.6.13-joel2/fs/ntfs/inode.c
===================================================================
--- 2.6.13-joel2.orig/fs/ntfs/inode.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ntfs/inode.c	2005-09-13 15:05:53.%N -0500
@@ -317,7 +317,7 @@ struct inode *ntfs_alloc_big_inode(struc
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return VFS_I(ni);
@@ -342,7 +342,7 @@ static inline ntfs_inode *ntfs_alloc_ext
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return ni;
Index: 2.6.13-joel2/include/linux/gfp.h
===================================================================
--- 2.6.13-joel2.orig/include/linux/gfp.h	2005-09-13 14:54:17.%N -0500
+++ 2.6.13-joel2/include/linux/gfp.h	2005-09-13 15:02:01.%N -0500
@@ -41,21 +41,30 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
 #define __GFP_NORECLAIM  0x20000u /* No realy zone reclaim during allocation */
 
-#define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
+/* Allocation type modifiers, group together if possible
+ * __GFP_USER: Allocation for a user page or a buffer page
+ * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
+ */
+#define __GFP_USER	0x40000u /* User or buffer page, easily reclaimable */
+#define __GFP_KERNRCLM	0x80000u /* Reclaimable kernel allocation */
+#define __GFP_RCLM_BITS (__GFP_USER|__GFP_KERNRCLM)
+
+#define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
 
 /* if you forget to add the bitmask here kernel will crash, period */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_NORECLAIM)
+			__GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_KERNRCLM|__GFP_USER)
 
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_NOIO	(__GFP_WAIT)
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_USER)
+#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | \
+			 __GFP_USER)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/9] declare defrag structs
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
  2005-09-26 20:03 ` [PATCH 1/9] add defrag flags Joel Schopp
@ 2005-09-26 20:05 ` Joel Schopp
  2005-09-26 20:06 ` [PATCH 3/9] initialize defrag Joel Schopp
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 788 bytes --]

This patch declares most of the structures needed by fragmentation avoidance
and associated macros.

There are two things of note in this patch.
1. free_area_usemap is in the zone information by default and in the mem_section
for CONFIG_SPARSEMEM.  This is done to be more efficient for both memory
hotplug add and memory hotplug remove, which add and remove sections. With the
macros this placement should be transparent.

2. free_area_usemap requires more than 32 bits and fewer than 64 bits for all
known architectures and possible configurations; a compile-time check has been
added to make sure future architectures still have this property.
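
As a worked example of the size constraint, using only constants from this
patch: a section spanning N blocks of 2^(MAX_ORDER-1) pages needs
N * BITS_PER_RCLM_TYPE = 2N usemap bits, so the 64-bit DECLARE_BITMAP placed
in mem_section accommodates sections of up to 32 such blocks.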

Originally authored by Mel Gorman and heavily modified by me.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

[-- Attachment #2: 2_declare_defrag_structs --]
[-- Type: text/plain, Size: 4276 bytes --]

Index: 2.6.13-joel2/include/linux/mmzone.h
===================================================================
--- 2.6.13-joel2.orig/include/linux/mmzone.h	2005-09-13 14:54:17.%N -0500
+++ 2.6.13-joel2/include/linux/mmzone.h	2005-09-19 16:26:18.%N -0500
@@ -21,6 +21,21 @@
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
 
+/*
+ * The two-bit field __GFP_RCLM_BITS enumerates the following 4 kinds of
+ * page reclaimability.
+ */
+#define RCLM_TYPES 4
+#define RCLM_NORCLM 0
+#define RCLM_USER 1
+#define RCLM_KERN 2
+#define RCLM_FALLBACK 3
+
+#define RCLM_SHIFT 18
+#define BITS_PER_RCLM_TYPE 2
+
+#define BITS_PER_ALLOC_TYPE 2
+
 struct free_area {
 	struct list_head	free_list;
 	unsigned long		nr_free;
@@ -137,7 +152,45 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
+	/*
+	 *  free_area to be removed in later patch  as it is replaced by
+	 *  free_area_list
+	 */
 	struct free_area	free_area[MAX_ORDER];
+#ifndef CONFIG_SPARSEMEM
+	/*
+	 * The map tracks what each 2^MAX_ORDER-1 sized block is being used for.
+	 * Each 2^MAX_ORDER-1 block of pages has BITS_PER_ALLOC_TYPE bits in
+	 * this map to remember what the block is for. When a page is freed,
+	 * its index within this bitmap is calculated in get_pageblock_type().
+	 * This means that pages will always be freed into the correct list in
+	 * free_area_lists.
+	 *
+	 * The bits are set when a 2^MAX_ORDER block of pages is split
+ 	 */
+ 	unsigned long		*free_area_usemap;
+#endif
+
+	/*
+	 * free_area_lists contains buddies of split MAX_ORDER blocks indexed
+	 * by their intended allocation type, while free_area_global contains
+	 * whole MAX_ORDER blocks that can be used for any allocation type.
+	 */
+	struct free_area	free_area_lists[RCLM_TYPES][MAX_ORDER];
+
+	/*
+	 * A percentage of a zone is reserved for falling back to. Without
+	 * a fallback, memory will slowly fragment over time meaning the
+	 * placement policy only delays the fragmentation problem, not
+	 * fixes it
+	 */
+	unsigned long fallback_reserve;
+
+	/*
+	 * When negative, 2^MAX_ORDER-1 sized blocks of pages will be reserved
+	 * for fallbacks
+	 */
+	long fallback_balance;
 
 
 	ZONE_PADDING(_pad1_)
@@ -230,6 +283,17 @@ struct zone {
 } ____cacheline_maxaligned_in_smp;
 
 
+static inline void inc_reserve_count(struct zone* zone, int type)
+{
+	if(type == RCLM_FALLBACK)
+		zone->fallback_reserve++;
+}
+static inline void dec_reserve_count(struct zone* zone, int type)
+{
+	if(type == RCLM_FALLBACK && zone->fallback_reserve)
+		zone->fallback_reserve--;
+}
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
@@ -473,6 +537,9 @@ extern struct pglist_data contig_page_da
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #endif
+#if ((SECTION_SIZE_BITS - MAX_ORDER) * BITS_PER_ALLOC_TYPE) > 64
+#error free_area_usemap is not big enough
+#endif
 
 struct page;
 struct mem_section {
@@ -485,6 +552,7 @@ struct mem_section {
 	 * before using it wrong.
 	 */
 	unsigned long section_mem_map;
+	DECLARE_BITMAP(free_area_usemap,64);
 };
 
 extern struct mem_section mem_section[NR_MEM_SECTIONS];
@@ -536,6 +604,17 @@ static inline struct mem_section *__pfn_
 	return __nr_to_section(pfn_to_section_nr(pfn));
 }
 
+static inline unsigned long *pfn_to_usemap(struct zone *zone, unsigned long pfn)
+{
+	return &__pfn_to_section(pfn)->free_area_usemap[0];
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+	pfn &= (PAGES_PER_SECTION-1);
+	return (int)((pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE);
+}
+
 #define pfn_to_page(pfn) 						\
 ({ 									\
 	unsigned long __pfn = (pfn);					\
@@ -572,6 +651,15 @@ static inline int pfn_valid(unsigned lon
 void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
+static inline unsigned long *pfn_to_usemap(struct zone *zone, unsigned long pfn)
+{
+	return (zone->free_area_usemap);
+}
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+	pfn = pfn - zone->zone_start_pfn;
+	return (int)((pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE);
+}
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_NODES_SPAN_OTHER_NODES

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3/9] initialize defrag
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
  2005-09-26 20:03 ` [PATCH 1/9] add defrag flags Joel Schopp
  2005-09-26 20:05 ` [PATCH 2/9] declare defrag structs Joel Schopp
@ 2005-09-26 20:06 ` Joel Schopp
  2005-09-26 20:09 ` [PATCH 4/9] defrag helper functions Joel Schopp
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 179 bytes --]

This patch allocates and initializes the newly added structures. Nothing
exciting.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>


[-- Attachment #2: 3_initialize_defrag --]
[-- Type: text/plain, Size: 3595 bytes --]

Index: 2.6.13-joel2/mm/sparse.c
===================================================================
--- 2.6.13-joel2.orig/mm/sparse.c	2005-09-19 16:28:09.%N -0500
+++ 2.6.13-joel2/mm/sparse.c	2005-09-19 16:47:01.%N -0500
@@ -100,6 +100,22 @@ static struct page *sparse_early_mem_map
 	return NULL;
 }
 
+static int sparse_early_alloc_init_section(unsigned long pnum)
+{
+	struct page *map;
+	struct mem_section *ms = __nr_to_section(pnum);
+
+	map = sparse_early_mem_map_alloc(pnum);
+	if (!map)
+		return 1;
+
+	sparse_init_one_section(ms, pnum, map);
+
+	set_bit(RCLM_NORCLM, ms->free_area_usemap);
+
+	return 0;
+}
+
 /*
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.
@@ -107,16 +123,19 @@ static struct page *sparse_early_mem_map
 void sparse_init(void)
 {
 	unsigned long pnum;
-	struct page *map;
+	int rc;
 
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!valid_section_nr(pnum))
-			continue;
+			continue;
+		rc = sparse_early_alloc_init_section(pnum);
+		if(rc) goto out_error;
 
-		map = sparse_early_mem_map_alloc(pnum);
-		if (map)
-			sparse_init_one_section(&mem_section[pnum], pnum, map);
 	}
+	return;
+
+ out_error:
+	printk("initialization error in sparse_early_alloc_init_section()\n");
 }
 
 /*
Index: 2.6.13-joel2/mm/page_alloc.c
===================================================================
--- 2.6.13-joel2.orig/mm/page_alloc.c	2005-09-19 16:28:40.%N -0500
+++ 2.6.13-joel2/mm/page_alloc.c	2005-09-19 17:13:53.%N -0500
@@ -1669,9 +1669,16 @@ void zone_init_free_lists(struct pglist_
 				unsigned long size)
 {
 	int order;
-	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
-		zone->free_area[order].nr_free = 0;
+	int type;
+	struct free_area *area;
+
+	/* Initialise the per-type, size-ordered lists of free_areas */
+	for (type=0; type < RCLM_TYPES; type++) {
+		for (order = 0; order < MAX_ORDER; order++) {
+			area = zone->free_area_lists[type];
+			INIT_LIST_HEAD(&area[order].free_list);
+			area[order].nr_free = 0;
+		}
 	}
 }
 
@@ -1849,6 +1856,40 @@ void __init setup_per_cpu_pageset()
 }
 
 #endif
+#ifndef CONFIG_SPARSEMEM
+#define roundup(x, y) ((((x)+((y)-1))/(y))*(y))
+/*
+ * Calculate the size of the zone->usemap in bytes rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
+ * Then figure 1 RCLM_TYPE worth of bits per MAX_ORDER-1, finally round up
+ * what is now in bits to nearest long in bits, then return it in bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+	unsigned long usemapsize;
+
+	usemapsize = roundup(zonesize, MAX_ORDER-1);
+	usemapsize = usemapsize >> (MAX_ORDER-1);
+	usemapsize *= BITS_PER_RCLM_TYPE;
+	usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+	return usemapsize >> 3;
+}
+
+static void free_area_usemap_init(unsigned long size, struct pglist_data *pgdat,
+				  struct zone *zone)
+{
+	unsigned long usemapsize;
+
+	usemapsize = usemap_size(size);
+	zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+	memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+}
+
+#else
+static void free_area_usemap_init(unsigned long size, struct pglist_data *pgdat,
+				  struct zone *zone){}
+#endif /* CONFIG_SPARSEMEM */
 
 /*
  * Set up the zone data structures:
@@ -1938,6 +1979,8 @@ static void __init free_area_init_core(s
 
 		zone_start_pfn += size;
 
+		free_area_usemap_init(size, pgdat, zone);
+
 		zone_init_free_lists(pgdat, zone, zone->spanned_pages);
 	}
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/9] defrag helper functions
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
                   ` (2 preceding siblings ...)
  2005-09-26 20:06 ` [PATCH 3/9] initialize defrag Joel Schopp
@ 2005-09-26 20:09 ` Joel Schopp
  2005-09-26 22:29   ` Alex Bligh - linux-kernel
  2005-09-26 20:11 ` [PATCH 5/9] propagate defrag alloc types Joel Schopp
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 275 bytes --]

This patch contains a handful of trivial functions, and one fairly short function
that finds an unallocated 2^MAX_ORDER-1 sized block from one type and moves it
to another type.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>



[-- Attachment #2: 4_defrag_helper_funcs --]
[-- Type: text/plain, Size: 2882 bytes --]

Index: 2.6.13-joel2/mm/page_alloc.c
===================================================================
--- 2.6.13-joel2.orig/mm/page_alloc.c	2005-09-20 13:45:47.%N -0500
+++ 2.6.13-joel2/mm/page_alloc.c	2005-09-20 14:16:35.%N -0500
@@ -63,7 +63,62 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
 
 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
+static inline int need_min_fallback_reserve(struct zone *zone)
+{
+	return (zone->free_pages >> MAX_ORDER) < zone->fallback_reserve;
+}
+static inline int is_min_fallback_reserved(struct zone *zone)
+{
+	return zone->fallback_balance < 0;
+}
+static inline unsigned int get_pageblock_type(struct zone *zone,
+					      struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int i, bitidx;
+	unsigned int type = 0;
+	unsigned long *usemap;
+
+	bitidx = pfn_to_bitidx(zone, pfn);
+	usemap = pfn_to_usemap(zone, pfn);
+
+	for (i=0; i < BITS_PER_RCLM_TYPE; i++) {
+		type = (type << 1);
+		type |= (!!test_bit(bitidx+i, usemap));
+	}
+
+	return type;
+}
 
+void assign_bit(int bit_nr, unsigned long* map, int value)
+{
+	switch (value) {
+	case 0:
+		clear_bit(bit_nr, map);
+		break;
+	default:
+		set_bit(bit_nr, map);
+	}
+}
+/*
+ * Reserve a block of pages for an allocation type & enforce function
+ * being changed if more bits are added to keep track of additional types
+ */
+BUILD_BUG_ON(BITS_PER_RCLM_TYPE > 2)
+static inline void set_pageblock_type(struct zone *zone, struct page *page,
+				      int type)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int bitidx;
+	unsigned long *usemap;
+
+	usemap = pfn_to_usemap(zone, pfn);
+	bitidx = pfn_to_bitidx(zone, pfn);
+
+	assign_bit(bitidx, usemap, (type & 0x1));
+	assign_bit(bitidx + 1, usemap, (type & 0x2));
+
+}
 /*
  * Used by page_zone() to look up the address of the struct zone whose
  * id is encoded in the upper bits of page->flags
@@ -465,6 +520,41 @@ static void prep_new_page(struct page *p
 	kernel_map_pages(page, 1 << order, 1);
 }
 
+/*
+ * Find a list that has a 2^MAX_ORDER-1 block of pages available and
+ * return it
+ */
+static inline struct page* steal_largepage(struct zone *zone, int alloctype)
+{
+	struct page *page;
+	struct free_area *area;
+	int i=0;
+
+	for(i = 0; i < RCLM_TYPES; i++) {
+		if(i == alloctype)
+			continue;
+
+		area = &zone->free_area_lists[i][MAX_ORDER-1];
+		if(!list_empty(&area->free_list))
+			break;
+	}
+	if (i == RCLM_TYPES) return NULL;
+
+	page = list_entry(area->free_list.next, struct page, lru);
+	area->nr_free--;
+
+	if (!is_min_fallback_reserved(zone) &&
+	    need_min_fallback_reserve(zone)) {
+		alloctype = RCLM_FALLBACK;
+	}
+
+	set_pageblock_type(zone, page, alloctype);
+	dec_reserve_count(zone, i);
+	inc_reserve_count(zone, alloctype);
+
+	return page;
+}
+
 /* 
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 5/9] propagate defrag alloc types
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
                   ` (3 preceding siblings ...)
  2005-09-26 20:09 ` [PATCH 4/9] defrag helper functions Joel Schopp
@ 2005-09-26 20:11 ` Joel Schopp
  2005-09-26 20:13 ` [PATCH 6/9] fragmentation avoidance core Joel Schopp
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 210 bytes --]

Now that we have this new information of alloctype, this patch propagates it to
functions where it will be useful.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

[-- Attachment #2: 5_propogate_defrag_types --]
[-- Type: text/plain, Size: 3677 bytes --]

Index: 2.6.13-joel2/mm/page_alloc.c
===================================================================
--- 2.6.13-joel2.orig/mm/page_alloc.c	2005-09-20 14:16:35.%N -0500
+++ 2.6.13-joel2/mm/page_alloc.c	2005-09-20 15:08:05.%N -0500
@@ -559,7 +559,8 @@ static inline struct page* steal_largepa
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+			      int alloctype)
 {
 	struct free_area * area;
 	unsigned int current_order;
@@ -587,7 +588,8 @@ static struct page *__rmqueue(struct zon
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order, 
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list,
+			int alloctype)
 {
 	unsigned long flags;
 	int i;
@@ -596,7 +598,7 @@ static int rmqueue_bulk(struct zone *zon
 	
 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, alloctype);
 		if (page == NULL)
 			break;
 		allocated++;
@@ -775,7 +777,8 @@ static inline void prep_zero_page(struct
  * or two.
  */
 static struct page *
-buffered_rmqueue(struct zone *zone, int order, unsigned int __nocast gfp_flags)
+buffered_rmqueue(struct zone *zone, int order, unsigned int __nocast gfp_flags,
+		 int alloctype)
 {
 	unsigned long flags;
 	struct page *page = NULL;
@@ -787,8 +790,8 @@ buffered_rmqueue(struct zone *zone, int 
 		pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
-			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+			pcp->count += rmqueue_bulk(zone, 0, pcp->batch,
+						   &pcp->list, alloctype);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -800,7 +803,7 @@ buffered_rmqueue(struct zone *zone, int 
 
 	if (page == NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, alloctype);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
@@ -876,7 +879,9 @@ __alloc_pages(unsigned int __nocast gfp_
 	int do_retry;
 	int can_try_harder;
 	int did_some_progress;
-
+	int alloctype;
+
+	alloctype = (gfp_mask & __GFP_RCLM_BITS);
 	might_sleep_if(wait);
 
 	/*
@@ -921,7 +926,7 @@ zone_reclaim_retry:
 			}
 		}
 
-		page = buffered_rmqueue(z, order, gfp_mask);
+		page = buffered_rmqueue(z, order, gfp_mask, alloctype);
 		if (page)
 			goto got_pg;
 	}
@@ -945,7 +950,7 @@ zone_reclaim_retry:
 		if (wait && !cpuset_zone_allowed(z))
 			continue;
 
-		page = buffered_rmqueue(z, order, gfp_mask);
+		page = buffered_rmqueue(z, order, gfp_mask, alloctype);
 		if (page)
 			goto got_pg;
 	}
@@ -959,7 +964,8 @@ zone_reclaim_retry:
 			for (i = 0; (z = zones[i]) != NULL; i++) {
 				if (!cpuset_zone_allowed(z))
 					continue;
-				page = buffered_rmqueue(z, order, gfp_mask);
+				page = buffered_rmqueue(z, order, gfp_mask,
+							alloctype);
 				if (page)
 					goto got_pg;
 			}
@@ -996,7 +1002,7 @@ rebalance:
 			if (!cpuset_zone_allowed(z))
 				continue;
 
-			page = buffered_rmqueue(z, order, gfp_mask);
+			page = buffered_rmqueue(z, order, gfp_mask, alloctype);
 			if (page)
 				goto got_pg;
 		}
@@ -1015,7 +1021,7 @@ rebalance:
 			if (!cpuset_zone_allowed(z))
 				continue;
 
-			page = buffered_rmqueue(z, order, gfp_mask);
+			page = buffered_rmqueue(z, order, gfp_mask, alloctype);
 			if (page)
 				goto got_pg;
 		}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 6/9] fragmentation avoidance core
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
                   ` (4 preceding siblings ...)
  2005-09-26 20:11 ` [PATCH 5/9] propagate defrag alloc types Joel Schopp
@ 2005-09-26 20:13 ` Joel Schopp
  2005-09-26 20:14 ` [PATCH 7/9] try harder on large allocations Joel Schopp
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 705 bytes --]

This patch is the core changes for memory fragmentation avoidance.

__rmqueue() within an alloctype behaves much as it did before on the global
list. If that alloctype has insufficient free blocks it tries to
steal an unallocated 2^MAX_ORDER-1 block from another type.  If there are no
unallocated 2^MAX_ORDER-1 blocks it goes to a more aggressive fallback
allocation detailed in a later patch.

The other functions do basically the same thing they did before, they just get
tidied up to deal with 3 alloctypes.

Since this patch replaces all references to free_area[] with free_area_lists[],
it also removes free_area[].

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

[-- Attachment #2: 6_defrag_core --]
[-- Type: text/plain, Size: 10456 bytes --]

Index: 2.6.13-joel2/mm/page_alloc.c
===================================================================
--- 2.6.13-joel2.orig/mm/page_alloc.c	2005-09-20 15:08:05.%N -0500
+++ 2.6.13-joel2/mm/page_alloc.c	2005-09-21 11:13:14.%N -0500
@@ -104,7 +104,9 @@ void assign_bit(int bit_nr, unsigned lon
  * Reserve a block of pages for an allocation type & enforce function
  * being changed if more bits are added to keep track of additional types
  */
-BUILD_BUG_ON(BITS_PER_RCLM_TYPE > 2)
+#if BITS_PER_RCLM_TYPE > 2
+#error
+#endif
 static inline void set_pageblock_type(struct zone *zone, struct page *page,
 				      int type)
 {
@@ -333,6 +335,8 @@ static inline void __free_pages_bulk (st
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
+	struct free_area *area;
+	struct free_area *freelist;
 
 	if (unlikely(order))
 		destroy_compound_page(page, order);
@@ -342,10 +346,11 @@ static inline void __free_pages_bulk (st
 	BUG_ON(page_idx & (order_size - 1));
 	BUG_ON(bad_range(zone, page));
 
+	freelist = zone->free_area_lists[get_pageblock_type(zone,page)];
+
 	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		unsigned long combined_idx;
-		struct free_area *area;
 		struct page *buddy;
 
 		combined_idx = __find_combined_index(page_idx, order);
@@ -356,16 +361,20 @@ static inline void __free_pages_bulk (st
 		if (!page_is_buddy(buddy, order))
 			break;		/* Move the buddy up one level. */
 		list_del(&buddy->lru);
-		area = zone->free_area + order;
+		area = freelist + order;
 		area->nr_free--;
 		rmv_page_order(buddy);
 		page = page + (combined_idx - page_idx);
 		page_idx = combined_idx;
 		order++;
 	}
+	if (unlikely(order == MAX_ORDER-1)) zone->fallback_balance++;
+
 	set_page_order(page, order);
-	list_add(&page->lru, &zone->free_area[order].free_list);
-	zone->free_area[order].nr_free++;
+	area = freelist + order;
+	list_add_tail(&page->lru, &area->free_list);
+	area->nr_free++;
+
 }
 
 static inline void free_pages_check(const char *function, struct page *page)
@@ -555,6 +564,25 @@ static inline struct page* steal_largepa
 	return page;
 }
 
+static inline struct page
+*remove_page(struct zone *zone, struct page *page, unsigned int order,
+	     unsigned int current_order, struct free_area *area)
+{
+	if (unlikely(current_order == MAX_ORDER-1)) zone->fallback_balance--;
+	list_del(&page->lru);
+	rmv_page_order(page);
+	zone->free_pages -= 1UL << order;
+	return expand(zone, page, order, current_order, area);
+}
+
+
+static struct page *
+fallback_alloc(int alloctype, struct zone *zone, unsigned int order)
+{
+	/* Stubbed out for separate review; NULL equates to no fallback */
+	return NULL;
+
+}
 /* 
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
@@ -566,20 +594,24 @@ static struct page *__rmqueue(struct zon
 	unsigned int current_order;
 	struct page *page;
 
-	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
-		if (list_empty(&area->free_list))
-			continue;
-
+	alloctype >>= RCLM_SHIFT;
+	/* Search the list for the alloctype */
+	area = zone->free_area_lists[alloctype] + order;
+	for (current_order = order; current_order < MAX_ORDER;
+	     current_order++, area++) {
+		if (list_empty(&area->free_list)) continue;
 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		rmv_page_order(page);
-		area->nr_free--;
-		zone->free_pages -= 1UL << order;
-		return expand(zone, page, order, current_order, area);
+ 		area->nr_free--;
+		return remove_page(zone, page, order, current_order, area);
 	}
 
-	return NULL;
+	page = steal_largepage(zone, alloctype);
+	if (page == NULL)
+		return fallback_alloc(alloctype, zone, order);
+
+	area--;
+	current_order--;
+	return remove_page(zone, page, order, current_order, area);
 }
 
 /* 
@@ -587,25 +619,49 @@ static struct page *__rmqueue(struct zon
  * a single hold of the lock, for efficiency.  Add them to the supplied list.
  * Returns the number of new pages which were placed at *list.
  */
-static int rmqueue_bulk(struct zone *zone, unsigned int order, 
-			unsigned long count, struct list_head *list,
-			int alloctype)
+static int rmqueue_bulk(struct zone *zone, unsigned long count,
+			struct list_head *list, int alloctype)
 {
 	unsigned long flags;
 	int i;
-	int allocated = 0;
+	unsigned long allocated = count;
 	struct page *page;
-	
+	unsigned long current_order = 0;
+	/* Find what order we should start allocating blocks at */
+	current_order = ffs(count) - 1;
+
+
+	/*
+	 * Satisfy the request with the largest possible physically
+	 * contiguous blocks
+	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order, alloctype);
-		if (page == NULL)
-			break;
-		allocated++;
-		list_add_tail(&page->lru, list);
+	while (allocated) {
+		if ((1 << current_order) > allocated)
+			current_order--;
+
+		/* Allocate a block at the current_order */
+		page = __rmqueue(zone, current_order, alloctype);
+		if (page == NULL) {
+			if (current_order == 0) break;
+			current_order--;
+			continue;
+		}
+		allocated -= 1 << current_order;
+		/* Move to the next block if order is already 0 */
+		if (current_order == 0) {
+			list_add_tail(&page->lru, list);
+			continue;
+		}
+
+		/* Split the large block into order-sized blocks  */
+		for (i = 1 << current_order; i != 0; i--) {
+			list_add_tail(&page->lru, list);
+			page++;
+		}
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
-	return allocated;
+	return count - allocated;
 }
 
 #ifdef CONFIG_NUMA
@@ -664,9 +720,9 @@ static void __drain_pages(unsigned int c
 void mark_free_pages(struct zone *zone)
 {
 	unsigned long zone_pfn, flags;
-	int order;
+	int order, type;
+	unsigned long start_pfn, i;
 	struct list_head *curr;
-
 	if (!zone->spanned_pages)
 		return;
 
@@ -674,14 +730,15 @@ void mark_free_pages(struct zone *zone)
 	for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
 		ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));
 
-	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
-			unsigned long start_pfn, i;
-
-			start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
-
+	for (type=0; type < RCLM_TYPES; type++) {
+		for (order = MAX_ORDER - 1; order >= 0; --order)
+			list_for_each(curr, &zone->free_area_lists[type]
+				      [order].free_list) {
+			start_pfn = page_to_pfn(list_entry(curr, struct page,
+							   lru));
 			for (i=0; i < (1<<order); i++)
 				SetPageNosaveFree(pfn_to_page(start_pfn+i));
+		}
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -790,7 +847,7 @@ buffered_rmqueue(struct zone *zone, int 
 		pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
-			pcp->count += rmqueue_bulk(zone, 0, pcp->batch,
+			pcp->count += rmqueue_bulk(zone, pcp->batch,
 						   &pcp->list, alloctype);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
@@ -831,6 +888,7 @@ int zone_watermark_ok(struct zone *z, in
 	/* free_pages my go negative - that's OK */
 	long min = mark, free_pages = z->free_pages - (1 << order) + 1;
 	int o;
+	struct free_area *kernnorclm, *kernrclm, *userrclm;
 
 	if (gfp_high)
 		min -= min / 2;
@@ -839,15 +897,21 @@ int zone_watermark_ok(struct zone *z, in
 
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return 0;
+	kernnorclm = z->free_area_lists[RCLM_NORCLM];
+	kernrclm = z->free_area_lists[RCLM_KERN];
+	userrclm = z->free_area_lists[RCLM_USER];
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
-
+		free_pages -= (kernnorclm->nr_free + kernrclm->nr_free +
+			       userrclm->nr_free) << o;
 		/* Require fewer higher order pages to be free */
 		min >>= 1;
 
 		if (free_pages <= min)
 			return 0;
+		kernnorclm++;
+		kernrclm++;
+		userrclm++;
 	}
 	return 1;
 }
@@ -1370,6 +1434,7 @@ void show_free_areas(void)
 	unsigned long inactive;
 	unsigned long free;
 	struct zone *zone;
+	int type;
 
 	for_each_zone(zone) {
 		show_node(zone);
@@ -1463,8 +1528,10 @@ void show_free_areas(void)
 
 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
-			nr = zone->free_area[order].nr_free;
-			total += nr << order;
+			for (type=0; type < RCLM_TYPES; type++) {
+				nr = zone->free_area_lists[type][order].nr_free;
+				total += nr << order;
+			}
 			printk("%lu*%lukB ", nr, K(1UL) << order);
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
@@ -2171,16 +2238,36 @@ static int frag_show(struct seq_file *m,
 	struct zone *zone;
 	struct zone *node_zones = pgdat->node_zones;
 	unsigned long flags;
-	int order;
+	int order, type;
+	struct list_head *elem;
+	unsigned long nr_bufs = 0;
 
 	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
 		if (!zone->present_pages)
 			continue;
 
 		spin_lock_irqsave(&zone->lock, flags);
-		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-		for (order = 0; order < MAX_ORDER; ++order)
-			seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+		seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+		for (order = 0; order < MAX_ORDER-1; ++order) {
+			nr_bufs = 0;
+
+			for (type=0; type < RCLM_TYPES; type++) {
+				list_for_each(elem,
+					      &(zone->free_area_lists[type]
+						[order].free_list))
+					++nr_bufs;
+			}
+			seq_printf(m, "%6lu ", nr_bufs);
+		}
+
+		/* Scan global list */
+		nr_bufs = 0;
+		for (type=0; type < RCLM_TYPES; type++) {
+			nr_bufs += zone->free_area_lists[type]
+				[MAX_ORDER-1].nr_free;
+		}
+		seq_printf(m, "%6lu ", nr_bufs);
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
Index: 2.6.13-joel2/include/linux/mmzone.h
===================================================================
--- 2.6.13-joel2.orig/include/linux/mmzone.h	2005-09-20 15:04:47.%N -0500
+++ 2.6.13-joel2/include/linux/mmzone.h	2005-09-21 10:44:00.%N -0500
@@ -152,11 +152,7 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	/*
-	 *  free_area to be removed in later patch  as it is replaced by
-	 *  free_area_list
-	 */
-	struct free_area	free_area[MAX_ORDER];
+
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * The map tracks what each 2^MAX_ORDER-1 sized block is being used for.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 7/9] try harder on large allocations
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
                   ` (5 preceding siblings ...)
  2005-09-26 20:13 ` [PATCH 6/9] fragmentation avoidance core Joel Schopp
@ 2005-09-26 20:14 ` Joel Schopp
  2005-09-27  7:21   ` Coywolf Qi Hunt
  2005-09-26 20:16 ` [PATCH 8/9] defrag fallback Joel Schopp
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 331 bytes --]

Fragmentation avoidance patches increase our chances of satisfying high order
allocations.  So this patch makes more than one attempt at fulfilling those
allocations, because unlike before the extra iterations are often useful.
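
As a concrete example, assuming the common MAX_ORDER of 11: order >= MAX_ORDER/2
means order 5 and up, so requests for 32 or more contiguous pages (128KB and
larger with 4KB pages) skip the immediate out_of_memory() call and instead get a
bounded number of extra retries, while smaller requests behave exactly as before.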

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

[-- Attachment #2: 7_large_alloc_try_harder --]
[-- Type: text/plain, Size: 1069 bytes --]

Index: 2.6.13-joel2/mm/page_alloc.c
===================================================================
--- 2.6.13-joel2.orig/mm/page_alloc.c	2005-09-21 11:13:14.%N -0500
+++ 2.6.13-joel2/mm/page_alloc.c	2005-09-21 11:14:49.%N -0500
@@ -944,7 +944,8 @@ __alloc_pages(unsigned int __nocast gfp_
 	int can_try_harder;
 	int did_some_progress;
 	int alloctype;
- 
+	int highorder_retry = 3;
+
 	alloctype = (gfp_mask & __GFP_RCLM_BITS);
 	might_sleep_if(wait);
 
@@ -1090,7 +1091,14 @@ rebalance:
 				goto got_pg;
 		}
 
-		out_of_memory(gfp_mask, order);
+		if (order < MAX_ORDER/2) out_of_memory(gfp_mask, order);
+		/*
+		 * Due to low fragmentation efforts, we should try a little
+		 * harder to satisfy high order allocations
+		 */
+		if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+			goto rebalance;
+
 		goto restart;
 	}
 
@@ -1107,6 +1115,8 @@ rebalance:
 			do_retry = 1;
 		if (gfp_mask & __GFP_NOFAIL)
 			do_retry = 1;
+		if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+			do_retry = 1;
 	}
 	if (do_retry) {
 		blk_congestion_wait(WRITE, HZ/50);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 8/9] defrag fallback
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
                   ` (6 preceding siblings ...)
  2005-09-26 20:14 ` [PATCH 7/9] try harder on large allocations Joel Schopp
@ 2005-09-26 20:16 ` Joel Schopp
  2005-09-26 20:17 ` [PATCH 9/9] free memory is user reclaimable Joel Schopp
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:16 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Andrew Morton, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 1263 bytes --]

When we can't allocate from the preferred allocation type we need to fall back
to other allocation types.  This patch determines which allocation types we try
to fall back to, and in which order.  It also adds a special fallback type that is
designed to minimize the fragmentation caused by fallback between the other
types.

There is an implicit tradeoff being made here between avoiding fragmentation
and satisfying allocations.  This patch aims to keep existing behavior of
satisfying allocations if there is any free memory of any type to satisfy them.
It does a reasonable job of trying to minimize the fragmentation, and certainly
does better than a stock kernel in all situations.

However, it would not be hard to imagine scenarios where a different fallback
algorithm that fails more allocations was able to keep fragmentation down much
better, and on some systems this decreased fragmentation might even be worth
the cost of failing allocations.  Systems doing memory hotplug remove for
example.  This patch is designed so that the static function
fallback_alloc() can be easily replaced with an alternate implementation (under
a config option perhaps) in the future.
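
As a rough illustration of that point, a stricter drop-in policy might consume
only the RCLM_FALLBACK reserve and otherwise fail the allocation (a hypothetical
sketch, not part of this series; it reuses remove_page() from patch 6 and the
structures from patch 2):

static struct page *
fallback_alloc_strict(int alloctype, struct zone *zone, unsigned int order)
{
	struct free_area *area;
	int current_order;
	struct page *page;

	/* Only dip into the explicit fallback reserve; never steal from the
	 * other reclaim types, trading failed allocations for less
	 * fragmentation (e.g. for memory hotplug remove) */
	for (current_order = MAX_ORDER - 1; current_order >= (int)order;
	     current_order--) {
		area = &zone->free_area_lists[RCLM_FALLBACK][current_order];
		if (list_empty(&area->free_list))
			continue;
		page = list_entry(area->free_list.next, struct page, lru);
		area->nr_free--;
		return remove_page(zone, page, order, current_order, area);
	}

	return NULL;	/* callers treat NULL as allocation failure */
}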

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

[-- Attachment #2: 8_defrag_fallback --]
[-- Type: text/plain, Size: 3586 bytes --]

Index: 2.6.13-joel2/mm/page_alloc.c
===================================================================
--- 2.6.13-joel2.orig/mm/page_alloc.c	2005-09-21 11:14:49.%N -0500
+++ 2.6.13-joel2/mm/page_alloc.c	2005-09-21 11:17:23.%N -0500
@@ -39,6 +39,17 @@
 #include "internal.h"
 
 /*
+ * fallback_allocs contains the fallback types for low memory conditions
+ * where the preferred allocation type is not available.
+ */
+int fallback_allocs[RCLM_TYPES][RCLM_TYPES+1] = {
+	{RCLM_NORCLM,RCLM_FALLBACK,  RCLM_KERN,  RCLM_USER,-1},
+	{RCLM_KERN,  RCLM_FALLBACK,  RCLM_NORCLM,RCLM_USER,-1},
+	{RCLM_USER,  RCLM_FALLBACK,  RCLM_NORCLM,RCLM_KERN,-1},
+	{RCLM_FALLBACK,  RCLM_NORCLM,RCLM_KERN,  RCLM_USER,-1}
+};
+
+/*
  * MCD - HACK: Find somewhere to initialize this EARLY, or make this
  * initializer cleaner
  */
@@ -576,13 +587,86 @@ static inline struct page
 }
 
 
+/*
+ * If we are falling back, and the allocation is KERNNORCLM,
+ * then reserve any buddies for the KERNNORCLM pool. These
+ * allocations fragment the worst so this helps keep them
+ * in the one place
+ */
+static inline void
+fallback_buddy_reserve(int start_alloctype, struct zone *zone,
+		       unsigned int current_order, struct page *page)
+{
+	int reserve_type = RCLM_NORCLM;
+	struct free_area *area;
+
+	if (start_alloctype == RCLM_NORCLM) {
+		area = zone->free_area_lists[RCLM_NORCLM] + current_order;
+
+		/* Reserve the whole block if this is a large split */
+		if (current_order >= MAX_ORDER / 2) {
+			dec_reserve_count(zone, get_pageblock_type(zone,page));
+
+			/*
+			 * Use this block for fallbacks if the
+			 * minimum reserve is not being met
+			 */
+			if (!is_min_fallback_reserved(zone))
+				reserve_type = RCLM_FALLBACK;
+
+			set_pageblock_type(zone, page, reserve_type);
+			inc_reserve_count(zone, reserve_type);
+		}
+
+	}
+
+}
+
 static struct page *
 fallback_alloc(int alloctype, struct zone *zone, unsigned int order)
 {
-	/* Stub out for seperate review, NULL equates to no fallback*/
+	int *fallback_list;
+	int start_alloctype;
+	unsigned int current_order;
+	struct free_area *area;
+	struct page* page;
+
+	/* Ok, pick the fallback order based on the type */
+	fallback_list = fallback_allocs[alloctype];
+	start_alloctype = alloctype;
+
+
+	/*
+	 * Here, the alloc type lists has been depleted as well as the global
+	 * pool, so fallback. When falling back, the largest possible block
+	 * will be taken to keep the fallbacks clustered if possible
+	 */
+	while ((alloctype = *(++fallback_list)) != -1) {
+
+		/* Find a block to allocate */
+		area = zone->free_area_lists[alloctype] + MAX_ORDER;
+		current_order=MAX_ORDER;
+		do {
+			current_order--;
+			area--;
+			if (!list_empty(&area->free_list)) {
+				page = list_entry(area->free_list.next,
+						  struct page, lru);
+				area->nr_free--;
+				fallback_buddy_reserve(start_alloctype, zone,
+						       current_order, page);
+				return remove_page(zone, page, order,
+						   current_order, area);
+			}
+
+		} while (current_order != order);
+
+	}
+
 	return NULL;
 
 }
+
 /* 
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
@@ -2101,6 +2185,11 @@ static void __init free_area_init_core(s
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->fallback_reserve = 0;
+
+		/* Set the balance so about 12.5% will be used for fallbacks */
+		zone->fallback_balance = (realsize >> (MAX_ORDER-1)) -
+					 (realsize >> (MAX_ORDER+2));
 
 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 9/9] free memory is user reclaimable
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
                   ` (7 preceding siblings ...)
  2005-09-26 20:16 ` [PATCH 8/9] defrag fallback Joel Schopp
@ 2005-09-26 20:17 ` Joel Schopp
  2005-09-26 20:19 ` [PATCH 10/9] percpu splitout Joel Schopp
  2005-09-26 21:49 ` [Lhms-devel] [PATCH 0/9] fragmentation avoidance Joel Schopp
  10 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 239 bytes --]

Make free memory revert to the user reclaimable type, which is probably more
accurate, and is certainly helpful for memory hotplug remove.

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>


[-- Attachment #2: 9_free_memory_is_user --]
[-- Type: text/plain, Size: 941 bytes --]

Index: 2.6.13-joel2/mm/page_alloc.c
===================================================================
--- 2.6.13-joel2.orig/mm/page_alloc.c	2005-09-21 11:31:51.%N -0500
+++ 2.6.13-joel2/mm/page_alloc.c	2005-09-21 11:37:48.%N -0500
@@ -339,6 +339,9 @@ static inline int page_is_buddy(struct p
  * triggers coalescing into a block of larger size.            
  *
  * -- wli
+ *
+ * For hotplug memory purposes make the free memory revert to the user
+ * reclaimable type, which is probably more accurate for that state anyway.
  */
 
 static inline void __free_pages_bulk (struct page *page,
@@ -379,7 +382,10 @@ static inline void __free_pages_bulk (st
 		page_idx = combined_idx;
 		order++;
 	}
-	if (unlikely(order == MAX_ORDER-1)) zone->fallback_balance++;
+	if (unlikely(order == MAX_ORDER-1)) {
+		set_pageblock_type(zone, page, RCLM_USER);
+		zone->fallback_balance++;
+	}
 
 	set_page_order(page, order);
 	area = freelist + order;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 10/9] percpu splitout
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
                   ` (8 preceding siblings ...)
  2005-09-26 20:17 ` [PATCH 9/9] free memory is user reclaimable Joel Schopp
@ 2005-09-26 20:19 ` Joel Schopp
  2005-09-26 21:49 ` [Lhms-devel] [PATCH 0/9] fragmentation avoidance Joel Schopp
  10 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 20:19 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Andrew Morton, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 554 bytes --]

NOT READY FOR MERGING!
Only works with NUMA off on 2.6.13.  On 2.6.13 with NUMA on, free_hot_cold_page()
calls __free_pages_bulk(), which then trips BUG_ON(bad_range(zone,page)); this
does not happen on 2.6.13-rc1 kernels.  Released under the release-early,
release-often doctrine.

This patch splits the percpu lists into two types.  Kernel reclaimable and
kernel non-reclaimable allocations share one PCPU_KERN list, and userspace
allocations use the PCPU_USER list.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>


[-- Attachment #2: 10_percpu_splitout --]
[-- Type: text/plain, Size: 5504 bytes --]

Index: 2.6.13-joel2/include/linux/mmzone.h
===================================================================
--- 2.6.13-joel2.orig/include/linux/mmzone.h	2005-09-26 13:58:59.%N -0500
+++ 2.6.13-joel2/include/linux/mmzone.h	2005-09-26 13:59:38.%N -0500
@@ -57,13 +57,28 @@ struct zone_padding {
 #else
 #define ZONE_PADDING(name)
 #endif
+/*
+ * The pcpu_list is to keep kernel and userrclm allocations
+ * apart while still allowing all allocation types to have
+ * per-cpu lists
+ */
+struct pcpu_list {
+	int count;
+	struct list_head list;
+} ____cacheline_aligned_in_smp;
+
+
+/* Indices into pcpu_list */
+#define PCPU_KERN 0
+#define PCPU_USER 1
+#define PCPU_LIST_SIZE 2
 
 struct per_cpu_pages {
-	int count;		/* number of pages in the list */
-	int low;		/* low watermark, refill needed */
-	int high;		/* high watermark, emptying needed */
-	int batch;		/* chunk size for buddy add/remove */
-	struct list_head list;	/* the list of pages */
+	int count;			/* number of pages in the list */
+	struct pcpu_list pcpu_list[PCPU_LIST_SIZE];
+	int low;			/* low watermark, refill needed */
+	int high;			/* high watermark, emptying needed */
+	int batch;			/* chunk size for buddy add/remove */
 };
 
 struct per_cpu_pageset {
Index: 2.6.13-joel2/mm/page_alloc.c
===================================================================
--- 2.6.13-joel2.orig/mm/page_alloc.c	2005-09-26 13:59:27.%N -0500
+++ 2.6.13-joel2/mm/page_alloc.c	2005-09-26 13:59:38.%N -0500
@@ -775,9 +775,18 @@ void drain_remote_pages(void)
 			struct per_cpu_pages *pcp;
 
 			pcp = &pset->pcp[i];
-			if (pcp->count)
-				pcp->count -= free_pages_bulk(zone, pcp->count,
-						&pcp->list, 0);
+			if (pcp->pcpu_list[PCPU_KERN].count)
+				pcp->pcpu_list[PCPU_KERN].count -=
+					free_pages_bulk(zone,
+							pcp->pcpu_list[PCPU_KERN].count,
+							&pcp->pcpu_list[PCPU_KERN].list,
+							0);
+			if (pcp->pcpu_list[PCPU_USER].count)
+				pcp->pcpu_list[PCPU_USER].count -=
+					free_pages_bulk(zone,
+							pcp->pcpu_list[PCPU_USER].count,
+							&pcp->pcpu_list[PCPU_USER].list,
+							0);
 		}
 	}
 	local_irq_restore(flags);
@@ -798,8 +807,18 @@ static void __drain_pages(unsigned int c
 			struct per_cpu_pages *pcp;
 
 			pcp = &pset->pcp[i];
-			pcp->count -= free_pages_bulk(zone, pcp->count,
-						&pcp->list, 0);
+			pcp->pcpu_list[PCPU_KERN].count -=
+				free_pages_bulk(zone,
+						pcp->pcpu_list[PCPU_KERN].count,
+						&pcp->pcpu_list[PCPU_KERN].list,
+						0);
+
+			pcp->pcpu_list[PCPU_USER].count -=
+				free_pages_bulk(zone,
+						pcp->pcpu_list[PCPU_USER].count,
+						&pcp->pcpu_list[PCPU_USER].list,
+						0);
+
 		}
 	}
 }
@@ -881,6 +900,7 @@ static void fastcall free_hot_cold_page(
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	struct pcpu_list *plist;
 
 	arch_free_page(page, 0);
 
@@ -890,11 +910,24 @@ static void fastcall free_hot_cold_page(
 		page->mapping = NULL;
 	free_pages_check(__FUNCTION__, page);
 	pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
+
+	/*
+	 * Strictly speaking, we should not be accessing the zone information
+	 * here without the zone lock. In this case, it does not matter if
+	 * the read is incorrect.
+	 */
+	if (get_pageblock_type(zone, page) == RCLM_USER)
+		plist = &pcp->pcpu_list[PCPU_USER];
+	else
+		plist = &pcp->pcpu_list[PCPU_KERN];
+
+	if (plist->count >= pcp->high)
+		plist->count -= free_pages_bulk(zone, pcp->batch,
+						&plist->list, 0);
+
 	local_irq_save(flags);
-	list_add(&page->lru, &pcp->list);
-	pcp->count++;
-	if (pcp->count >= pcp->high)
-		pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
+	list_add(&page->lru, &plist->list);
+	plist->count++;
 	local_irq_restore(flags);
 	put_cpu();
 }
@@ -930,19 +963,28 @@ buffered_rmqueue(struct zone *zone, int 
 	unsigned long flags;
 	struct page *page = NULL;
 	int cold = !!(gfp_flags & __GFP_COLD);
+	struct pcpu_list *plist;
 
 	if (order == 0) {
 		struct per_cpu_pages *pcp;
 
 		pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 		local_irq_save(flags);
-		if (pcp->count <= pcp->low)
-			pcp->count += rmqueue_bulk(zone, pcp->batch,
-						   &pcp->list, alloctype);
-		if (pcp->count) {
-			page = list_entry(pcp->list.next, struct page, lru);
+
+		if (alloctype == __GFP_USER)
+			plist = &pcp->pcpu_list[PCPU_USER];
+		else
+			plist = &pcp->pcpu_list[PCPU_KERN];
+
+		if (plist->count <= pcp->low)
+			plist->count += rmqueue_bulk(zone,
+						     pcp->batch,
+						     &plist->list,
+						     alloctype);
+		if (plist->count) {
+			page = list_entry(plist->list.next, struct page, lru);
 			list_del(&page->lru);
-			pcp->count--;
+			plist->count--;
 		}
 		local_irq_restore(flags);
 		put_cpu();
@@ -2001,18 +2043,23 @@ inline void setup_pageset(struct per_cpu
 	struct per_cpu_pages *pcp;
 
 	pcp = &p->pcp[0];		/* hot */
-	pcp->count = 0;
+	pcp->pcpu_list[PCPU_KERN].count = 0;
+	pcp->pcpu_list[PCPU_USER].count = 0;
 	pcp->low = 2 * batch;
 	pcp->high = 6 * batch;
 	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
+	INIT_LIST_HEAD(&pcp->pcpu_list[PCPU_KERN].list);
+	INIT_LIST_HEAD(&pcp->pcpu_list[PCPU_USER].list);
 
 	pcp = &p->pcp[1];		/* cold*/
-	pcp->count = 0;
+	pcp->pcpu_list[PCPU_KERN].count = 0;
+	pcp->pcpu_list[PCPU_USER].count = 0;
 	pcp->low = 0;
 	pcp->high = 2 * batch;
 	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
+	INIT_LIST_HEAD(&pcp->pcpu_list[PCPU_KERN].list);
+	INIT_LIST_HEAD(&pcp->pcpu_list[PCPU_USER].list);
+
 }
 
 #ifdef CONFIG_NUMA

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Lhms-devel] [PATCH 0/9] fragmentation avoidance
  2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
                   ` (9 preceding siblings ...)
  2005-09-26 20:19 ` [PATCH 10/9] percpu splitout Joel Schopp
@ 2005-09-26 21:49 ` Joel Schopp
  10 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-26 21:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

> well.  I believe the patches are now ready for inclusion in -mm, and after
> wider testing inclusion in the mainline kernel.
> 
> The patch set consists of 9 patches that can be merged in 4 separate 
> blocks,
> with the only dependency being that the lower numbered patches are merged
> first.  All are against 2.6.13.
> Patch 1 defines the allocation flags and adds them to the allocator calls.
> Patch 2 defines some new structures and the macros used to access them.
> Patch 3-8 implement the fully functional fragmentation avoidance.
> Patch 9 is trivial but useful for memory hotplug remove.
> ---
> Patch 10 -- not ready for merging -- extends fragmentation avoidance to the
> percpu allocator.  This patch works on 2.6.13-rc1 but only with NUMA off on
> 2.6.13; I am having a great deal of trouble tracking down why, help 
> would be
> appreciated.  I include the patch for review and test purposes as I plan to
> submit it for merging after resolving the NUMA issues.

It was pointed out that I did not make it clear that I would like the 9 patches
in this series merged into -mm.  They are ready to go.

Patch 10 is just a bonus patch you can ignore.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/9] defrag helper functions
  2005-09-26 20:09 ` [PATCH 4/9] defrag helper functions Joel Schopp
@ 2005-09-26 22:29   ` Alex Bligh - linux-kernel
  2005-09-27 16:08     ` Joel Schopp
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Bligh - linux-kernel @ 2005-09-26 22:29 UTC (permalink / raw)
  To: Joel Schopp, Andrew Morton
  Cc: lhms, Linux Memory Management List, linux-kernel, Mel Gorman,
	Mike Kravetz, Alex Bligh - linux-kernel


--On 26 September 2005 15:09 -0500 Joel Schopp <jschopp@austin.ibm.com> 
wrote:

> +void assign_bit(int bit_nr, unsigned long* map, int value)

Maybe:
static inline void assign_bit(int bit_nr, unsigned long* map, int value)

it's short enough

>  +static struct page *
> +fallback_alloc(int alloctype, struct zone *zone, unsigned int order)
> +{
> +       /* Stub out for seperate review, NULL equates to no fallback*/
> +       return NULL;
> +
> +}

Maybe "static inline" too.

--
Alex Bligh


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] add defrag flags
  2005-09-26 20:03 ` [PATCH 1/9] add defrag flags Joel Schopp
@ 2005-09-27  0:16   ` Kyle Moffett
  2005-09-27  0:24     ` Dave Hansen
  0 siblings, 1 reply; 28+ messages in thread
From: Kyle Moffett @ 2005-09-27  0:16 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Andrew Morton, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

On Sep 26, 2005, at 16:03:30, Joel Schopp wrote:
> The flags are:
> __GFP_USER, which corresponds to easily reclaimable pages
> __GFP_KERNRCLM, which corresponds to userspace pages

Uhh, call me crazy, but don't those flags look a little backwards to  
you?  Maybe it's just me, but wouldn't it make sense to expect  
__GFP_USER to be a userspace allocation and __GFP_KERNRCLM to be an  
easily reclaimable kernel page?

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$ L++++(+ 
++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+ PGP+++ t+(+ 
++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r  !y?(-)
------END GEEK CODE BLOCK------



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] add defrag flags
  2005-09-27  0:16   ` Kyle Moffett
@ 2005-09-27  0:24     ` Dave Hansen
  2005-09-27  0:43       ` Kyle Moffett
  2005-09-27  5:44       ` Paul Jackson
  0 siblings, 2 replies; 28+ messages in thread
From: Dave Hansen @ 2005-09-27  0:24 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Joel Schopp, Andrew Morton, lhms, Linux Memory Management List,
	Linux Kernel Mailing List, Mel Gorman, Mike Kravetz

On Mon, 2005-09-26 at 20:16 -0400, Kyle Moffett wrote:
> On Sep 26, 2005, at 16:03:30, Joel Schopp wrote:
> > The flags are:
> > __GFP_USER, which corresponds to easily reclaimable pages
> > __GFP_KERNRCLM, which corresponds to userspace pages
> 
> Uhh, call me crazy, but don't those flags look a little backwards to  
> you?  Maybe it's just me, but wouldn't it make sense to expect  
> __GFP_USER to be a userspace allocation and __GFP_KERNRCLM to be an  
> easily reclaimable kernel page?

I think Joel simply made an error in his description.

__GFP_KERNRCLM corresponds to pages which are kernel-allocated, but have
some chance of being reclaimed at some point.  Basically, they're things
that will get freed back under memory pressure.  This can be direct, as
with the dcache and its slab shrinker, or more indirect as for control
structures like buffer_heads that get reclaimed after _other_ things are
freed.
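
As a concrete example, the posted patch 1 (attached later in this thread) tags
the dcache allocation this way (the comment is added here for illustration and
is not in the patch):

	/* dentries can be freed back via the slab shrinker under memory
	 * pressure, so they count as reclaimable kernel allocations */
	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);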

-- Dave


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] add defrag flags
  2005-09-27  0:24     ` Dave Hansen
@ 2005-09-27  0:43       ` Kyle Moffett
  2005-09-27  5:44       ` Paul Jackson
  1 sibling, 0 replies; 28+ messages in thread
From: Kyle Moffett @ 2005-09-27  0:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Joel Schopp, Andrew Morton, lhms, Linux Memory Management List,
	Linux Kernel Mailing List, Mel Gorman, Mike Kravetz

On Sep 26, 2005, at 20:24:08, Dave Hansen wrote:
> On Mon, 2005-09-26 at 20:16 -0400, Kyle Moffett wrote:
>> Uhh, call me crazy, but don't those flags look a little backwards  
>> to you?  Maybe it's just me, but wouldn't it make sense to expect  
>> __GFP_USER to be a userspace allocation and __GFP_KERNRCLM to be  
>> an easily reclaimable kernel page?
>
> I think Joel simply made an error in his description.
>
> __GFP_KERNRCLM corresponds to pages which are kernel-allocated, but  
> have some chance of being reclaimed at some point.  Basically,  
> they're things that will get freed back under memory pressure.   
> This can be direct, as with the dcache and its slab shrinker, or  
> more indirect as for control structures like buffer_heads that get  
> reclaimed after _other_ things are freed.

Ok, well he should fix both that description and the comment in his  
patches, and make sure that the code actually matches what it says:

> +#define __GFP_USER    0x40000u /* Kernel page that is easily  
> reclaimable */
> +#define __GFP_KERNRCLM    0x80000u /* User is a userspace user */

Cheers,
Kyle Moffett

--
Debugging is twice as hard as writing the code in the first place.   
Therefore, if you write the code as cleverly as possible, you are, by  
definition, not smart enough to debug it.
   -- Brian Kernighan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] add defrag flags
  2005-09-27  0:24     ` Dave Hansen
  2005-09-27  0:43       ` Kyle Moffett
@ 2005-09-27  5:44       ` Paul Jackson
  2005-09-27 13:34         ` Mel Gorman
  2005-09-27 18:38         ` Joel Schopp
  1 sibling, 2 replies; 28+ messages in thread
From: Paul Jackson @ 2005-09-27  5:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: mrmacman_g4, jschopp, akpm, lhms-devel, linux-mm, linux-kernel,
	mel, kravetz

Dave wrote:
> I think Joel simply made an error in his description.

Looks like he made the same mistake in the actual code comments:

+/* Allocation type modifiers, group together if possible
+ * __GPF_USER: Allocation for user page or a buffer page
+ * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
+ */
+#define __GFP_USER	0x40000u /* Kernel page that is easily reclaimable */
+#define __GFP_KERNRCLM	0x80000u /* User is a userspace user */

I'd guess you meant to write more like the following:

#define __GFP_USER   0x40000u /* Page for user address space */
#define __GFP_KERNRCLM 0x80000u /* Kernel page that is easily reclaimable */

And the block comment seems to needlessly repeat the inline comments,
add a dubious claim, and omit the interesting stuff ...  In other words:

    Does it actually matter if these two bits are grouped, or not?  I
    suspect that some of your other code, such as shifting the gfpmask by
    RCLM_SHIFT bits, _requires_ that these two bits be adjacent.  So the
    "if possible" in the comment above is misleading.

    And I suspect that gfp.h should contain the RCLM_SHIFT define, or
    at least mention in comment that RCLM_SHIFT depends on the position
    of the above two __GFP_* bits.

    And I don't see any mention in the comments in gfp.h that these
    two bits, in tandem, have an additional meaning - both bits off
    means, I guess, not reclaimable, well at least not easily.
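
    As an illustration of that dependency (a sketch only; RCLM_SHIFT and the
    RCLM_* values live in patch 2, which is not quoted here, so the exact
    numbers below are assumptions):

	/* assumed: __GFP_USER is bit 18, __GFP_KERNRCLM is bit 19 */
	#define RCLM_SHIFT	18

	/* hypothetical helper showing the implied conversion:
	 * 0 = neither bit set (not easily reclaimed),
	 * 1 = __GFP_USER, 2 = __GFP_KERNRCLM */
	static inline int gfp_to_rclm_type(unsigned int gfp_mask)
	{
		return (gfp_mask & __GFP_RCLM_BITS) >> RCLM_SHIFT;
	}

	/* moving either flag without updating RCLM_SHIFT breaks this */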

My HARDWALL patch appears to already be in Linus's kernel, so you
probably also need to do a global substitute of all instances in
the kernel of __GFP_HARDWALL, replacing it with __GFP_USER.  Here
is the list of files I see affected, with a count of the number of
__GFP_HARDWALL strings in each:

    include/linux/gfp.h:4
    kernel/cpuset.c:6
    mm/page_alloc.c:2
    mm/vmscan.c:4

The comment in the next line looks like it needs to be changed to match
the code change:

+#define __GFP_BITS_SHIFT 21	/* Room for 20 __GFP_FOO bits */

On the other hand, why did you change __GFP_BITS_SHIFT?  Isn't 20
enough - just enough?
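
A quick arithmetic check of that point (illustrative only):

	/* __GFP_KERNRCLM = 0x80000u = bit 19, the highest flag added, and
	 * __GFP_BITS_MASK = (1 << 20) - 1 = 0x000fffff covers bits 0..19,
	 * so a shift of 20 is exactly enough. */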

Why was the flag change in fs/buffer.c:grow_dev_page() to add the
__GFP_USER bit, not to add the __GFP_KERNRCLM bit?  I don't know that
code - perhaps the answer is simply that the resulting page ends up in
user space.

Aha - I just read one of the comments above that I cut+pasted.
It says that __GFP_USER means user *OR* buffer page.  That certainly
explains the fs/buffer.c code using __GFP_USER.  But it causes me to
wonder if we can equate __GFP_USER with __GFP_HARDWALL.  I'm reluctant,
but more on principle than concrete experience, to modify the meaning
of hardwall cpusets to constrain both user address space pages *AND*
buffer pages.  How open would you be to making buffers __GFP_KERNRCLM
instead of __GFP_USER?

If you have good reason to keep __GFP_USER meaning either user or buffer,
then perhaps the name __GFP_USER is misleading.

What sort of performance claims can you make for this change?  How does
it impact kernel text size?  Could we see a diffstat for the entire
patchset?  Under what sort of loads or conditions would you expect
this patchset to do more harm than good?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 7/9] try harder on large allocations
  2005-09-26 20:14 ` [PATCH 7/9] try harder on large allocations Joel Schopp
@ 2005-09-27  7:21   ` Coywolf Qi Hunt
  2005-09-27 16:17     ` Joel Schopp
  0 siblings, 1 reply; 28+ messages in thread
From: Coywolf Qi Hunt @ 2005-09-27  7:21 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Andrew Morton, lhms, Linux Memory Management List, linux-kernel,
	Mel Gorman, Mike Kravetz

On 9/27/05, Joel Schopp <jschopp@austin.ibm.com> wrote:
> Fragmentation avoidance patches increase our chances of satisfying high order
> allocations.  So this patch takes more than one iteration at trying to fulfill
> those allocations because unlike before the extra iterations are often useful.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
>
>
> Index: 2.6.13-joel2/mm/page_alloc.c
> ===================================================================
> --- 2.6.13-joel2.orig/mm/page_alloc.c   2005-09-21 11:13:14.%N -0500
> +++ 2.6.13-joel2/mm/page_alloc.c        2005-09-21 11:14:49.%N -0500
> @@ -944,7 +944,8 @@ __alloc_pages(unsigned int __nocast gfp_
>         int can_try_harder;
>         int did_some_progress;
>         int alloctype;
> -
> +       int highorder_retry = 3;
> +
>         alloctype = (gfp_mask & __GFP_RCLM_BITS);
>         might_sleep_if(wait);
>
> @@ -1090,7 +1091,14 @@ rebalance:
>                                 goto got_pg;
>                 }
>
> -               out_of_memory(gfp_mask, order);
> +               if (order < MAX_ORDER/2) out_of_memory(gfp_mask, order);

Shouldn't that be written in two lines?

> +               /*
> +                * Due to low fragmentation efforts, we should try a little
> +                * harder to satisfy high order allocations
> +                */
> +               if (order >= MAX_ORDER/2 && --highorder_retry > 0)
> +                       goto rebalance;
> +
>                 goto restart;
>         }

--
Coywolf Qi Hunt
http://sosdg.org/~coywolf/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] add defrag flags
  2005-09-27  5:44       ` Paul Jackson
@ 2005-09-27 13:34         ` Mel Gorman
  2005-09-27 16:26           ` [Lhms-devel] " Paul Jackson
  2005-09-27 18:38         ` Joel Schopp
  1 sibling, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2005-09-27 13:34 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Dave Hansen, mrmacman_g4, jschopp, akpm, lhms-devel, linux-mm,
	linux-kernel, kravetz

On Mon, 26 Sep 2005, Paul Jackson wrote:

> Dave wrote:
> > I think Joel simply made an error in his description.
>
> Looks like he made the same mistake in the actual code comments:
>
> +/* Allocation type modifiers, group together if possible
> + * __GPF_USER: Allocation for user page or a buffer page
> + * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
> + */
> +#define __GFP_USER	0x40000u /* Kernel page that is easily reclaimable */
> +#define __GFP_KERNRCLM	0x80000u /* User is a userspace user */
>
> I'd guess you meant to write more like the following:
>
> #define __GFP_USER   0x40000u /* Page for user address space */
> #define __GFP_KERNRCLM 0x80000u /* Kernel page that is easily reclaimable */
>

yep

> And the block comment seems to needlessly repeat the inline comments,
> add a dubious claim, and omit the interesting stuff ...  In other words:
>
>     Does it actually matter if these two bits are grouped, or not?  I
>     suspect that some of your other code, such as shifting the gfpmask by
>     RCLM_SHIFT bits, _requires_ that these two bits be adjacent.  So the
>     "if possible" in the comment above is misleading.
>

The "if possible" must be misleading. The bits have to beside each other
as assumptions are made later in the code about this. The "group together"
comment refers to the patches that are allocated with gfp flags that
include __GFP_USER or __GFP_KERNNORCLM. Those pages should be "grouped
together if possible". The bits must be grouped that way.

>     And I suspect that gfp.h should contain the RCLM_SHIFT define, or
>     at least mention in comment that RCLM_SHIFT depends on the position
>     of the above two __GFP_* bits.
>
>     And I don't see any mention in the comments in gfp.h that these
>     two bits, in tandem, have an additional meaning - both bits off
>     means, I guess, not reclaimable, well at least not easily.
>
> My HARDWALL patch appears to already be in Linus's kernel, so you
> probably also need to do a global substitute of all instances in
> the kernel of __GFP_HARDWALL, replacing it with __GFP_USER.

I am not sure if that is a good idea as I will explain later.

> Here
> is the list of files I see affected, with a count of the number of
> __GFP_HARDWALL strings in each:
>
>     include/linux/gfp.h:4
>     kernel/cpuset.c:6
>     mm/page_alloc.c:2
>     mm/vmscan.c:4
>
> The comment in the next line looks like it needs to be changed to match
> the code change:
>
> +#define __GFP_BITS_SHIFT 21	/* Room for 20 __GFP_FOO bits */
>
> On the other hand, why did you change __GFP_BITS_SHIFT?  Isn't 20
> enough - just enough?
>

Yep, you're right, it is just enough.

> Why was the flag change in fs/buffer.c:grow_dev_page() to add the
> __GFP_USER bit, not to add the __GFP_KERNRCLM bit?

Because these are buffer pages that get reclaimed very quickly. The
KERNRCLM pages are generally slab pages. These can be reclaimed by reaping
certain slab pages, but it's a very hit-and-miss behavior. Trust me, the
whole scheme works better if buffer pages are treated as __GFP_USER pages,
not __GFP_KERNRCLM.
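
The corresponding hunk from patch 1 (attached later in the thread) makes the
choice concrete; the comment here is added for illustration and is not part
of the patch:

	/* buffer pages are synced and dropped much like user pages, so
	 * they are grouped with the easily-reclaimed allocations */
	page = find_or_create_page(inode->i_mapping, index,
				   GFP_NOFS | __GFP_USER);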

> Aha - I just read one of the comments above that I cut+pasted.
> It says that __GFP_USER means user *OR* buffer page.  That certainly
> explains the fs/buffer.c code using __GFP_USER.  But it causes me to
> wonder if we can equate __GFP_USER with __GFP_HARDWALL.

I don't think it should be.

> I'm reluctant,
> but more on principle than concrete experience, to modify the meaning
> of hardwall cpusets to constrain both user address space pages *AND*
> buffer pages.  How open would you be to making buffers __GFP_KERNRCLM
> instead of __GFP_USER?
>

Not very open at all. I would prefer to have an additional flag than do
that. The anti-fragmentation does not work anywhere near as well when
buffer pages are KERNRCLM pages. It's because there are large number of
pages that are easily reclaimable by cleaning the buffers and discarding
them. If they were mixed with slab pages, it would not be very effective
when we try to make a large allocation.

> If you have good reason to keep __GFP_USER meaning either user or buffer,
> then perhaps the name __GFP_USER is misleading.
>

Possibly but we are stuck for terminology here. It's hard to think of a
good term that reflects the intention.

> What sort of performance claims can you make for this change?

I don't have figures for this patchset. The figures I do have are for
another version that I'm currently trying to merge with Joels. In my own
set, there are no performance regressions or gains.

> How does
> it impact kernel text size?

Again, based on my own patchset but the figures should be essentially the
same as Joel's:

linux-2.6.13-clean/vmlinux
   text    data     bss     dec     hex filename
2992829  686212  212708 3891749  3b6225 linux-2.6.13-clean/vmlinux

linux-2.6.13-mbuddy-v14/vmlinux
   text    data     bss     dec     hex filename
2995335  687852  212708 3895895  3b7257 linux-2.6.13-mbuddy-v14/vmlinux

Is that what you are looking for?

> Could we see a diffstat for the entire
> patchset?

Don't have this at the moment

> Under what sort of loads or conditions would you expect
> this patchset to do more harm than good?
>

I cannot think of a case where it does more harm. At worst, it does not
help fragmentation. For that to happen, the system needs to be very
heavily loaded under heavy memory pressure for a long time with
RCLM_NORCLM pages being retained for very long periods of time even after
loads ease. In this case, fallbacks will eventually fragment memory.

A second case where it could hurt is in allocator scalability over a large
number of CPUs as there are now additional per-cpu lists. I am having
trouble thinking of a test case that would trigger this case though.
Someone used to dealing with large numbers of processors might be able to
make a suggestion.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/9] defrag helper functions
  2005-09-26 22:29   ` Alex Bligh - linux-kernel
@ 2005-09-27 16:08     ` Joel Schopp
  0 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-27 16:08 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel
  Cc: lhms, Linux Memory Management List, linux-kernel, Mel Gorman

>> +void assign_bit(int bit_nr, unsigned long* map, int value)
> 
> 
> Maybe:
> static inline void assign_bit(int bit_nr, unsigned long* map, int value)
> 
> it's short enough

OK.  It looks like I'll be sending these again based on the feedback I got,
I'll inline that in the next version.  I'd think with it being static that
the compiler would be smart enough to inline it anyway though.

> 
>>  +static struct page *
>> +fallback_alloc(int alloctype, struct zone *zone, unsigned int order)
>> +{
>> +       /* Stub out for seperate review, NULL equates to no fallback*/
>> +       return NULL;
>> +
>> +}
> 
> 
> Maybe "static inline" too.

Except this is only a placeholder for the next patch, where the function
is no longer short.  I'm going to keep it not inline.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 7/9] try harder on large allocations
  2005-09-27  7:21   ` Coywolf Qi Hunt
@ 2005-09-27 16:17     ` Joel Schopp
  0 siblings, 0 replies; 28+ messages in thread
From: Joel Schopp @ 2005-09-27 16:17 UTC (permalink / raw)
  To: Coywolf Qi Hunt
  Cc: lhms, Linux Memory Management List, linux-kernel, Mel Gorman

>>+               if (order < MAX_ORDER/2) out_of_memory(gfp_mask, order);
> 
> 
> Shouldn't that be written in two lines?

Yes, fixed.
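
For clarity, the reformatted version would presumably read (a sketch, not the
actual resend):

		if (order < MAX_ORDER/2)
			out_of_memory(gfp_mask, order);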


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 1/9] add defrag flags
  2005-09-27 13:34         ` Mel Gorman
@ 2005-09-27 16:26           ` Paul Jackson
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Jackson @ 2005-09-27 16:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: haveblue, mrmacman_g4, jschopp, akpm, lhms-devel, linux-mm,
	linux-kernel, kravetz

Mel wrote:
> > If you have good reason to keep __GFP_USER meaning either user or buffer,
> > then perhaps the name __GFP_USER is misleading.
> >
> 
> Possibly but we are stuck for terminology here. It's hard to think of a
> good term that reflects the intention.

You make several good points.  How about:
  * Rename __GFP_USER to __GFP_EASYRCLM
  * Shift the two __GFP_*RCLM flags up to 0x80000u and 0x100000u
  * Leave __GFP_BITS_SHIFT at the 21 in your patch (and fix its comment)
    (or should we go up to the next nibble, to 24?).

This results in the two key GFP defines being:

#define __GFP_EASYRCLM  0x80000u /* Easily reclaimed user or buffer page */
#define __GFP_KERNRCLM 0x100000u /* Reclaimable kernel page */

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] add defrag flags
  2005-09-27  5:44       ` Paul Jackson
  2005-09-27 13:34         ` Mel Gorman
@ 2005-09-27 18:38         ` Joel Schopp
  2005-09-27 19:30           ` Paul Jackson
  1 sibling, 1 reply; 28+ messages in thread
From: Joel Schopp @ 2005-09-27 18:38 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Dave Hansen, mrmacman_g4, akpm, lhms-devel, linux-mm,
	linux-kernel, mel, kravetz

[-- Attachment #1: Type: text/plain, Size: 2396 bytes --]

> Looks like he made the same mistake in the actual code comments:
> 
> +/* Allocation type modifiers, group together if possible
> + * __GPF_USER: Allocation for user page or a buffer page
> + * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
> + */
> +#define __GFP_USER	0x40000u /* Kernel page that is easily reclaimable */
> +#define __GFP_KERNRCLM	0x80000u /* User is a userspace user */
> 
> I'd guess you meant to write more like the following:
> 
> #define __GFP_USER   0x40000u /* Page for user address space */
> #define __GFP_KERNRCLM 0x80000u /* Kernel page that is easily reclaimable */

Yep.

> 
> And the block comment seems to needlessly repeat the inline comments,
> add a dubious claim, and omit the interesting stuff ...  In other words:
> 
>     Does it actually matter if these two bits are grouped, or not?  I
>     suspect that some of your other code, such as shifting the gfpmask by
>     RCLM_SHIFT bits, _requires_ that these two bits be adjacent.  So the
>     "if possible" in the comment above is misleading.
> 
>     And I suspect that gfp.h should contain the RCLM_SHIFT define, or
>     at least mention in comment that RCLM_SHIFT depends on the position
>     of the above two __GFP_* bits.

I'll add a comment here.

> 
>     And I don't see any mention in the comments in gfp.h that these
>     two bits, in tandem, have an additional meaning - both bits off
>     means, I guess, not reclaimable, well at least not easily.

Yep, adding comment now.

> 
> My HARDWALL patch appears to already be in Linus's kernel, so you
> probably also need to do a global substitute of all instances in
> the kernel of __GFP_HARDWALL, replacing it with __GFP_USER.  Here
> is the list of files I see affected, with a count of the number of
> __GFP_HARDWALL strings in each:
> 
>     include/linux/gfp.h:4
>     kernel/cpuset.c:6
>     mm/page_alloc.c:2
>     mm/vmscan.c:4

We may not be able to use the same flag after all due to our need to mark buffer 
pages as user.

> 
> The comment in the next line looks like it needs to be changed to match
> the code change:
> 
> +#define __GFP_BITS_SHIFT 21	/* Room for 20 __GFP_FOO bits */

Yep.

> 
> On the other hand, why did you change __GFP_BITS_SHIFT?  Isn't 20
> enough - just enough?

Yep.

Fixed patch attached.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

[-- Attachment #2: 1_add_defrag_flags --]
[-- Type: text/plain, Size: 5310 bytes --]

Index: 2.6.13-joel2/fs/buffer.c
===================================================================
--- 2.6.13-joel2.orig/fs/buffer.c	2005-09-13 14:54:13.%N -0500
+++ 2.6.13-joel2/fs/buffer.c	2005-09-13 15:02:01.%N -0500
@@ -1119,7 +1119,8 @@ grow_dev_page(struct block_device *bdev,
 	struct page *page;
 	struct buffer_head *bh;
 
-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, index,
+				   GFP_NOFS | __GFP_USER);
 	if (!page)
 		return NULL;
 
@@ -3044,7 +3045,8 @@ static void recalc_bh_state(void)
 	
 struct buffer_head *alloc_buffer_head(unsigned int __nocast gfp_flags)
 {
-	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+	struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+						   gfp_flags|__GFP_KERNRCLM);
 	if (ret) {
 		preempt_disable();
 		__get_cpu_var(bh_accounting).nr++;
Index: 2.6.13-joel2/fs/dcache.c
===================================================================
--- 2.6.13-joel2.orig/fs/dcache.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/dcache.c	2005-09-13 15:02:01.%N -0500
@@ -721,7 +721,7 @@ struct dentry *d_alloc(struct dentry * p
 	struct dentry *dentry;
 	char *dname;
 
-	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); 
+	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);
 	if (!dentry)
 		return NULL;
 
Index: 2.6.13-joel2/fs/ext2/super.c
===================================================================
--- 2.6.13-joel2.orig/fs/ext2/super.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ext2/super.c	2005-09-13 15:02:01.%N -0500
@@ -138,7 +138,8 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
 	struct ext2_inode_info *ei;
-	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+						SLAB_KERNEL|__GFP_KERNRCLM);
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
Index: 2.6.13-joel2/fs/ext3/super.c
===================================================================
--- 2.6.13-joel2.orig/fs/ext3/super.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ext3/super.c	2005-09-13 15:02:01.%N -0500
@@ -440,7 +440,7 @@ static struct inode *ext3_alloc_inode(st
 {
 	struct ext3_inode_info *ei;
 
-	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
Index: 2.6.13-joel2/fs/ntfs/inode.c
===================================================================
--- 2.6.13-joel2.orig/fs/ntfs/inode.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ntfs/inode.c	2005-09-13 15:05:53.%N -0500
@@ -317,7 +317,7 @@ struct inode *ntfs_alloc_big_inode(struc
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return VFS_I(ni);
@@ -342,7 +342,7 @@ static inline ntfs_inode *ntfs_alloc_ext
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return ni;
Index: 2.6.13-joel2/include/linux/gfp.h
===================================================================
--- 2.6.13-joel2.orig/include/linux/gfp.h	2005-09-13 14:54:17.%N -0500
+++ 2.6.13-joel2/include/linux/gfp.h	2005-09-27 12:53:13.%N -0500
@@ -41,6 +41,16 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
 #define __GFP_NORECLAIM  0x20000u /* No realy zone reclaim during allocation */
 
+/* Allocation type modifiers, these are required to be adjacent
+ * __GFP_USER: Allocation for user page or a buffer page
+ * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
+ * Both bits off: Kernel non-reclaimable or very hard to reclaim
+ * RCLM_SHIFT (defined elsewhere) depends on the location of these bits
+ */
+#define __GFP_USER	0x40000u /* User is a userspace user */
+#define __GFP_KERNRCLM	0x80000u /* Kernel page that is easily reclaimable */
+#define __GFP_RCLM_BITS (__GFP_USER|__GFP_KERNRCLM)
+
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
 
@@ -48,14 +58,15 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_NORECLAIM)
+			__GFP_NOMEMALLOC|__GFP_KERNRCLM|__GFP_USER)
 
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_NOIO	(__GFP_WAIT)
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_USER)
+#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | \
+			 __GFP_USER)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] add defrag flags
  2005-09-27 18:38         ` Joel Schopp
@ 2005-09-27 19:30           ` Paul Jackson
  2005-09-27 21:00             ` [Lhms-devel] " Joel Schopp
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2005-09-27 19:30 UTC (permalink / raw)
  To: Joel Schopp
  Cc: haveblue, mrmacman_g4, akpm, lhms-devel, linux-mm, linux-kernel,
	mel, kravetz

Joel wrote:
> We may not be able to use the same flag after all due to our need to mark buffer 
> pages as user.

Agreed - we have separate flags.  I want exactly user address space
pages.  You want really easy to reclaim pages.  You have good
performance justifications for your choice.  I have just "design
purity", so if for some reason there was a dire shortage of GFP bits,
I suspect it is I who should give, not you.

> > +#define __GFP_BITS_SHIFT 21	/* Room for 20 __GFP_FOO bits */
> 
> Yep.

Once this is merged with current Linux, which already has GFP_HARDWALL,
I presume you will be back up to 21 bits, code and comment.

As I noted in another message the "USER" and the comment in:

#define __GFP_USER	0x40000u /* User is a userspace user */

are a bit misleading now.  Perhaps GFP_EASYRCLM?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 1/9] add defrag flags
  2005-09-27 19:30           ` Paul Jackson
@ 2005-09-27 21:00             ` Joel Schopp
  2005-09-27 21:23               ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: Joel Schopp @ 2005-09-27 21:00 UTC (permalink / raw)
  To: Paul Jackson
  Cc: haveblue, mrmacman_g4, akpm, lhms-devel, linux-mm, linux-kernel,
	mel, kravetz

> Once this is merged with current Linux, which already has GFP_HARDWALL,
> I presume you will be back up to 21 bits, code and comment.

Looks like it.

> 
> As I noted in another message the "USER" and the comment in:
> 
> #define __GFP_USER	0x40000u /* User is a userspace user */
> 
> are a bit misleading now.  Perhaps GFP_EASYRCLM?
> 

A rose by any other name would smell as sweet -Romeo

A flag by any other name would work as well -Joel

There are problems with any name we would use.  I personally like __GFP_USER 
because it is mostly user memory, and nobody will accidentally use it to label 
something that is not user memory.  Those who do use it for non-user memory will 
do so with more caution and ridicule.  This will keep it from expanding in use 
beyond its intent.

If we name it __GFP_EASYRCLM we then start getting into questions about what we 
mean by easy, and somebody is going to decide that their kernel memory is pretty
easy to reclaim and mess things up.  Maybe we could call it
__GFP_REALLYREALLYEASYRCLM to avoid confusion.

If there is a consensus from multiple people for me to go rename the flag 
__GFP_xxxxx then I'm not that attached to it and will.  But for now I'm going to 
leave it __GFP_USER.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 1/9] add defrag flags
  2005-09-27 21:00             ` [Lhms-devel] " Joel Schopp
@ 2005-09-27 21:23               ` Paul Jackson
  2005-09-27 22:03                 ` Joel Schopp
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2005-09-27 21:23 UTC (permalink / raw)
  To: Joel Schopp
  Cc: haveblue, mrmacman_g4, akpm, lhms-devel, linux-mm, linux-kernel,
	mel, kravetz

> But for now I'm going to  leave it __GFP_USER.

Well, then, at least fix the comment, from the rather oddly phrased:

#define __GFP_USER	0x40000u /* User is a userspace user */

to something more accurate such as:

#define __GFP_USER	0x40000u /* User and other really easily reclaimed pages */

And consider adding a comment to its use in fs/buffer.c, where marking
a page obviously destined for kernel space __GFP_USER seems strange.
I doubt I will be the last person to look at the line of code and
scratch my head.

Nice clear simple names such as __GFP_USER (only a kernel hacker would
say that ;) should not be used if they are a flat out lie.  Better to
use some tongue twister acronym, such as

#define __GFP_RRE_RCLM 0x40000u /* Really Really Easy ReCLaiM (user, buffer) */

so that people don't think they know what something means when they don't.

And the one thing you could say that's useful in this name, that it has
something to do with the reclaim mechanism, is missing - no 'RCLM' in it.

Roses may smell sweet by other names, but kernel names for things do
matter.  Unlike classic flowers, we have an awful lot of colorless,
ordorless stuff in there that no one learns about in childhood (Linus's
child notwithstanding ;).  We desparately need names to tell the
essentials, and not lie.  __GFP_USER does neither.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 1/9] add defrag flags
  2005-09-27 21:23               ` Paul Jackson
@ 2005-09-27 22:03                 ` Joel Schopp
  2005-09-27 22:45                   ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: Joel Schopp @ 2005-09-27 22:03 UTC (permalink / raw)
  To: Paul Jackson
  Cc: haveblue, mrmacman_g4, akpm, lhms-devel, linux-mm, linux-kernel,
	mel, kravetz

[-- Attachment #1: Type: text/plain, Size: 822 bytes --]

> Well, then, at least fix the comment, from the rather oddly phrased:
> 
> #define __GFP_USER	0x40000u /* User is a userspace user */
> 
> to something more accurate such as:
> 
> #define __GFP_USER	0x40000u /* User and other really easily reclaimed pages */

This was a cleverly designed trick to push me over the 80-column-per-line limit.
I've seen through your ruse and added:

#define __GFP_USER     0x40000u /* User & other really easily reclaimed pages */

> 
> And consider adding a comment to its use in fs/buffer.c, where marking
> a page obviously destined for kernel space __GFP_USER seems strange.
> I doubt I will be the last person to look at the line of code and
> scratch my head.

Done.  Patch with the two updated comments attached.

I think all hairs have been split and we can merge this one now.

[-- Attachment #2: 1_add_defrag_flags --]
[-- Type: text/plain, Size: 5476 bytes --]

Index: 2.6.13-joel2/fs/buffer.c
===================================================================
--- 2.6.13-joel2.orig/fs/buffer.c	2005-09-13 14:54:13.%N -0500
+++ 2.6.13-joel2/fs/buffer.c	2005-09-27 16:52:05.%N -0500
@@ -1119,7 +1119,12 @@ grow_dev_page(struct block_device *bdev,
 	struct page *page;
 	struct buffer_head *bh;
 
-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	/*
+	 * Mark as __GFP_USER because from a fragmentation avoidance and
+	 * reclimation point of view this memory behaves like user memory.
+	 */
+	page = find_or_create_page(inode->i_mapping, index,
+				   GFP_NOFS | __GFP_USER);
 	if (!page)
 		return NULL;
 
@@ -3044,7 +3049,8 @@ static void recalc_bh_state(void)
 	
 struct buffer_head *alloc_buffer_head(unsigned int __nocast gfp_flags)
 {
-	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+	struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+						   gfp_flags|__GFP_KERNRCLM);
 	if (ret) {
 		preempt_disable();
 		__get_cpu_var(bh_accounting).nr++;
Index: 2.6.13-joel2/fs/dcache.c
===================================================================
--- 2.6.13-joel2.orig/fs/dcache.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/dcache.c	2005-09-13 15:02:01.%N -0500
@@ -721,7 +721,7 @@ struct dentry *d_alloc(struct dentry * p
 	struct dentry *dentry;
 	char *dname;
 
-	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); 
+	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);
 	if (!dentry)
 		return NULL;
 
Index: 2.6.13-joel2/fs/ext2/super.c
===================================================================
--- 2.6.13-joel2.orig/fs/ext2/super.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ext2/super.c	2005-09-13 15:02:01.%N -0500
@@ -138,7 +138,8 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
 	struct ext2_inode_info *ei;
-	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+						SLAB_KERNEL|__GFP_KERNRCLM);
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
Index: 2.6.13-joel2/fs/ext3/super.c
===================================================================
--- 2.6.13-joel2.orig/fs/ext3/super.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ext3/super.c	2005-09-13 15:02:01.%N -0500
@@ -440,7 +440,7 @@ static struct inode *ext3_alloc_inode(st
 {
 	struct ext3_inode_info *ei;
 
-	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
Index: 2.6.13-joel2/fs/ntfs/inode.c
===================================================================
--- 2.6.13-joel2.orig/fs/ntfs/inode.c	2005-09-13 14:54:14.%N -0500
+++ 2.6.13-joel2/fs/ntfs/inode.c	2005-09-13 15:05:53.%N -0500
@@ -317,7 +317,7 @@ struct inode *ntfs_alloc_big_inode(struc
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return VFS_I(ni);
@@ -342,7 +342,7 @@ static inline ntfs_inode *ntfs_alloc_ext
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return ni;
Index: 2.6.13-joel2/include/linux/gfp.h
===================================================================
--- 2.6.13-joel2.orig/include/linux/gfp.h	2005-09-13 14:54:17.%N -0500
+++ 2.6.13-joel2/include/linux/gfp.h	2005-09-27 16:40:55.%N -0500
@@ -41,6 +41,16 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
 #define __GFP_NORECLAIM  0x20000u /* No realy zone reclaim during allocation */
 
+/* Allocation type modifiers, these are required to be adjacent
+ * __GFP_USER: Allocation for user page or a buffer page
+ * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
+ * Both bits off: Kernel non-reclaimable or very hard to reclaim
+ * RCLM_SHIFT (defined elsewhere) depends on the location of these bits
+ */
+#define __GFP_USER	0x40000u /* User & other really easily reclaimed pages */
+#define __GFP_KERNRCLM	0x80000u /* Kernel page that is easily reclaimable */
+#define __GFP_RCLM_BITS (__GFP_USER|__GFP_KERNRCLM)
+
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
 
@@ -48,14 +58,15 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_NORECLAIM)
+			__GFP_NOMEMALLOC|__GFP_KERNRCLM|__GFP_USER)
 
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_NOIO	(__GFP_WAIT)
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_USER)
+#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | \
+			 __GFP_USER)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 1/9] add defrag flags
  2005-09-27 22:03                 ` Joel Schopp
@ 2005-09-27 22:45                   ` Paul Jackson
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Jackson @ 2005-09-27 22:45 UTC (permalink / raw)
  To: Joel Schopp
  Cc: haveblue, mrmacman_g4, akpm, lhms-devel, linux-mm, linux-kernel,
	mel, kravetz

+	 * Mark as __GFP_USER because from a fragmentation avoidance and
+	 * reclimation point of view this memory behaves like user memory.

You misspelled reclamation.

(Nice comment - I had to bitch about something ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2005-09-27 22:45 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-09-26 20:01 [PATCH 0/9] fragmentation avoidance Joel Schopp
2005-09-26 20:03 ` [PATCH 1/9] add defrag flags Joel Schopp
2005-09-27  0:16   ` Kyle Moffett
2005-09-27  0:24     ` Dave Hansen
2005-09-27  0:43       ` Kyle Moffett
2005-09-27  5:44       ` Paul Jackson
2005-09-27 13:34         ` Mel Gorman
2005-09-27 16:26           ` [Lhms-devel] " Paul Jackson
2005-09-27 18:38         ` Joel Schopp
2005-09-27 19:30           ` Paul Jackson
2005-09-27 21:00             ` [Lhms-devel] " Joel Schopp
2005-09-27 21:23               ` Paul Jackson
2005-09-27 22:03                 ` Joel Schopp
2005-09-27 22:45                   ` Paul Jackson
2005-09-26 20:05 ` [PATCH 2/9] declare defrag structs Joel Schopp
2005-09-26 20:06 ` [PATCH 3/9] initialize defrag Joel Schopp
2005-09-26 20:09 ` [PATCH 4/9] defrag helper functions Joel Schopp
2005-09-26 22:29   ` Alex Bligh - linux-kernel
2005-09-27 16:08     ` Joel Schopp
2005-09-26 20:11 ` [PATCH 5/9] propagate defrag alloc types Joel Schopp
2005-09-26 20:13 ` [PATCH 6/9] fragmentation avoidance core Joel Schopp
2005-09-26 20:14 ` [PATCH 7/9] try harder on large allocations Joel Schopp
2005-09-27  7:21   ` Coywolf Qi Hunt
2005-09-27 16:17     ` Joel Schopp
2005-09-26 20:16 ` [PATCH 8/9] defrag fallback Joel Schopp
2005-09-26 20:17 ` [PATCH 9/9] free memory is user reclaimable Joel Schopp
2005-09-26 20:19 ` [PATCH 10/9] percpu splitout Joel Schopp
2005-09-26 21:49 ` [Lhms-devel] [PATCH 0/9] fragmentation avoidance Joel Schopp

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox