* [patch 00/23] Slab defragmentation V6
@ 2007-11-07 1:11 Christoph Lameter
2007-11-07 1:11 ` [patch 01/23] SLUB: Move count_partial() Christoph Lameter
` (23 more replies)
0 siblings, 24 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
Slab defragmentation is mainly an issue if Linux is used as a fileserver
and large numbers of dentries, inodes and buffer heads accumulate. In some
load situations the slabs become very sparsely populated, so that a lot of
memory is wasted by slabs that contain only one or a few objects. In
extreme cases the performance of a machine becomes sluggish because
we are continually running reclaim. Slab defragmentation adds the
capability to recover that wasted memory.
With lumpy reclaim, slab defragmentation can be used to enhance the
ability to recover larger contiguous areas of memory. Lumpy reclaim currently
cannot do anything if a slab page is encountered. With slab defragmentation
that slab page can be removed and a large contiguous area freed. It may
also become possible to make slab pages part of ZONE_MOVABLE (Mel's defrag
scheme in 2.6.23) or the MOVABLE areas (antifrag patches in mm).
The patchset is also available via git
git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git defrag
Currently memory reclaim from the following slab caches is possible:
1. dentry cache
2. inode cache (with a generic interface to allow easy setup of more
filesystems than the currently supported ext2/3/4, reiserfs, XFS
and proc)
3. buffer_heads
One typical mechanism that triggers slab defragmentation on my systems
is the daily run of
updatedb
Updatedb scans all files on the system, which causes heavy inode and dentry
use. After updatedb is complete we go back to the regular use
patterns (on my machine typically kernel compiles), which need the memory
for different purposes. The inodes and dentries used by updatedb are
gradually aged by the dentry/inode reclaim algorithm, which frees
the dentries and inode entries randomly throughout the slabs that were
allocated for them. As a result the slabs become sparsely populated. Slabs
that become completely empty can be freed, but a lot of them remain sparsely
populated. That is where slab defrag comes in: it removes the slabs with
just a few entries, reclaiming more memory for other uses.
V5->V6
- Rediff against 2.6.24-rc2 + mm slub patches.
- Add Reviewed-by lines.
- Take out the experimental code to make slab pages movable. That
has to wait until this has been considered by Mel.
V4->V5:
- Support lumpy reclaim for slabs
- Support reclaim via shrink_slab()
- Add constructors to ensure a consistent object state at all times.
V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
improved by slab defragmentation.
V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
have slabs with a high degree of fragmentation.
V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
a new dentry flag that indicates that a dentry is not in the process
of being freed or allocated.
--
* [patch 01/23] SLUB: Move count_partial()
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio Christoph Lameter
` (22 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0002-slab_defrag_move_count_partial.patch --]
[-- Type: text/plain, Size: 1620 bytes --]
Move the counting function for objects in partial slabs so that it is placed
before kmem_cache_shrink(). We will need it to establish the
fragmentation ratio of the per-node slab lists.
[This patch is already in mm]
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slub.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 12:34:13.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 12:35:37.000000000 -0800
@@ -2758,6 +2758,19 @@ void kfree(const void *x)
}
EXPORT_SYMBOL(kfree);
+static unsigned long count_partial(struct kmem_cache_node *n)
+{
+ unsigned long flags;
+ unsigned long x = 0;
+ struct page *page;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry(page, &n->partial, lru)
+ x += page->inuse;
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return x;
+}
+
/*
* kmem_cache_shrink removes empty slabs from the partial lists and sorts
* the remaining slabs by the number of items in use. The slabs with the
@@ -3615,19 +3628,6 @@ static int list_locations(struct kmem_ca
return n;
}
-static unsigned long count_partial(struct kmem_cache_node *n)
-{
- unsigned long flags;
- unsigned long x = 0;
- struct page *page;
-
- spin_lock_irqsave(&n->list_lock, flags);
- list_for_each_entry(page, &n->partial, lru)
- x += page->inuse;
- spin_unlock_irqrestore(&n->list_lock, flags);
- return x;
-}
-
enum slab_stat_type {
SL_FULL,
SL_PARTIAL,
--
* [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
2007-11-07 1:11 ` [patch 01/23] SLUB: Move count_partial() Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-08 14:50 ` Mel Gorman
2007-11-07 1:11 ` [patch 03/23] bufferhead: Revert constructor removal Christoph Lameter
` (21 subsequent siblings)
23 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0003-slab_defrag_remote_node_defrag_ratio.patch --]
[-- Type: text/plain, Size: 2881 bytes --]
We need the defrag ratio for the non-NUMA case now. NUMA defrag works
by allocating objects from partial slabs on remote nodes. Rename the field to
remote_node_defrag_ratio
to make this clear.
[This patch is already in mm]
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 5 ++++-
mm/slub.c | 17 +++++++++--------
2 files changed, 13 insertions(+), 9 deletions(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-11-06 12:34:13.000000000 -0800
+++ linux-2.6/include/linux/slub_def.h 2007-11-06 12:36:28.000000000 -0800
@@ -60,7 +60,10 @@ struct kmem_cache {
#endif
#ifdef CONFIG_NUMA
- int defrag_ratio;
+ /*
+ * Defragmentation by allocating from a remote node.
+ */
+ int remote_node_defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
#endif
#ifdef CONFIG_SMP
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 12:36:16.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 12:37:25.000000000 -0800
@@ -1345,7 +1345,8 @@ static unsigned long get_any_partial(str
* expensive if we do it every time we are trying to find a slab
* with available objects.
*/
- if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
+ if (!s->remote_node_defrag_ratio ||
+ get_cycles() % 1024 > s->remote_node_defrag_ratio)
return 0;
zonelist = &NODE_DATA(slab_node(current->mempolicy))
@@ -2363,7 +2364,7 @@ static int kmem_cache_open(struct kmem_c
s->refcount = 1;
#ifdef CONFIG_NUMA
- s->defrag_ratio = 100;
+ s->remote_node_defrag_ratio = 100;
#endif
if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
goto error;
@@ -4005,21 +4006,21 @@ static ssize_t free_calls_show(struct km
SLAB_ATTR_RO(free_calls);
#ifdef CONFIG_NUMA
-static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
{
- return sprintf(buf, "%d\n", s->defrag_ratio / 10);
+ return sprintf(buf, "%d\n", s->remote_node_defrag_ratio / 10);
}
-static ssize_t defrag_ratio_store(struct kmem_cache *s,
+static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
const char *buf, size_t length)
{
int n = simple_strtoul(buf, NULL, 10);
if (n < 100)
- s->defrag_ratio = n * 10;
+ s->remote_node_defrag_ratio = n * 10;
return length;
}
-SLAB_ATTR(defrag_ratio);
+SLAB_ATTR(remote_node_defrag_ratio);
#endif
static struct attribute * slab_attrs[] = {
@@ -4050,7 +4051,7 @@ static struct attribute * slab_attrs[] =
&cache_dma_attr.attr,
#endif
#ifdef CONFIG_NUMA
- &defrag_ratio_attr.attr,
+ &remote_node_defrag_ratio_attr.attr,
#endif
NULL
};
--
* [patch 03/23] bufferhead: Revert constructor removal
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
2007-11-07 1:11 ` [patch 01/23] SLUB: Move count_partial() Christoph Lameter
2007-11-07 1:11 ` [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 04/23] dentries: Extract common code to remove dentry from lru Christoph Lameter
` (20 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0015-slab_defrag_buffer_head_revert.patch --]
[-- Type: text/plain, Size: 1751 bytes --]
The constructor for buffer_head slabs was removed recently. We need
the constructor in order to ensure that slab objects always have a definite
state, even before they are allocated.
[This patch is already in mm]
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/buffer.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2007-10-25 18:28:40.000000000 -0700
+++ linux-2.6/fs/buffer.c 2007-11-06 12:55:45.000000000 -0800
@@ -3169,10 +3169,9 @@ static void recalc_bh_state(void)
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
- struct buffer_head *ret = kmem_cache_zalloc(bh_cachep,
+ struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
set_migrateflags(gfp_flags, __GFP_RECLAIMABLE));
if (ret) {
- INIT_LIST_HEAD(&ret->b_assoc_buffers);
get_cpu_var(bh_accounting).nr++;
recalc_bh_state();
put_cpu_var(bh_accounting);
@@ -3213,12 +3212,24 @@ static int buffer_cpu_notify(struct noti
return NOTIFY_OK;
}
+static void
+init_buffer_head(void *data, struct kmem_cache *cachep, unsigned long flags)
+{
+ struct buffer_head * bh = (struct buffer_head *)data;
+
+ memset(bh, 0, sizeof(*bh));
+ INIT_LIST_HEAD(&bh->b_assoc_buffers);
+}
+
void __init buffer_init(void)
{
int nrpages;
- bh_cachep = KMEM_CACHE(buffer_head,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+ bh_cachep = kmem_cache_create("buffer_head",
+ sizeof(struct buffer_head), 0,
+ (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
+ SLAB_MEM_SPREAD),
+ init_buffer_head);
/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
--
* [patch 04/23] dentries: Extract common code to remove dentry from lru
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (2 preceding siblings ...)
2007-11-07 1:11 ` [patch 03/23] bufferhead: Revert constructor removal Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 8:50 ` Johannes Weiner
2007-11-07 1:11 ` [patch 05/23] VM: Allow get_page_unless_zero on compound pages Christoph Lameter
` (19 subsequent siblings)
23 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0023-slab_defrag_dentry_remove_lru.patch --]
[-- Type: text/plain, Size: 3281 bytes --]
Extract the common code to remove a dentry from the LRU into a new function
dentry_lru_remove().
Two call sites used list_del() instead of list_del_init(). AFAIK the
performance of both is the same. dentry_lru_remove() does a list_del_init().
As a result dentry->d_lru is now always empty when a dentry is freed.
Such a consistent state is useful for slab defrag when it has to establish
the state of a dentry.
[This patch is already in mm]
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/dcache.c | 42 ++++++++++++++----------------------------
1 file changed, 14 insertions(+), 28 deletions(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2007-10-25 18:28:40.000000000 -0700
+++ linux-2.6/fs/dcache.c 2007-11-06 12:56:31.000000000 -0800
@@ -95,6 +95,14 @@ static void d_free(struct dentry *dentry
call_rcu(&dentry->d_u.d_rcu, d_callback);
}
+static void dentry_lru_remove(struct dentry *dentry)
+{
+ if (!list_empty(&dentry->d_lru)) {
+ list_del_init(&dentry->d_lru);
+ dentry_stat.nr_unused--;
+ }
+}
+
/*
* Release the dentry's inode, using the filesystem
* d_iput() operation if defined.
@@ -211,13 +219,7 @@ repeat:
unhash_it:
__d_drop(dentry);
kill_it:
- /* If dentry was on d_lru list
- * delete it from there
- */
- if (!list_empty(&dentry->d_lru)) {
- list_del(&dentry->d_lru);
- dentry_stat.nr_unused--;
- }
+ dentry_lru_remove(dentry);
dentry = d_kill(dentry);
if (dentry)
goto repeat;
@@ -285,10 +287,7 @@ int d_invalidate(struct dentry * dentry)
static inline struct dentry * __dget_locked(struct dentry *dentry)
{
atomic_inc(&dentry->d_count);
- if (!list_empty(&dentry->d_lru)) {
- dentry_stat.nr_unused--;
- list_del_init(&dentry->d_lru);
- }
+ dentry_lru_remove(dentry);
return dentry;
}
@@ -404,10 +403,7 @@ static void prune_one_dentry(struct dent
if (dentry->d_op && dentry->d_op->d_delete)
dentry->d_op->d_delete(dentry);
- if (!list_empty(&dentry->d_lru)) {
- list_del(&dentry->d_lru);
- dentry_stat.nr_unused--;
- }
+ dentry_lru_remove(dentry);
__d_drop(dentry);
dentry = d_kill(dentry);
spin_lock(&dcache_lock);
@@ -596,10 +592,7 @@ static void shrink_dcache_for_umount_sub
/* detach this root from the system */
spin_lock(&dcache_lock);
- if (!list_empty(&dentry->d_lru)) {
- dentry_stat.nr_unused--;
- list_del_init(&dentry->d_lru);
- }
+ dentry_lru_remove(dentry);
__d_drop(dentry);
spin_unlock(&dcache_lock);
@@ -613,11 +606,7 @@ static void shrink_dcache_for_umount_sub
spin_lock(&dcache_lock);
list_for_each_entry(loop, &dentry->d_subdirs,
d_u.d_child) {
- if (!list_empty(&loop->d_lru)) {
- dentry_stat.nr_unused--;
- list_del_init(&loop->d_lru);
- }
-
+ dentry_lru_remove(loop);
__d_drop(loop);
cond_resched_lock(&dcache_lock);
}
@@ -799,10 +788,7 @@ resume:
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- if (!list_empty(&dentry->d_lru)) {
- dentry_stat.nr_unused--;
- list_del_init(&dentry->d_lru);
- }
+ dentry_lru_remove(dentry);
/*
* move only zero ref count dentries to the end
* of the unused list for prune_dcache
--
* [patch 05/23] VM: Allow get_page_unless_zero on compound pages
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (3 preceding siblings ...)
2007-11-07 1:11 ` [patch 04/23] dentries: Extract common code to remove dentry from lru Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 06/23] SLUB: Extend slabinfo to support -D and -C options Christoph Lameter
` (18 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0011-slab_defrag_get_page_unless.patch --]
[-- Type: text/plain, Size: 955 bytes --]
SLUB uses compound pages for larger slabs. We need to be able to increment
the page count of these pages to make sure that they are not
freed under us while other operations on the slab
independently remove objects.
[This patch is already in mm]
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.23-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.23-mm1.orig/include/linux/mm.h 2007-10-12 18:15:57.000000000 -0700
+++ linux-2.6.23-mm1/include/linux/mm.h 2007-10-12 18:17:30.000000000 -0700
@@ -227,7 +227,7 @@ static inline int put_page_testzero(stru
*/
static inline int get_page_unless_zero(struct page *page)
{
- VM_BUG_ON(PageCompound(page));
+ VM_BUG_ON(PageTail(page));
return atomic_inc_not_zero(&page->_count);
}
--
* [patch 06/23] SLUB: Extend slabinfo to support -D and -C options
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (4 preceding siblings ...)
2007-11-07 1:11 ` [patch 05/23] VM: Allow get_page_unless_zero on compound pages Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-08 15:00 ` Mel Gorman
2007-11-07 1:11 ` [patch 07/23] SLUB: Add defrag_ratio field and sysfs support Christoph Lameter
` (17 subsequent siblings)
23 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0001-slab_defrag_slabinfo_update.patch --]
[-- Type: text/plain, Size: 6014 bytes --]
-D lists caches that support defragmentation.
-C lists caches that use a ctor.
Change the field names for defrag_ratio and remote_node_defrag_ratio.
Also determine the allocation ratio for a slab cache. The allocation ratio
is the percentage of available object slots that are in use.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
Documentation/vm/slabinfo.c | 52 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 44 insertions(+), 8 deletions(-)
Index: linux-2.6.23-mm1/Documentation/vm/slabinfo.c
===================================================================
--- linux-2.6.23-mm1.orig/Documentation/vm/slabinfo.c 2007-10-12 16:25:54.000000000 -0700
+++ linux-2.6.23-mm1/Documentation/vm/slabinfo.c 2007-10-12 17:58:14.000000000 -0700
@@ -31,6 +31,8 @@ struct slabinfo {
int hwcache_align, object_size, objs_per_slab;
int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+ int defrag, ctor;
+ int defrag_ratio, remote_node_defrag_ratio;
unsigned long partial, objects, slabs;
int numa[MAX_NODES];
int numa_partial[MAX_NODES];
@@ -57,6 +59,8 @@ int show_slab = 0;
int skip_zero = 1;
int show_numa = 0;
int show_track = 0;
+int show_defrag = 0;
+int show_ctor = 0;
int show_first_alias = 0;
int validate = 0;
int shrink = 0;
@@ -91,18 +95,20 @@ void fatal(const char *x, ...)
void usage(void)
{
printf("slabinfo 5/7/2007. (c) 2007 sgi. clameter@sgi.com\n\n"
- "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+ "slabinfo [-aCDefhilnosSrtTvz1] [-d debugopts] [slab-regexp]\n"
"-a|--aliases Show aliases\n"
+ "-C|--ctor Show slabs with ctors\n"
"-d<options>|--debug=<options> Set/Clear Debug options\n"
- "-e|--empty Show empty slabs\n"
+ "-D|--defrag Show defragmentable caches\n"
+ "-e|--empty Show empty slabs\n"
"-f|--first-alias Show first alias\n"
"-h|--help Show usage information\n"
"-i|--inverted Inverted list\n"
"-l|--slabs Show slabs\n"
"-n|--numa Show NUMA information\n"
- "-o|--ops Show kmem_cache_ops\n"
+ "-o|--ops Show kmem_cache_ops\n"
"-s|--shrink Shrink slabs\n"
- "-r|--report Detailed report on single slabs\n"
+ "-r|--report Detailed report on single slabs\n"
"-S|--Size Sort by size\n"
"-t|--tracking Show alloc/free information\n"
"-T|--Totals Show summary information\n"
@@ -282,7 +288,7 @@ int line = 0;
void first_line(void)
{
printf("Name Objects Objsize Space "
- "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
+ "Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n");
}
/*
@@ -325,7 +331,7 @@ void slab_numa(struct slabinfo *s, int m
return;
if (!line) {
- printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+ printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
printf(" %4d", node);
printf("\n----------------------");
@@ -334,6 +340,7 @@ void slab_numa(struct slabinfo *s, int m
printf("\n");
}
printf("%-21s ", mode ? "All slabs" : s->name);
+ printf("%3d ", s->remote_node_defrag_ratio);
for(node = 0; node <= highest_node; node++) {
char b[20];
@@ -407,6 +414,8 @@ void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+ if (s->defrag)
+ printf("** Defragmentation at %d%%\n", s->defrag_ratio);
printf("\nSizes (bytes) Slabs Debug Memory\n");
printf("------------------------------------------------------------------------\n");
@@ -453,6 +462,12 @@ void slabcache(struct slabinfo *s)
if (show_empty && s->slabs)
return;
+ if (show_defrag && !s->defrag)
+ return;
+
+ if (show_ctor && !s->ctor)
+ return;
+
store_size(size_str, slab_size(s));
snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs, s->partial, s->cpu_slabs);
@@ -463,6 +478,10 @@ void slabcache(struct slabinfo *s)
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+ if (s->defrag)
+ *p++ = 'D';
+ if (s->ctor)
+ *p++ = 'C';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -482,7 +501,7 @@ void slabcache(struct slabinfo *s)
printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
- s->slabs ? (s->partial * 100) / s->slabs : 100,
+ s->slabs ? (s->objects * 100) / (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1074,7 +1093,16 @@ void read_slab_dir(void)
free(t);
slab->store_user = get_obj("store_user");
slab->trace = get_obj("trace");
+ slab->defrag_ratio = get_obj("defrag_ratio");
+ slab->remote_node_defrag_ratio =
+ get_obj("remote_node_defrag_ratio");
chdir("..");
+ if (read_slab_obj(slab, "ops")) {
+ if (strstr(buffer, "ctor :"))
+ slab->ctor = 1;
+ if (strstr(buffer, "kick :"))
+ slab->defrag = 1;
+ }
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1124,7 +1152,9 @@ void output_slabs(void)
struct option opts[] = {
{ "aliases", 0, NULL, 'a' },
+ { "ctor", 0, NULL, 'C' },
{ "debug", 2, NULL, 'd' },
+ { "defrag", 0, NULL, 'D' },
{ "empty", 0, NULL, 'e' },
{ "first-alias", 0, NULL, 'f' },
{ "help", 0, NULL, 'h' },
@@ -1149,7 +1179,7 @@ int main(int argc, char *argv[])
page_size = getpagesize();
- while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzTS",
+ while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzCDTS",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1199,6 +1229,12 @@ int main(int argc, char *argv[])
case 'z':
skip_zero = 0;
break;
+ case 'C':
+ show_ctor = 1;
+ break;
+ case 'D':
+ show_defrag = 1;
+ break;
case 'T':
show_totals = 1;
break;
--
* [patch 07/23] SLUB: Add defrag_ratio field and sysfs support.
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (5 preceding siblings ...)
2007-11-07 1:11 ` [patch 06/23] SLUB: Extend slabinfo to support -D and -C options Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 8:55 ` Johannes Weiner
2007-11-08 15:07 ` Mel Gorman
2007-11-07 1:11 ` [patch 08/23] SLUB: Replace ctor field with ops field in /sys/slab/*/ Christoph Lameter
` (16 subsequent siblings)
23 siblings, 2 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0004-slab_defrag_add_defrag_ratio.patch --]
[-- Type: text/plain, Size: 2585 bytes --]
The defrag_ratio is used to set the threshold at which defragmentation
should be run on a slab cache.
The allocation ratio is measured as the percentage of available object slots
that are in use. The percentage will be lower for slab caches that are more
fragmented.
Add a defrag_ratio field and set it to 30% by default. A limit of 30% means
that reclaim starts once fewer than 3 out of 10 available object slots are in
use. For example, a cache whose slabs could hold 1000 objects but currently
holds only 250 has an allocation ratio of 25% and is eligible for
defragmentation.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 7 +++++++
mm/slub.c | 18 ++++++++++++++++++
2 files changed, 25 insertions(+)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-11-06 12:36:28.000000000 -0800
+++ linux-2.6/include/linux/slub_def.h 2007-11-06 12:37:44.000000000 -0800
@@ -53,6 +53,13 @@ struct kmem_cache {
void (*ctor)(struct kmem_cache *, void *);
int inuse; /* Offset to metadata */
int align; /* Alignment */
+ int defrag_ratio; /*
+ * objects/possible-objects limit. If we have
+ * less than the specified percentage of
+ * objects allocated then defrag passes
+ * will start to occur during reclaim.
+ */
+
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
#ifdef CONFIG_SLUB_DEBUG
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 12:37:25.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 12:37:44.000000000 -0800
@@ -2363,6 +2363,7 @@ static int kmem_cache_open(struct kmem_c
goto error;
s->refcount = 1;
+ s->defrag_ratio = 30;
#ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 100;
#endif
@@ -4005,6 +4006,22 @@ static ssize_t free_calls_show(struct km
}
SLAB_ATTR_RO(free_calls);
+static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->defrag_ratio);
+}
+
+static ssize_t defrag_ratio_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ int n = simple_strtoul(buf, NULL, 10);
+
+ if (n < 100)
+ s->defrag_ratio = n;
+ return length;
+}
+SLAB_ATTR(defrag_ratio);
+
#ifdef CONFIG_NUMA
static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
{
@@ -4047,6 +4064,7 @@ static struct attribute * slab_attrs[] =
&shrink_attr.attr,
&alloc_calls_attr.attr,
&free_calls_attr.attr,
+ &defrag_ratio_attr.attr,
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
--
* [patch 08/23] SLUB: Replace ctor field with ops field in /sys/slab/*/
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (6 preceding siblings ...)
2007-11-07 1:11 ` [patch 07/23] SLUB: Add defrag_ratio field and sysfs support Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 09/23] SLUB: Add get() and kick() methods Christoph Lameter
` (15 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0005-slab_defrag_ops_field.patch --]
[-- Type: text/plain, Size: 1440 bytes --]
Create an ops field in /sys/slab/*/ops to contain all the operations defined
on a slab. This will be used to display the additional operations that will
be defined soon.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slub.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 12:37:44.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 12:37:47.000000000 -0800
@@ -3789,16 +3789,18 @@ static ssize_t order_show(struct kmem_ca
}
SLAB_ATTR_RO(order);
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
{
- if (s->ctor) {
- int n = sprint_symbol(buf, (unsigned long)s->ctor);
+ int x = 0;
- return n + sprintf(buf + n, "\n");
+ if (s->ctor) {
+ x += sprintf(buf + x, "ctor : ");
+ x += sprint_symbol(buf + x, (unsigned long)s->ops->ctor);
+ x += sprintf(buf + x, "\n");
}
- return 0;
+ return x;
}
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
static ssize_t aliases_show(struct kmem_cache *s, char *buf)
{
@@ -4049,7 +4051,7 @@ static struct attribute * slab_attrs[] =
&slabs_attr.attr,
&partial_attr.attr,
&cpu_slabs_attr.attr,
- &ctor_attr.attr,
+ &ops_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
&sanity_checks_attr.attr,
--
* [patch 09/23] SLUB: Add get() and kick() methods
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (7 preceding siblings ...)
2007-11-07 1:11 ` [patch 08/23] SLUB: Replace ctor field with ops field in /sys/slab/*/ Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 2:37 ` Adrian Bunk
2007-11-07 1:11 ` [patch 10/23] SLUB: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
` (14 subsequent siblings)
23 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0006-slab_defrag_get_and_kick_method.patch --]
[-- Type: text/plain, Size: 4511 bytes --]
Add the two methods needed for defragmentation and display them
via the ops file in sysfs.
Add documentation explaining the use of these methods.
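As an illustration, here is a minimal sketch of how a hypothetical cache would
register the two callbacks. The cache, struct and function names below are made
up for this example and are not part of the patch; the one hard requirement it
shows is that a defragmentable cache must have a constructor, otherwise
kmem_cache_setup_defrag() triggers its BUG_ON(!s->ctor):

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/string.h>

struct example_object {
	int in_use;
};

static struct kmem_cache *example_cachep;

/* The constructor gives objects a definite state right after allocation */
static void example_ctor(struct kmem_cache *s, void *object)
{
	memset(object, 0, sizeof(struct example_object));
}

/* Called under the slab lock with interrupts off: only take references */
static void *example_get(struct kmem_cache *s, int nr, void **v)
{
	return NULL;
}

/* Called with no locks held: evict or reallocate the referenced objects */
static void example_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
}

static int __init example_cache_init(void)
{
	example_cachep = kmem_cache_create("example_object",
				sizeof(struct example_object), 0,
				SLAB_RECLAIM_ACCOUNT, example_ctor);
	kmem_cache_setup_defrag(example_cachep, example_get, example_kick);
	return 0;
}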
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slab.h | 3 +++
include/linux/slub_def.h | 31 +++++++++++++++++++++++++++++++
mm/slub.c | 32 ++++++++++++++++++++++++++++++--
3 files changed, 64 insertions(+), 2 deletions(-)
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2007-10-17 13:35:53.000000000 -0700
+++ linux-2.6/include/linux/slab.h 2007-11-06 12:37:51.000000000 -0800
@@ -56,6 +56,9 @@ struct kmem_cache *kmem_cache_create(con
void (*)(struct kmem_cache *, void *));
void kmem_cache_destroy(struct kmem_cache *);
int kmem_cache_shrink(struct kmem_cache *);
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+ void *(*get)(struct kmem_cache *, int nr, void **),
+ void (*kick)(struct kmem_cache *, int nr, void **, void *private));
void kmem_cache_free(struct kmem_cache *, void *);
unsigned int kmem_cache_size(struct kmem_cache *);
const char *kmem_cache_name(struct kmem_cache *);
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-11-06 12:37:44.000000000 -0800
+++ linux-2.6/include/linux/slub_def.h 2007-11-06 12:37:51.000000000 -0800
@@ -51,6 +51,37 @@ struct kmem_cache {
int objects; /* Number of objects in slab */
int refcount; /* Refcount for slab cache destroy */
void (*ctor)(struct kmem_cache *, void *);
+ /*
+ * Called with slab lock held and interrupts disabled.
+ * No slab operation may be performed in get().
+ *
+ * Parameters passed are the number of objects to process
+ * and an array of pointers to objects for which we
+ * need references.
+ *
+ * Returns a pointer that is passed to the kick function.
+ * If references cannot be obtained for all objects then the pointer
+ * may indicate that this won't work and kick() can then simply
+ * drop the references that were already obtained.
+ *
+ * The array passed to get() is also passed to kick(). The
+ * function may remove objects by setting array elements to NULL.
+ */
+ void *(*get)(struct kmem_cache *, int nr, void **);
+
+ /*
+ * Called with no locks held and interrupts enabled.
+ * Any operation may be performed in kick().
+ *
+ * Parameters passed are the number of objects in the array,
+ * the array of pointers to the objects and the pointer
+ * returned by get().
+ *
+ * Success is checked by examining the number of remaining
+ * objects in the slab.
+ */
+ void (*kick)(struct kmem_cache *, int nr, void **, void *private);
+
int inuse; /* Offset to metadata */
int align; /* Alignment */
int defrag_ratio; /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 12:37:47.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 12:37:51.000000000 -0800
@@ -2760,6 +2760,20 @@ void kfree(const void *x)
}
EXPORT_SYMBOL(kfree);
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+ void *(*get)(struct kmem_cache *, int nr, void **),
+ void (*kick)(struct kmem_cache *, int nr, void **, void *private))
+{
+ /*
+ * Defragmentable slabs must have a ctor otherwise objects may be
+ * in an undetermined state after they are allocated.
+ */
+ BUG_ON(!s->ctor);
+ s->get = get;
+ s->kick = kick;
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
static unsigned long count_partial(struct kmem_cache_node *n)
{
unsigned long flags;
@@ -3058,7 +3072,7 @@ static int slab_unmergeable(struct kmem_
if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE))
return 1;
- if (s->ctor)
+ if (s->ctor || s->kick || s->get)
return 1;
/*
@@ -3795,7 +3809,21 @@ static ssize_t ops_show(struct kmem_cach
if (s->ctor) {
x += sprintf(buf + x, "ctor : ");
- x += sprint_symbol(buf + x, (unsigned long)s->ops->ctor);
+ x += sprint_symbol(buf + x, (unsigned long)s->ctor);
+ x += sprintf(buf + x, "\n");
+ }
+
+ if (s->get) {
+ x += sprintf(buf + x, "get : ");
+ x += sprint_symbol(buf + x,
+ (unsigned long)s->get);
+ x += sprintf(buf + x, "\n");
+ }
+
+ if (s->kick) {
+ x += sprintf(buf + x, "kick : ");
+ x += sprint_symbol(buf + x,
+ (unsigned long)s->kick);
x += sprintf(buf + x, "\n");
}
return x;
--
* [patch 10/23] SLUB: Sort slab cache list and establish maximum objects for defrag slabs
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (8 preceding siblings ...)
2007-11-07 1:11 ` [patch 09/23] SLUB: Add get() and kick() methods Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 11/23] SLUB: Slab defrag core Christoph Lameter
` (13 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0007-slab_defrag_determine_maximum_objects.patch --]
[-- Type: text/plain, Size: 2375 bytes --]
When defragmenting slabs it is advantageous to have all
defragmentable slab caches together at the beginning of the list so that we do
not have to scan the complete list. When adding a slab cache, put
defragmentable caches first and the others last.
Also determine the maximum number of objects in defragmentable slabs. This
allows us to size the allocation of the arrays that will later hold references
to these objects.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slub.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 12:37:51.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 12:37:54.000000000 -0800
@@ -198,6 +198,9 @@ static enum {
static DECLARE_RWSEM(slub_lock);
static LIST_HEAD(slab_caches);
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects = 0;
+
/*
* Tracking user of a slab.
*/
@@ -2546,7 +2549,7 @@ static struct kmem_cache *create_kmalloc
flags, NULL))
goto panic;
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s))
goto panic;
@@ -2760,6 +2763,13 @@ void kfree(const void *x)
}
EXPORT_SYMBOL(kfree);
+static inline void *alloc_scratch(void)
+{
+ return kmalloc(max_defrag_slab_objects * sizeof(void *) +
+ BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
+ GFP_KERNEL);
+}
+
void kmem_cache_setup_defrag(struct kmem_cache *s,
void *(*get)(struct kmem_cache *, int nr, void **),
void (*kick)(struct kmem_cache *, int nr, void **, void *private))
@@ -2771,6 +2781,11 @@ void kmem_cache_setup_defrag(struct kmem
BUG_ON(!s->ctor);
s->get = get;
s->kick = kick;
+ down_write(&slub_lock);
+ list_move(&s->list, &slab_caches);
+ if (s->objects > max_defrag_slab_objects)
+ max_defrag_slab_objects = s->objects;
+ up_write(&slub_lock);
}
EXPORT_SYMBOL(kmem_cache_setup_defrag);
@@ -3159,7 +3174,7 @@ struct kmem_cache *kmem_cache_create(con
if (s) {
if (kmem_cache_open(s, GFP_KERNEL, name,
size, align, flags, ctor)) {
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s))
goto err;
--
* [patch 11/23] SLUB: Slab defrag core
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (9 preceding siblings ...)
2007-11-07 1:11 ` [patch 10/23] SLUB: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 22:13 ` Christoph Lameter
2007-11-07 1:11 ` [patch 12/23] SLUB: Trigger defragmentation from memory reclaim Christoph Lameter
` (12 subsequent siblings)
23 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0009-slab_defrag_core.patch --]
[-- Type: text/plain, Size: 12750 bytes --]
Slab defragmentation (aside from its use by lumpy reclaim) may occur:
1. Unconditionally when kmem_cache_shrink() is called on a slab cache by
the kernel.
2. When the slabinfo command line tool is used to trigger slab shrinking.
3. Conditionally, per node, when kmem_cache_defrag(<node>) is called.
Defragmentation is only performed if the fragmentation ratio of the slab
cache is lower than the specified percentage. The fragmentation ratio is
measured by calculating the percentage of objects in use compared to the
total number of objects that the slab cache could hold, so a lower value
means a more sparsely populated cache.
kmem_cache_defrag() takes a node parameter. This can either be -1 if
defragmentation should be performed on all nodes, or a node number.
If a node number is specified then defragmentation is performed only
on that node.
Slab defragmentation is a memory intensive operation that can be
sped up on a NUMA system if mostly node-local memory is accessed. That
is the case if we run shrinking on a node right after having run
shrink_slab() reclaim on that node.
A couple of functions must be set up via a call to kmem_cache_setup_defrag()
in order for a slab cache to support defragmentation. These are:
void *get(struct kmem_cache *s, int nr, void **objects)
Must obtain a reference to the listed objects. SLUB guarantees that
the objects are still allocated. However, other threads may be blocked
in slab_free() attempting to free objects in the slab. These frees may
succeed as soon as get() returns to the slab allocator. The function must
be able to detect such situations and void the handling of such objects
(for example by voiding the corresponding entries in the objects
array).
No slab operations may be performed in get(). Interrupts
are disabled. What can be done is very limited. The slab lock
for the page that contains the object is taken. Any attempt to perform
a slab operation may lead to a deadlock.
get() returns a private pointer that is passed to kick(). Should we
be unable to obtain all references then that pointer may indicate
to the kick() function that it should not attempt any object removal
or move but simply drop the references that were already obtained.
void kick(struct kmem_cache *, int nr, void **objects, void *get_result)
After SLUB has established references to the objects in a
slab it will then drop all locks and use kick() to move objects out
of the slab. The existence of the objects is guaranteed by virtue of
the references obtained earlier via get(). The callback may perform
any slab operation since no locks are held at the time of call.
The callback should remove the object from the slab in some way. This
may be accomplished by reclaiming the object and then running
kmem_cache_free() or reallocating it and then running
kmem_cache_free(). Reallocation is advantageous because the partial
slabs were just sorted to have the partial slabs with the most objects
first. Reallocation is likely to result in filling up a slab in
addition to freeing up one slab. A filled up slab can also be removed
from the partial list. So there could be a double effect.
Kick() does not return a result. SLUB will check the number of
remaining objects in the slab. If all objects were removed then
we know that the operation was successful.
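To make the contract concrete, here is a minimal sketch of the two callbacks
for a hypothetical reference counted object type. struct foo, its refcount
field and the foo_get()/foo_kick() names are invented for this example; the
real users later in the series (dentries, inodes, buffer heads) follow the
same basic pattern with their own existing reference counting:

#include <linux/slab.h>
#include <asm/atomic.h>

struct foo {
	atomic_t refcount;
	/* ... payload ... */
};

/* Slab lock held, interrupts disabled: only take references here */
static void *foo_get(struct kmem_cache *s, int nr, void **v)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *object = v[i];

		/*
		 * A refcount of zero means a concurrent free is already
		 * under way. Void the entry so that kick() skips it.
		 */
		if (!atomic_inc_not_zero(&object->refcount))
			v[i] = NULL;
	}
	return NULL;	/* no private state needed in this example */
}

/* No locks held, interrupts enabled: any slab operation is allowed */
static void foo_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *object = v[i];

		if (!object)
			continue;
		/*
		 * Drop our reference. If it was the last one the object is
		 * unused and can be returned to its slab, helping to empty
		 * the page. Otherwise the object stays put and SLUB sees
		 * that the slab could not be cleared completely.
		 */
		if (atomic_dec_and_test(&object->refcount))
			kmem_cache_free(s, object);
	}
}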
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slab.c | 5 +
mm/slub.c | 275 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
2 files changed, 222 insertions(+), 58 deletions(-)
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2007-11-06 16:03:35.000000000 -0800
+++ linux-2.6/mm/slab.c 2007-11-06 16:32:32.000000000 -0800
@@ -2554,6 +2554,11 @@ int kmem_cache_shrink(struct kmem_cache
}
EXPORT_SYMBOL(kmem_cache_shrink);
+int kmem_cache_defrag(int node)
+{
+ return 0;
+}
+
/**
* kmem_cache_destroy - delete a cache
* @cachep: the cache to destroy
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 16:03:44.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 16:33:04.000000000 -0800
@@ -143,14 +143,14 @@
* Mininum number of partial slabs. These will be left on the partial
* lists even if they are empty. kmem_cache_shrink may reclaim them.
*/
-#define MIN_PARTIAL 2
+#define MIN_PARTIAL 5
/*
* Maximum number of desirable partial slabs.
- * The existence of more partial slabs makes kmem_cache_shrink
- * sort the partial list by the number of objects in the.
+ * More slabs cause kmem_cache_shrink to sort the slabs by objects
+ * and triggers slab defragmentation.
*/
-#define MAX_PARTIAL 10
+#define MAX_PARTIAL 20
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
@@ -2803,76 +2803,235 @@ static unsigned long count_partial(struc
}
/*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
- * the remaining slabs by the number of items in use. The slabs with the
- * most items in use come first. New allocations will then fill those up
- * and thus they can be removed from the partial lists.
+ * Vacate all objects in the given slab.
*
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
- */
-int kmem_cache_shrink(struct kmem_cache *s)
+ * The scratch area passed to this function is sufficient to hold
+ * struct list_head times objects per slab. We use it to hold void * times
+ * objects per slab plus a bitmap for each object.
+*/
+static int kmem_cache_vacate(struct page *page, void *scratch)
{
- int node;
- int i;
- struct kmem_cache_node *n;
- struct page *page;
- struct page *t;
- struct list_head *slabs_by_inuse =
- kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
+ void **vector = scratch;
+ void *p;
+ void *addr = page_address(page);
+ struct kmem_cache *s;
+ unsigned long *map;
+ int leftover;
+ int objects;
+ void *private;
unsigned long flags;
unsigned long state;
- if (!slabs_by_inuse)
- return -ENOMEM;
+ BUG_ON(!PageSlab(page));
+ local_irq_save(flags);
+ state = slab_lock(page);
+ BUG_ON(state & FROZEN);
- flush_all(s);
- for_each_node_state(node, N_NORMAL_MEMORY) {
- n = get_node(s, node);
+ s = page->slab;
+ map = scratch + max_defrag_slab_objects * sizeof(void **);
+ if (!page->inuse || !s->kick)
+ goto out;
- if (!n->nr_partial)
- continue;
+ /* Determine used objects */
+ bitmap_fill(map, s->objects);
+ for_each_free_object(p, s, page->freelist)
+ __clear_bit(slab_index(p, s, addr), map);
- for (i = 0; i < s->objects; i++)
- INIT_LIST_HEAD(slabs_by_inuse + i);
+ objects = 0;
+ memset(vector, 0, s->objects * sizeof(void **));
+ for_each_object(p, s, addr)
+ if (test_bit(slab_index(p, s, addr), map))
+ vector[objects++] = p;
- spin_lock_irqsave(&n->list_lock, flags);
+ private = s->get(s, objects, vector);
- /*
- * Build lists indexed by the items in use in each slab.
- *
- * Note that concurrent frees may occur while we hold the
- * list_lock. page->inuse here is the upper limit.
- */
- list_for_each_entry_safe(page, t, &n->partial, lru) {
- if (!page->inuse && (state = slab_trylock(page))) {
- /*
- * Must hold slab lock here because slab_free
- * may have freed the last object and be
- * waiting to release the slab.
- */
- list_del(&page->lru);
+ /*
+ * Got references. Now we can drop the slab lock. The slab
+ * is frozen so it cannot vanish from under us nor will
+ * allocations be performed on the slab. However, unlocking the
+ * slab will allow concurrent slab_frees to proceed.
+ */
+ slab_unlock(page, state);
+ local_irq_restore(flags);
+
+ /*
+ * Perform the KICK callbacks to remove the objects.
+ */
+ s->kick(s, objects, vector, private);
+
+ local_irq_save(flags);
+ state = slab_lock(page);
+out:
+ /*
+ * Check the result and unfreeze the slab
+ */
+ leftover = page->inuse;
+ unfreeze_slab(s, page, leftover > 0, state);
+ local_irq_restore(flags);
+ return leftover;
+}
+
+/*
+ * Reclaim objects from a list of slab pages that have been gathered.
+ * Must be called with slabs that have been isolated before.
+ */
+int kmem_cache_reclaim(struct list_head *zaplist)
+{
+ int freed = 0;
+ void **scratch;
+ struct page *page;
+ struct page *page2;
+
+ if (list_empty(zaplist))
+ return 0;
+
+ scratch = alloc_scratch();
+ if (!scratch)
+ return 0;
+
+ list_for_each_entry_safe(page, page2, zaplist, lru) {
+ list_del(&page->lru);
+ if (kmem_cache_vacate(page, scratch) == 0)
+ freed++;
+ }
+ kfree(scratch);
+ return freed;
+}
+
+/*
+ * Shrink the slab cache on a particular node of the cache
+ * by releasing slabs with zero objects and trying to reclaim
+ * slabs with less than a quarter of objects allocated.
+ */
+static unsigned long __kmem_cache_shrink(struct kmem_cache *s,
+ struct kmem_cache_node *n)
+{
+ unsigned long flags;
+ struct page *page, *page2;
+ LIST_HEAD(zaplist);
+ int freed = 0;
+ unsigned long state;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry_safe(page, page2, &n->partial, lru) {
+ if (page->inuse > s->objects / 4)
+ continue;
+ state = slab_trylock(page);
+ if (!state)
+ continue;
+
+ if (page->inuse) {
+
+ list_move(&page->lru, &zaplist);
+ if (s->kick) {
n->nr_partial--;
- slab_unlock(page, state);
- discard_slab(s, page);
- } else {
- list_move(&page->lru,
- slabs_by_inuse + page->inuse);
+ state |= FROZEN;
}
+ slab_unlock(page, state);
+
+ } else {
+ list_del(&page->lru);
+ slab_unlock(page, state);
+ discard_slab(s, page);
+ freed++;
}
+ }
- /*
- * Rebuild the partial list with the slabs filled up most
- * first and the least used slabs at the end.
- */
- for (i = s->objects - 1; i >= 0; i--)
- list_splice(slabs_by_inuse + i, n->partial.prev);
+ if (!s->kick)
+ /* Simply put the zaplist at the end */
+ list_splice(&zaplist, n->partial.prev);
- spin_unlock_irqrestore(&n->list_lock, flags);
- }
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ if (s->kick)
+ freed += kmem_cache_reclaim(&zaplist);
+ return freed;
+}
+
+static unsigned long __kmem_cache_defrag(struct kmem_cache *s, int node)
+{
+ unsigned long capacity;
+ unsigned long objects_in_full_slabs;
+ unsigned long ratio;
+ struct kmem_cache_node *n = get_node(s, node);
+
+ /*
+ * An insignificant number of partial slabs means that the
+ * slab cache does not need any defragmentation.
+ */
+ if (n->nr_partial <= MAX_PARTIAL)
+ return 0;
+
+ capacity = atomic_long_read(&n->nr_slabs) * s->objects;
+ objects_in_full_slabs =
+ (atomic_long_read(&n->nr_slabs) - n->nr_partial)
+ * s->objects;
+ /*
+ * Worst case calculation: If we would be over the ratio
+ * even if all partial slabs would only have one object
+ * then we can skip the next test that requires a scan
+ * through all the partial page structs to sum up the actual
+ * number of objects in the partial slabs.
+ */
+ ratio = (objects_in_full_slabs + 1 * n->nr_partial) * 100 / capacity;
+ if (ratio > s->defrag_ratio)
+ return 0;
+
+ /*
+ * Now for the real calculation. If usage ratio is more than required
+ * then no defragmentation is necessary.
+ */
+ ratio = (objects_in_full_slabs + count_partial(n)) * 100 / capacity;
+ if (ratio > s->defrag_ratio)
+ return 0;
+
+ return __kmem_cache_shrink(s, n) << s->order;
+}
+
+/*
+ * Defrag slabs conditional on the amount of fragmentation on each node.
+ */
+int kmem_cache_defrag(int node)
+{
+ struct kmem_cache *s;
+ unsigned long pages = 0;
+
+ /*
+ * kmem_cache_defrag may be called from the reclaim path which may be
+ * called for any page allocator alloc. So there is the danger that we
+ * get called in a situation where slub already acquired the slub_lock
+ * for other purposes.
+ */
+ if (!down_read_trylock(&slub_lock))
+ return 0;
+
+ list_for_each_entry(s, &slab_caches, list) {
+ if (node == -1) {
+ int nid;
+
+ for_each_node_state(nid, N_NORMAL_MEMORY)
+ pages += __kmem_cache_defrag(s, nid);
+ } else
+ pages += __kmem_cache_defrag(s, node);
+ }
+ up_read(&slub_lock);
+ return pages;
+}
+EXPORT_SYMBOL(kmem_cache_defrag);
+
+/*
+ * kmem_cache_shrink removes empty slabs from the partial lists.
+ * If the slab cache support defragmentation then objects are
+ * reclaimed.
+ */
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+ int node;
+
+ flush_all(s);
+ for_each_node_state(node, N_NORMAL_MEMORY)
+ __kmem_cache_shrink(s, get_node(s, node));
- kfree(slabs_by_inuse);
return 0;
}
EXPORT_SYMBOL(kmem_cache_shrink);
--
* [patch 12/23] SLUB: Trigger defragmentation from memory reclaim
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (10 preceding siblings ...)
2007-11-07 1:11 ` [patch 11/23] SLUB: Slab defrag core Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 9:28 ` Johannes Weiner
2007-11-08 15:12 ` Mel Gorman
2007-11-07 1:11 ` [patch 13/23] Buffer heads: Support slab defrag Christoph Lameter
` (11 subsequent siblings)
23 siblings, 2 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0010-slab_defrag_trigger_defrag_from_reclaim.patch --]
[-- Type: text/plain, Size: 6019 bytes --]
This patch triggers slab defragmentation from memory reclaim.
The logical point for this is after slab shrinking was performed in
vmscan.c. At that point the slabs have likely become more fragmented
because objects were freed via the LRUs, so we call kmem_cache_defrag()
from there.
shrink_slab() from vmscan.c is called in some contexts to do
global shrinking of slabs and in others to do shrinking for
a particular zone. Pass the zone to shrink_slab(), so that shrink_slab()
can call kmem_cache_defrag() and restrict the defragmentation to
the node that is under memory pressure.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/drop_caches.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/slab.h | 1 +
mm/vmscan.c | 26 +++++++++++++++++++-------
4 files changed, 22 insertions(+), 9 deletions(-)
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c 2007-08-29 19:30:53.000000000 -0700
+++ linux-2.6/fs/drop_caches.c 2007-11-06 12:53:40.000000000 -0800
@@ -50,7 +50,7 @@ void drop_slab(void)
int nr_objects;
do {
- nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+ nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
} while (nr_objects > 10);
}
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2007-11-06 12:33:55.000000000 -0800
+++ linux-2.6/include/linux/mm.h 2007-11-06 12:54:11.000000000 -0800
@@ -1118,7 +1118,7 @@ int in_gate_area_no_task(unsigned long a
int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages);
+ unsigned long lru_pages, struct zone *z);
void drop_pagecache(void);
void drop_slab(void);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2007-11-06 12:37:51.000000000 -0800
+++ linux-2.6/include/linux/slab.h 2007-11-06 12:53:40.000000000 -0800
@@ -63,6 +63,7 @@ void kmem_cache_free(struct kmem_cache *
unsigned int kmem_cache_size(struct kmem_cache *);
const char *kmem_cache_name(struct kmem_cache *);
int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+int kmem_cache_defrag(int node);
/*
* Please use this macro to create slab caches. Simply specify the
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2007-10-25 18:28:41.000000000 -0700
+++ linux-2.6/mm/vmscan.c 2007-11-06 12:55:25.000000000 -0800
@@ -150,10 +150,18 @@ EXPORT_SYMBOL(unregister_shrinker);
* are eligible for the caller's allocation attempt. It is used for balancing
* slab reclaim versus page reclaim.
*
+ * zone is the zone for which we are shrinking the slabs. If the intent
+ * is to do a global shrink then zone may be NULL. Specification of a
+ * zone is currently only used to limit slab defragmentation to a NUMA node.
+ * The performance of shrink_slab would be better (in particular under NUMA)
+ * if it could be targeted as a whole to the zone that is under memory
+ * pressure but the VFS infrastructure does not allow that at the present
+ * time.
+ *
* Returns the number of slab objects which we shrunk.
*/
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages)
+ unsigned long lru_pages, struct zone *zone)
{
struct shrinker *shrinker;
unsigned long ret = 0;
@@ -210,6 +218,8 @@ unsigned long shrink_slab(unsigned long
shrinker->nr += total_scan;
}
up_read(&shrinker_rwsem);
+ if (gfp_mask & __GFP_FS)
+ kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
return ret;
}
@@ -1241,7 +1251,7 @@ unsigned long try_to_free_pages(struct z
if (!priority)
disable_swap_token();
nr_reclaimed += shrink_zones(priority, zones, &sc);
- shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
+ shrink_slab(sc.nr_scanned, gfp_mask, lru_pages, NULL);
if (reclaim_state) {
nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -1419,7 +1429,7 @@ loop_again:
nr_reclaimed += shrink_zone(priority, zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
- lru_pages);
+ lru_pages, zone);
nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
if (zone_is_all_unreclaimable(zone))
@@ -1658,7 +1668,7 @@ unsigned long shrink_all_memory(unsigned
/* If slab caches are huge, it's better to hit them first */
while (nr_slab >= lru_pages) {
reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+ shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL);
if (!reclaim_state.reclaimed_slab)
break;
@@ -1696,7 +1706,7 @@ unsigned long shrink_all_memory(unsigned
reclaim_state.reclaimed_slab = 0;
shrink_slab(sc.nr_scanned, sc.gfp_mask,
- count_lru_pages());
+ count_lru_pages(), NULL);
ret += reclaim_state.reclaimed_slab;
if (ret >= nr_pages)
goto out;
@@ -1713,7 +1723,8 @@ unsigned long shrink_all_memory(unsigned
if (!ret) {
do {
reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+ shrink_slab(nr_pages, sc.gfp_mask,
+ count_lru_pages(), NULL);
ret += reclaim_state.reclaimed_slab;
} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
}
@@ -1875,7 +1886,8 @@ static int __zone_reclaim(struct zone *z
* Note that shrink_slab will free memory on all zones and may
* take a long time.
*/
- while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
+ while (shrink_slab(sc.nr_scanned, gfp_mask, order,
+ zone) &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
slab_reclaimable - nr_pages)
;
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 13/23] Buffer heads: Support slab defrag
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (11 preceding siblings ...)
2007-11-07 1:11 ` [patch 12/23] SLUB: Trigger defragmentation from memory reclaim Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 14/23] inodes: Support generic defragmentation Christoph Lameter
` (10 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0016-slab_defrag_buffer_head.patch --]
[-- Type: text/plain, Size: 3541 bytes --]
Defragmentation support for buffer heads. We convert the references to
buffers to struct page references and try to remove the buffers from
those pages. If the pages are dirty then trigger writeout so that the
buffer heads can be removed later.
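Condensed, the get()/kick() pair registered for bh_cachep does roughly the
following (a sketch; locking, the duplicate-page check and error handling
are as in the diff below):

	/* get(): translate each buffer_head into a reference on its page */
	page = bh->b_page;
	if (get_page_unless_zero(page))
		v[n++] = page;

	/* kick(): for each pinned page write it back if dirty, otherwise
	 * try to strip its buffers, then drop the reference */
	if (!TestSetPageLocked(page)) {
		if (PageDirty(page))
			trigger_write(page);
		else {
			if (PagePrivate(page))
				try_to_free_buffers(page);
			unlock_page(page);
		}
	}
	put_page(page);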
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/buffer.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 102 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2007-11-06 12:56:07.000000000 -0800
+++ linux-2.6/fs/buffer.c 2007-11-06 12:56:11.000000000 -0800
@@ -3213,7 +3213,7 @@ static int buffer_cpu_notify(struct noti
}
static void
-init_buffer_head(void *data, struct kmem_cache *cachep, unsigned long flags)
+init_buffer_head(struct kmem_cache *cachep, void *data)
{
struct buffer_head * bh = (struct buffer_head *)data;
@@ -3221,6 +3221,106 @@ init_buffer_head(void *data, struct kmem
INIT_LIST_HEAD(&bh->b_assoc_buffers);
}
+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+ int rc;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = 1,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nonblocking = 1,
+ .for_reclaim = 0
+ };
+
+ if (!mapping->a_ops->writepage)
+ /* No write method for the address space */
+ return;
+
+ if (!clear_page_dirty_for_io(page))
+ /* Someone else already triggered a write */
+ return;
+
+ rc = mapping->a_ops->writepage(page, &wbc);
+ if (rc < 0)
+ /* I/O Error writing */
+ return;
+
+ if (rc == AOP_WRITEPAGE_ACTIVATE)
+ unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffers will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffers(struct kmem_cache *s, int nr, void **v)
+{
+ struct page *page;
+ struct buffer_head *bh;
+ int i,j;
+ int n = 0;
+
+ for (i = 0; i < nr; i++) {
+ bh = v[i];
+ v[i] = NULL;
+
+ page = bh->b_page;
+
+ if (page && PagePrivate(page)) {
+ for (j = 0; j < n; j++)
+ if (page == v[j])
+ goto cont;
+ }
+
+ if (get_page_unless_zero(page))
+ v[n++] = page;
+cont: ;
+ }
+ return NULL;
+}
+
+/*
+ * Despite its name: kick_buffers operates on a list of pointers to
+ * page structs that was setup by get_buffer
+ */
+static void kick_buffers(struct kmem_cache *s, int nr, void **v,
+ void *private)
+{
+ struct page *page;
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ page = v[i];
+
+ if (!page || PageWriteback(page))
+ continue;
+
+
+ if (!TestSetPageLocked(page)) {
+ if (PageDirty(page))
+ trigger_write(page);
+ else {
+ if (PagePrivate(page))
+ try_to_free_buffers(page);
+ unlock_page(page);
+ }
+ }
+ put_page(page);
+ }
+}
+
void __init buffer_init(void)
{
int nrpages;
@@ -3230,6 +3330,7 @@ void __init buffer_init(void)
(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
SLAB_MEM_SPREAD),
init_buffer_head);
+ kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);
/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 14/23] inodes: Support generic defragmentation
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (12 preceding siblings ...)
2007-11-07 1:11 ` [patch 13/23] Buffer heads: Support slab defrag Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 10:17 ` Jörn Engel
2007-11-07 1:11 ` [patch 15/23] FS: ExtX filesystem defrag Christoph Lameter
` (9 subsequent siblings)
23 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0017-slab_defrag_generic_inode_defrag.patch --]
[-- Type: text/plain, Size: 4068 bytes --]
This implements the ability to remove inodes in a particular slab
from the inode cache. In order to remove an inode we may have to write out
the pages of the inode, the inode itself, and remove the dentries referring
to the inode.
Provide generic functionality that can be used by filesystems that have
their own inode caches to also tie into the defragmentation functions
that are made available here.
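For a filesystem that embeds struct inode into its own inode structure the
hookup reduces to something like the following ('foo' is a placeholder for
illustration; the real conversions follow in the next patches):

	static void *foo_get_inodes(struct kmem_cache *s, int nr, void **v)
	{
		return fs_get_inodes(s, nr, v,
			offsetof(struct foo_inode_info, vfs_inode));
	}

	/* at inode cache creation time */
	kmem_cache_setup_defrag(foo_inode_cachep, foo_get_inodes, kick_inodes);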
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/inode.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 6 +++
2 files changed, 102 insertions(+)
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2007-10-17 13:35:52.000000000 -0700
+++ linux-2.6/fs/inode.c 2007-11-06 12:56:15.000000000 -0800
@@ -1369,6 +1369,101 @@ static int __init set_ihash_entries(char
}
__setup("ihash_entries=", set_ihash_entries);
+void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ int i;
+
+ spin_lock(&inode_lock);
+ for (i = 0; i < nr; i++) {
+ struct inode *inode = v[i];
+
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ v[i] = NULL;
+ else
+ __iget(inode);
+ }
+ spin_unlock(&inode_lock);
+ return NULL;
+}
+EXPORT_SYMBOL(get_inodes);
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * structures. The offset is the offset of the struct inode in the fs inode.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+ unsigned long offset)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ v[i] += offset;
+
+ return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+ struct inode *inode;
+ int i;
+ int abort = 0;
+ LIST_HEAD(freeable);
+ struct super_block *sb;
+
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+ if (!inode)
+ continue;
+
+ if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ if (remove_inode_buffers(inode))
+ invalidate_mapping_pages(&inode->i_data,
+ 0, -1);
+ }
+
+ /* Invalidate children and dentry */
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *d = d_find_alias(inode);
+
+ if (d) {
+ d_invalidate(d);
+ dput(d);
+ }
+ }
+
+ if (inode->i_state & I_DIRTY)
+ write_inode_now(inode, 1);
+
+ d_prune_aliases(inode);
+ }
+
+ mutex_lock(&iprune_mutex);
+ for (i = 0; i < nr; i++) {
+ inode = v[i];
+ if (!inode)
+ continue;
+
+ sb = inode->i_sb;
+ iput(inode);
+ if (abort || !(sb->s_flags & MS_ACTIVE))
+ continue;
+
+ spin_lock(&inode_lock);
+ abort = !can_unuse(inode);
+
+ if (!abort) {
+ list_move(&inode->i_list, &freeable);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_unused--;
+ }
+ spin_unlock(&inode_lock);
+ }
+ dispose_list(&freeable);
+ mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
/*
* Initialize the waitqueues and inode hash table.
*/
@@ -1408,6 +1503,7 @@ void __init inode_init(void)
SLAB_MEM_SPREAD),
init_once);
register_shrinker(&icache_shrinker);
+ kmem_cache_setup_defrag(inode_cachep, get_inodes, kick_inodes);
/* Hash may have been set up in inode_init_early */
if (!hashdist)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2007-10-25 18:28:41.000000000 -0700
+++ linux-2.6/include/linux/fs.h 2007-11-06 12:56:15.000000000 -0800
@@ -1776,6 +1776,12 @@ static inline void insert_inode_hash(str
__insert_inode_hash(inode, inode->i_ino);
}
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *get_inodes(struct kmem_cache *, int nr, void **);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+ unsigned long offset);
+
extern struct file * get_empty_filp(void);
extern void file_move(struct file *f, struct list_head *list);
extern void file_kill(struct file *f);
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 15/23] FS: ExtX filesystem defrag
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (13 preceding siblings ...)
2007-11-07 1:11 ` [patch 14/23] inodes: Support generic defragmentation Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 16/23] FS: XFS slab defragmentation Christoph Lameter
` (8 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0018-slab_defrag_ext234.patch --]
[-- Type: text/plain, Size: 2761 bytes --]
Support defragmentation for extX filesystem inodes
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/ext2/super.c | 9 +++++++++
fs/ext3/super.c | 8 ++++++++
fs/ext4/super.c | 8 ++++++++
3 files changed, 25 insertions(+)
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c 2007-10-25 18:28:40.000000000 -0700
+++ linux-2.6/fs/ext2/super.c 2007-11-06 12:56:18.000000000 -0800
@@ -171,6 +171,12 @@ static void init_once(struct kmem_cache
inode_init_once(&ei->vfs_inode);
}
+static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext2_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
@@ -180,6 +186,9 @@ static int init_inodecache(void)
init_once);
if (ext2_inode_cachep == NULL)
return -ENOMEM;
+
+ kmem_cache_setup_defrag(ext2_inode_cachep,
+ ext2_get_inodes, kick_inodes);
return 0;
}
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c 2007-10-25 18:28:40.000000000 -0700
+++ linux-2.6/fs/ext3/super.c 2007-11-06 12:56:18.000000000 -0800
@@ -484,6 +484,12 @@ static void init_once(struct kmem_cache
inode_init_once(&ei->vfs_inode);
}
+static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext3_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
@@ -493,6 +499,8 @@ static int init_inodecache(void)
init_once);
if (ext3_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(ext3_inode_cachep,
+ ext3_get_inodes, kick_inodes);
return 0;
}
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c 2007-10-25 18:28:40.000000000 -0700
+++ linux-2.6/fs/ext4/super.c 2007-11-06 12:56:18.000000000 -0800
@@ -537,6 +537,12 @@ static void init_once(struct kmem_cache
inode_init_once(&ei->vfs_inode);
}
+static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct ext4_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
@@ -546,6 +552,8 @@ static int init_inodecache(void)
init_once);
if (ext4_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(ext4_inode_cachep,
+ ext4_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 16/23] FS: XFS slab defragmentation
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (14 preceding siblings ...)
2007-11-07 1:11 ` [patch 15/23] FS: ExtX filesystem defrag Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 17/23] FS: Proc filesystem support for slab defrag Christoph Lameter
` (7 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0019-slab_defrag_xfs.patch --]
[-- Type: text/plain, Size: 820 bytes --]
Support inode defragmentation for xfs
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/xfs/linux-2.6/xfs_super.c | 1 +
1 file changed, 1 insertion(+)
Index: linux-2.6/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_super.c 2007-11-06 12:57:26.000000000 -0800
+++ linux-2.6/fs/xfs/linux-2.6/xfs_super.c 2007-11-06 12:57:34.000000000 -0800
@@ -374,6 +374,7 @@ xfs_init_zones(void)
xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
if (!xfs_ioend_zone)
goto out_destroy_vnode_zone;
+ kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
xfs_ioend_zone);
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 17/23] FS: Proc filesystem support for slab defrag
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (15 preceding siblings ...)
2007-11-07 1:11 ` [patch 16/23] FS: XFS slab defragmentation Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 18/23] FS: Slab defrag: Reiserfs support Christoph Lameter
` (6 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0020-slab_defrag_proc.patch --]
[-- Type: text/plain, Size: 1081 bytes --]
Support procfs inode defragmentation
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/proc/inode.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6.23-mm1/fs/proc/inode.c
===================================================================
--- linux-2.6.23-mm1.orig/fs/proc/inode.c 2007-10-12 16:26:08.000000000 -0700
+++ linux-2.6.23-mm1/fs/proc/inode.c 2007-10-12 18:48:32.000000000 -0700
@@ -114,6 +114,12 @@ static void init_once(struct kmem_cache
inode_init_once(&ei->vfs_inode);
}
+static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct proc_inode, vfs_inode));
+};
+
int __init proc_init_inodecache(void)
{
proc_inode_cachep = kmem_cache_create("proc_inode_cache",
@@ -121,6 +127,8 @@ int __init proc_init_inodecache(void)
0, (SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD|SLAB_PANIC),
init_once);
+ kmem_cache_setup_defrag(proc_inode_cachep,
+ proc_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 18/23] FS: Slab defrag: Reiserfs support
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (16 preceding siblings ...)
2007-11-07 1:11 ` [patch 17/23] FS: Proc filesystem support for slab defrag Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 19/23] FS: Socket inode defragmentation Christoph Lameter
` (5 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0021-slab_defrag_reiserfs.patch --]
[-- Type: text/plain, Size: 1090 bytes --]
Slab defragmentation: Support reiserfs inode defragmentation
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/reiserfs/super.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6.23-mm1/fs/reiserfs/super.c
===================================================================
--- linux-2.6.23-mm1.orig/fs/reiserfs/super.c 2007-10-12 16:26:09.000000000 -0700
+++ linux-2.6.23-mm1/fs/reiserfs/super.c 2007-10-12 18:48:36.000000000 -0700
@@ -532,6 +532,12 @@ static void init_once(struct kmem_cache
#endif
}
+static void *reiserfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct reiserfs_inode_info, vfs_inode));
+}
+
static int init_inodecache(void)
{
reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
@@ -542,6 +548,8 @@ static int init_inodecache(void)
init_once);
if (reiserfs_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(reiserfs_inode_cachep,
+ reiserfs_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 19/23] FS: Socket inode defragmentation
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (17 preceding siblings ...)
2007-11-07 1:11 ` [patch 18/23] FS: Slab defrag: Reiserfs support Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 20/23] dentries: Add constructor Christoph Lameter
` (4 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0022-slab_defrag_socket.patch --]
[-- Type: text/plain, Size: 1023 bytes --]
Support inode defragmentation for sockets
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
net/socket.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c 2007-10-30 16:34:39.000000000 -0700
+++ linux-2.6/net/socket.c 2007-11-06 12:56:27.000000000 -0800
@@ -265,6 +265,12 @@ static void init_once(struct kmem_cache
inode_init_once(&ei->vfs_inode);
}
+static void *sock_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+ return fs_get_inodes(s, nr, v,
+ offsetof(struct socket_alloc, vfs_inode));
+}
+
static int init_inodecache(void)
{
sock_inode_cachep = kmem_cache_create("sock_inode_cache",
@@ -276,6 +282,8 @@ static int init_inodecache(void)
init_once);
if (sock_inode_cachep == NULL)
return -ENOMEM;
+ kmem_cache_setup_defrag(sock_inode_cachep,
+ sock_get_inodes, kick_inodes);
return 0;
}
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 20/23] dentries: Add constructor
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (18 preceding siblings ...)
2007-11-07 1:11 ` [patch 19/23] FS: Socket inode defragmentation Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-08 15:23 ` Mel Gorman
2007-11-07 1:11 ` [patch 21/23] dentries: dentry defragmentation Christoph Lameter
` (3 subsequent siblings)
23 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0024-slab_defrag_dentry_state.patch --]
[-- Type: text/plain, Size: 2186 bytes --]
In order to support defragmentation on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.
So provide a constructor.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/dcache.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2007-11-06 12:56:56.000000000 -0800
+++ linux-2.6/fs/dcache.c 2007-11-06 12:57:01.000000000 -0800
@@ -870,6 +870,16 @@ static struct shrinker dcache_shrinker =
.seeks = DEFAULT_SEEKS,
};
+void dcache_ctor(struct kmem_cache *s, void *p)
+{
+ struct dentry *dentry = p;
+
+ spin_lock_init(&dentry->d_lock);
+ dentry->d_inode = NULL;
+ INIT_LIST_HEAD(&dentry->d_lru);
+ INIT_LIST_HEAD(&dentry->d_alias);
+}
+
/**
* d_alloc - allocate a dcache entry
* @parent: parent of entry to allocate
@@ -907,8 +917,6 @@ struct dentry *d_alloc(struct dentry * p
atomic_set(&dentry->d_count, 1);
dentry->d_flags = DCACHE_UNHASHED;
- spin_lock_init(&dentry->d_lock);
- dentry->d_inode = NULL;
dentry->d_parent = NULL;
dentry->d_sb = NULL;
dentry->d_op = NULL;
@@ -918,9 +926,7 @@ struct dentry *d_alloc(struct dentry * p
dentry->d_cookie = NULL;
#endif
INIT_HLIST_NODE(&dentry->d_hash);
- INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
- INIT_LIST_HEAD(&dentry->d_alias);
if (parent) {
dentry->d_parent = dget(parent);
@@ -2096,14 +2102,10 @@ static void __init dcache_init(void)
{
int loop;
- /*
- * A constructor could be added for stable state like the lists,
- * but it is probably not worth it because of the cache nature
- * of the dcache.
- */
- dentry_cache = KMEM_CACHE(dentry,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-
+ dentry_cache = kmem_cache_create("dentry_cache", sizeof(struct dentry),
+ 0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+ dcache_ctor);
+
register_shrinker(&dcache_shrinker);
/* Hash may have been set up in dcache_init_early */
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 21/23] dentries: dentry defragmentation
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (19 preceding siblings ...)
2007-11-07 1:11 ` [patch 20/23] dentries: Add constructor Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 22/23] SLUB: Slab reclaim through Lumpy reclaim Christoph Lameter
` (2 subsequent siblings)
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0025-slab_defrag_dentry_defrag.patch --]
[-- Type: text/plain, Size: 4065 bytes --]
The dentry pruning for unused entries works in a straightforward way. It
could be made more aggressive if one would actually move dentries instead
of just reclaiming them.
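Stripped of the dcache_lock handling, the get()/kick() pair below amounts
to this per dentry (sketch only; the complete version is in the diff):

	/* get(): pin the dentry; dget_locked() also takes it off the LRU */
	dget_locked(dentry);

	/* kick(): invalidate, then free if we hold the last reference */
	d_invalidate(dentry);
	if (atomic_read(&dentry->d_count) > 1)
		dput(dentry);
	else
		prune_one_dentry(dentry);

	/* dentries are RCU freed, so wait before the slab pages are reused */
	synchronize_rcu();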
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/dcache.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 100 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2007-11-06 12:57:01.000000000 -0800
+++ linux-2.6/fs/dcache.c 2007-11-06 12:57:06.000000000 -0800
@@ -31,6 +31,7 @@
#include <linux/seqlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
+#include <linux/backing-dev.h>
#include "internal.h"
@@ -143,7 +144,10 @@ static struct dentry *d_kill(struct dent
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
- /*drops the locks, at that point nobody can reach this dentry */
+ /*
+ * drops the locks, at that point nobody (aside from defrag)
+ * can reach this dentry
+ */
dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
@@ -2098,6 +2102,100 @@ static void __init dcache_init_early(voi
INIT_HLIST_HEAD(&dentry_hashtable[loop]);
}
+/*
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+ struct dentry *dentry;
+ int i;
+
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ /*
+ * Three sorts of dentries cannot be reclaimed:
+ *
+ * 1. dentries that are in the process of being allocated
+ * or being freed. In that case the dentry is neither
+ * on the LRU nor hashed.
+ *
+ * 2. Fake hashed entries as used for anonymous dentries
+ * and pipe I/O. The fake hashed entries have d_flags
+ * set to indicate a hashed entry. However, the
+ * d_hash field indicates that the entry is not hashed.
+ *
+ * 3. dentries that have a backing store that is not
+ * writable. This is true for tmpfs and other in
+ * memory filesystems. Removing dentries from them
+ * would lose dentries for good.
+ */
+ if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
+ (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
+ (dentry->d_inode &&
+ !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
+ /* Ignore this dentry */
+ v[i] = NULL;
+ else
+ /* dget_locked will remove the dentry from the LRU */
+ dget_locked(dentry);
+ }
+ spin_unlock(&dcache_lock);
+ return NULL;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+ int nr, void **v, void *private)
+{
+ struct dentry *dentry;
+ int i;
+
+ /*
+ * First invalidate the dentries without holding the dcache lock
+ */
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ if (dentry)
+ d_invalidate(dentry);
+ }
+
+ /*
+ * If we are the last one holding a reference then the dentries can
+ * be freed. We need the dcache_lock.
+ */
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+ if (!dentry)
+ continue;
+
+ spin_lock(&dentry->d_lock);
+ if (atomic_read(&dentry->d_count) > 1) {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ dput(dentry);
+ spin_lock(&dcache_lock);
+ continue;
+ }
+
+ prune_one_dentry(dentry);
+ }
+ spin_unlock(&dcache_lock);
+
+ /*
+ * dentries are freed using RCU so we need to wait until RCU
+ * operations are complete
+ */
+ synchronize_rcu();
+}
+
static void __init dcache_init(void)
{
int loop;
@@ -2107,6 +2205,7 @@ static void __init dcache_init(void)
dcache_ctor);
register_shrinker(&dcache_shrinker);
+ kmem_cache_setup_defrag(dentry_cache, get_dentries, kick_dentries);
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 22/23] SLUB: Slab reclaim through Lumpy reclaim
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (20 preceding siblings ...)
2007-11-07 1:11 ` [patch 21/23] dentries: dentry defragmentation Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-07 1:11 ` [patch 23/23] SLUB: Add SlabReclaimable() to avoid repeated reclaim attempts Christoph Lameter
2007-11-08 15:26 ` [patch 00/23] Slab defragmentation V6 Mel Gorman
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0012-slab_defrag_lumpy_reclaim.patch --]
[-- Type: text/plain, Size: 7961 bytes --]
Create two special functions, kmem_cache_isolate_slab() and
kmem_cache_reclaim(), to support lumpy reclaim.
In order to isolate pages we will have to handle slab page allocations in
such a way that we can determine if a slab is valid whenever we access it,
regardless of where it is in its lifetime.
A valid slab that can be freed has PageSlab(page) and page->inuse > 0 set.
So we need to make sure in allocate_slab() that page->inuse is zero before
PageSlab is set.
kmem_cache_isolate_page() is called from lumpy reclaim to isolate pages
neighboring a page cache page that is being reclaimed. Lumpy reclaim will
gather the slabs and call kmem_cache_reclaim() on the list.
This means that we can remove a slab in order to be able to coalesce
a higher order page.
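The ordering requirement on slab setup can be summarized as follows
(sketch; see the new_slab() changes in the diff):

	page->inuse = 0;	/* must be visible before PG_slab ... */
	smp_wmb();
	page->flags = state;	/* ... so that a concurrent
				 * kmem_cache_isolate_slab() never sees
				 * a half initialized slab page */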
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/slab.h | 2 +
mm/slab.c | 13 ++++++
mm/slub.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++---
mm/vmscan.c | 13 +++++-
4 files changed, 123 insertions(+), 7 deletions(-)
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2007-11-06 13:50:47.000000000 -0800
+++ linux-2.6/include/linux/slab.h 2007-11-06 13:50:54.000000000 -0800
@@ -64,6 +64,8 @@ unsigned int kmem_cache_size(struct kmem
const char *kmem_cache_name(struct kmem_cache *);
int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
int kmem_cache_defrag(int node);
+int kmem_cache_isolate_slab(struct page *);
+int kmem_cache_reclaim(struct list_head *);
/*
* Please use this macro to create slab caches. Simply specify the
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2007-11-06 13:50:33.000000000 -0800
+++ linux-2.6/mm/slab.c 2007-11-06 13:50:54.000000000 -0800
@@ -2559,6 +2559,19 @@ int kmem_cache_defrag(int node)
return 0;
}
+/*
+ * SLAB does not support slab defragmentation
+ */
+int kmem_cache_isolate_slab(struct page *page)
+{
+ return -ENOSYS;
+}
+
+int kmem_cache_reclaim(struct list_head *zaplist)
+{
+ return 0;
+}
+
/**
* kmem_cache_destroy - delete a cache
* @cachep: the cache to destroy
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 13:50:40.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 13:50:54.000000000 -0800
@@ -1088,18 +1088,19 @@ static noinline struct page *new_slab(st
page = allocate_slab(s,
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
if (!page)
- goto out;
+ return NULL;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
+
+ page->inuse = 0;
page->slab = s;
- state = 1 << PG_slab;
+ state = page->flags | (1 << PG_slab);
if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
SLAB_STORE_USER | SLAB_TRACE))
state |= SLABDEBUG;
- page->flags |= state;
start = page_address(page);
page->end = start + 1;
@@ -1116,8 +1117,13 @@ static noinline struct page *new_slab(st
set_freepointer(s, last, page->end);
page->freelist = start;
- page->inuse = 0;
-out:
+
+ /*
+ * page->inuse must be 0 when PageSlab(page) becomes
+ * true so that defrag knows that this slab is not in use.
+ */
+ smp_wmb();
+ page->flags = state;
return page;
}
@@ -2622,6 +2628,92 @@ out:
}
#endif
+
+/*
+ * Check if the given state is that of a reclaimable slab page.
+ *
+ * This is only true if this is indeed a slab page and if
+ * the page has not been frozen.
+ */
+static inline int reclaimable_slab(unsigned long state)
+{
+ if (!(state & (1 << PG_slab)))
+ return 0;
+
+ if (state & FROZEN)
+ return 0;
+
+ return 1;
+}
+
+ /*
+ * Isolate a page from the slab partial lists. Return 0 if successful.
+ *
+ * After isolation the LRU field can be used to put the page onto
+ * a reclaim list.
+ */
+int kmem_cache_isolate_slab(struct page *page)
+{
+ unsigned long flags;
+ struct kmem_cache *s;
+ int rc = -ENOENT;
+ unsigned long state;
+
+ /*
+ * Avoid attempting to isolate the slab pages if there are
+ * indications that this will not be successful.
+ */
+ if (!reclaimable_slab(page->flags) || page_count(page) == 1)
+ return rc;
+
+ /*
+ * Get a reference to the page. Return if its freed or being freed.
+ * This is necessary to make sure that the page does not vanish
+ * from under us before we are able to check the result.
+ */
+ if (!get_page_unless_zero(page))
+ return rc;
+
+ local_irq_save(flags);
+ state = slab_lock(page);
+
+ /*
+ * Check the flags again now that we have locked it.
+ */
+ if (!reclaimable_slab(state) || !page->inuse) {
+ slab_unlock(page, state);
+ put_page(page);
+ goto out;
+ }
+
+ /*
+ * Drop reference count. There are object remaining and therefore
+ * the slab lock will have to be taken before the last object can
+ * be removed. We hold the slab lock, so no one can free this slab
+ * now.
+ *
+ * We set the slab frozen before releasing the lock. This means
+ * that no slab free action will be performed. If all objects are
+ * removed then the slab will be freed during kmem_cache_reclaim().
+ */
+ BUG_ON(page_count(page) <= 1);
+ put_page(page);
+
+ /*
+ * Remove the slab from the lists and mark it frozen
+ */
+ s = page->slab;
+ if (page->inuse < s->objects)
+ remove_partial(s, page);
+ else if (s->flags & SLAB_STORE_USER)
+ remove_full(s, page);
+ slab_unlock(page, state | FROZEN);
+ rc = 0;
+out:
+ local_irq_restore(flags);
+ return rc;
+}
+
/*
* Conversion table for small slabs sizes / 8 to the index in the
* kmalloc array. This is necessary for slabs < 192 since we have non power
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2007-11-06 13:50:47.000000000 -0800
+++ linux-2.6/mm/vmscan.c 2007-11-06 13:50:54.000000000 -0800
@@ -687,6 +687,7 @@ static int __isolate_lru_page(struct pag
*/
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
struct list_head *src, struct list_head *dst,
+ struct list_head *slab_pages,
unsigned long *scanned, int order, int mode)
{
unsigned long nr_taken = 0;
@@ -760,7 +761,13 @@ static unsigned long isolate_lru_pages(u
case -EBUSY:
/* else it is being freed elsewhere */
list_move(&cursor_page->lru, src);
+ break;
+
default:
+ if (slab_pages &&
+ kmem_cache_isolate_slab(cursor_page) == 0)
+ list_add(&cursor_page->lru,
+ slab_pages);
break;
}
}
@@ -796,6 +803,7 @@ static unsigned long shrink_inactive_lis
struct zone *zone, struct scan_control *sc)
{
LIST_HEAD(page_list);
+ LIST_HEAD(slab_list);
struct pagevec pvec;
unsigned long nr_scanned = 0;
unsigned long nr_reclaimed = 0;
@@ -813,7 +821,7 @@ static unsigned long shrink_inactive_lis
nr_taken = isolate_lru_pages(sc->swap_cluster_max,
&zone->inactive_list,
- &page_list, &nr_scan, sc->order,
+ &page_list, &slab_list, &nr_scan, sc->order,
(sc->order > PAGE_ALLOC_COSTLY_ORDER)?
ISOLATE_BOTH : ISOLATE_INACTIVE);
nr_active = clear_active_flags(&page_list);
@@ -824,6 +832,7 @@ static unsigned long shrink_inactive_lis
-(nr_taken - nr_active));
zone->pages_scanned += nr_scan;
spin_unlock_irq(&zone->lru_lock);
+ kmem_cache_reclaim(&slab_list);
nr_scanned += nr_scan;
nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
@@ -1029,7 +1038,7 @@ force_reclaim_mapped:
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
- &l_hold, &pgscanned, sc->order, ISOLATE_ACTIVE);
+ &l_hold, NULL, &pgscanned, sc->order, ISOLATE_ACTIVE);
zone->pages_scanned += pgscanned;
__mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
spin_unlock_irq(&zone->lru_lock);
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* [patch 23/23] SLUB: Add SlabReclaimable() to avoid repeated reclaim attempts
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (21 preceding siblings ...)
2007-11-07 1:11 ` [patch 22/23] SLUB: Slab reclaim through Lumpy reclaim Christoph Lameter
@ 2007-11-07 1:11 ` Christoph Lameter
2007-11-08 15:26 ` [patch 00/23] Slab defragmentation V6 Mel Gorman
23 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 1:11 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
[-- Attachment #1: 0013-slab_defrag_reclaim_flag.patch --]
[-- Type: text/plain, Size: 2635 bytes --]
Add a flag RECLAIMABLE that is set on slabs with a defragmentation method.
Clear the flag if a reclaim action is not successful in reducing the
number of objects in a slab.
The flag is set again when all objects of the slab have been
allocated and the slab is removed from the partial lists.
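In condensed form the life cycle of the flag is (sketch; the real
transitions are in the diff below):

	if (s->kick)			/* cache has a defrag method */
		state |= RECLAIMABLE;	/* set when the slab is created */

	if (leftover)			/* reclaim left objects behind */
		state &= ~RECLAIMABLE;	/* skip this slab from now on */

	if (s->kick && !(state & RECLAIMABLE))
		state |= RECLAIMABLE;	/* re-armed once the slab is full
					 * and leaves the partial list */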
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slub.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-06 17:06:46.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 17:07:54.000000000 -0800
@@ -102,6 +102,7 @@
#define FROZEN (1 << PG_active)
#define LOCKED (1 << PG_locked)
+#define RECLAIMABLE (1 << PG_dirty)
#ifdef CONFIG_SLUB_DEBUG
#define SLABDEBUG (1 << PG_error)
@@ -1100,6 +1101,8 @@ static noinline struct page *new_slab(st
if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
SLAB_STORE_USER | SLAB_TRACE))
state |= SLABDEBUG;
+ if (s->kick)
+ state |= RECLAIMABLE;
start = page_address(page);
page->end = start + 1;
@@ -1176,6 +1179,7 @@ static void discard_slab(struct kmem_cac
atomic_long_dec(&n->nr_slabs);
reset_page_mapcount(page);
+ page->flags &= ~RECLAIMABLE;
__ClearPageSlab(page);
free_slab(s, page);
}
@@ -1408,8 +1412,11 @@ static void unfreeze_slab(struct kmem_ca
if (page->freelist != page->end)
add_partial(s, page, tail);
- else
+ else {
add_full(s, page, state);
+ if (s->kick && !(state & RECLAIMABLE))
+ state |= RECLAIMABLE;
+ }
slab_unlock(page, state);
} else {
@@ -2633,7 +2640,7 @@ out:
* Check if the given state is that of a reclaimable slab page.
*
* This is only true if this is indeed a slab page and if
- * the page has not been frozen.
+ * the page has not been frozen or marked as unreclaimable.
*/
static inline int reclaimable_slab(unsigned long state)
{
@@ -2643,7 +2650,7 @@ static inline int reclaimable_slab(unsig
if (state & FROZEN)
return 0;
- return 1;
+ return state & RECLAIMABLE;
}
/*
@@ -2958,6 +2965,8 @@ out:
* Check the result and unfreeze the slab
*/
leftover = page->inuse;
+ if (leftover)
+ state &= ~RECLAIMABLE;
unfreeze_slab(s, page, leftover > 0, state);
local_irq_restore(flags);
return leftover;
@@ -3012,6 +3021,11 @@ static unsigned long __kmem_cache_shrink
if (!state)
continue;
+ if (!(state & RECLAIMABLE)) {
+ slab_unlock(page, state);
+ continue;
+ }
+
if (page->inuse) {
list_move(&page->lru, &zaplist);
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 09/23] SLUB: Add get() and kick() methods
2007-11-07 1:11 ` [patch 09/23] SLUB: Add get() and kick() methods Christoph Lameter
@ 2007-11-07 2:37 ` Adrian Bunk
2007-11-07 3:07 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Adrian Bunk @ 2007-11-07 2:37 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman, Rik van Riel
On Tue, Nov 06, 2007 at 05:11:39PM -0800, Christoph Lameter wrote:
> Add the two methods needed for defragmentation and add the display of the
> methods via the proc interface.
>
> Add documentation explaining the use of these methods.
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
> include/linux/slab.h | 3 +++
> include/linux/slub_def.h | 31 +++++++++++++++++++++++++++++++
> mm/slub.c | 32 ++++++++++++++++++++++++++++++--
> 3 files changed, 64 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/include/linux/slab.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slab.h 2007-10-17 13:35:53.000000000 -0700
> +++ linux-2.6/include/linux/slab.h 2007-11-06 12:37:51.000000000 -0800
> @@ -56,6 +56,9 @@ struct kmem_cache *kmem_cache_create(con
> void (*)(struct kmem_cache *, void *));
> void kmem_cache_destroy(struct kmem_cache *);
> int kmem_cache_shrink(struct kmem_cache *);
> +void kmem_cache_setup_defrag(struct kmem_cache *s,
> + void *(*get)(struct kmem_cache *, int nr, void **),
> + void (*kick)(struct kmem_cache *, int nr, void **, void *private));
> void kmem_cache_free(struct kmem_cache *, void *);
> unsigned int kmem_cache_size(struct kmem_cache *);
> const char *kmem_cache_name(struct kmem_cache *);
>...
A static inline dummy function for CONFIG_SLUB=n seems to be missing?
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 09/23] SLUB: Add get() and kick() methods
2007-11-07 2:37 ` Adrian Bunk
@ 2007-11-07 3:07 ` Christoph Lameter
2007-11-07 3:26 ` Adrian Bunk
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 3:07 UTC (permalink / raw)
To: Adrian Bunk; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman, Rik van Riel
On Wed, 7 Nov 2007, Adrian Bunk wrote:
> A static inline dummy function for CONFIG_SLUB=n seems to be missing?
Correct. This patch is needed so that building with SLAB will work.
Slab defrag: Provide empty kmem_cache_setup_defrag function for SLAB.
Provide an empty function to satisfy dependencies for Slab defrag.
Signed-off-by: Christoph Lameter <clameter@sgi.com>?
---
mm/slab.c | 7 +++++++
1 file changed, 7 insertions(+)
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2007-11-06 18:57:22.000000000 -0800
+++ linux-2.6/mm/slab.c 2007-11-06 18:58:40.000000000 -0800
@@ -2535,6 +2535,13 @@ static int __cache_shrink(struct kmem_ca
return (ret ? 1 : 0);
}
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+ void *(*get)(struct kmem_cache *, int nr, void **),
+ void (*kick)(struct kmem_cache *, int nr, void **, void *private))
+{
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
/**
* kmem_cache_shrink - Shrink a cache.
* @cachep: The cache to shrink.
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 09/23] SLUB: Add get() and kick() methods
2007-11-07 3:07 ` Christoph Lameter
@ 2007-11-07 3:26 ` Adrian Bunk
0 siblings, 0 replies; 77+ messages in thread
From: Adrian Bunk @ 2007-11-07 3:26 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman, Rik van Riel
On Tue, Nov 06, 2007 at 07:07:15PM -0800, Christoph Lameter wrote:
> On Wed, 7 Nov 2007, Adrian Bunk wrote:
>
> > A static inline dummy function for CONFIG_SLUB=n seems to be missing?
>
> Correct. This patch is needed so that building with SLAB will work.
>
> Slab defrag: Provide empty kmem_cache_setup_defrag function for SLAB.
>
> Provide an empty function to satisfy dependencies for Slab defrag.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>?
>
> ---
> mm/slab.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> Index: linux-2.6/mm/slab.c
> ===================================================================
> --- linux-2.6.orig/mm/slab.c 2007-11-06 18:57:22.000000000 -0800
> +++ linux-2.6/mm/slab.c 2007-11-06 18:58:40.000000000 -0800
> @@ -2535,6 +2535,13 @@ static int __cache_shrink(struct kmem_ca
> return (ret ? 1 : 0);
> }
>
> +void kmem_cache_setup_defrag(struct kmem_cache *s,
> + void *(*get)(struct kmem_cache *, int nr, void **),
> + void (*kick)(struct kmem_cache *, int nr, void **, void *private))
> +{
> +}
> +EXPORT_SYMBOL(kmem_cache_setup_defrag);
> +
- this misses slob
- this wastes memory
An empty static inline function in slab.h would be better.
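For illustration, such a dummy could look roughly like this (a sketch of
the suggestion, not code from the patchset):

	#ifdef CONFIG_SLUB
	void kmem_cache_setup_defrag(struct kmem_cache *s,
		void *(*get)(struct kmem_cache *, int nr, void **),
		void (*kick)(struct kmem_cache *, int nr, void **, void *private));
	#else
	static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
		void *(*get)(struct kmem_cache *, int nr, void **),
		void (*kick)(struct kmem_cache *, int nr, void **, void *private))
	{
	}
	#endif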
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 04/23] dentries: Extract common code to remove dentry from lru
2007-11-07 1:11 ` [patch 04/23] dentries: Extract common code to remove dentry from lru Christoph Lameter
@ 2007-11-07 8:50 ` Johannes Weiner
2007-11-07 9:43 ` Jörn Engel
2007-11-07 18:28 ` Christoph Lameter
0 siblings, 2 replies; 77+ messages in thread
From: Johannes Weiner @ 2007-11-07 8:50 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman
Hi Christoph,
On Tue, Nov 06, 2007 at 05:11:34PM -0800, Christoph Lameter wrote:
> @@ -613,11 +606,7 @@ static void shrink_dcache_for_umount_sub
> spin_lock(&dcache_lock);
> list_for_each_entry(loop, &dentry->d_subdirs,
> d_u.d_child) {
> - if (!list_empty(&loop->d_lru)) {
> - dentry_stat.nr_unused--;
> - list_del_init(&loop->d_lru);
> - }
> -
> + dentry_lru_remove(dentry);
Shouldn't this be dentry_lru_remove(loop)?
Hannes
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 07/23] SLUB: Add defrag_ratio field and sysfs support.
2007-11-07 1:11 ` [patch 07/23] SLUB: Add defrag_ratio field and sysfs support Christoph Lameter
@ 2007-11-07 8:55 ` Johannes Weiner
2007-11-07 18:30 ` Christoph Lameter
2007-11-08 15:07 ` Mel Gorman
1 sibling, 1 reply; 77+ messages in thread
From: Johannes Weiner @ 2007-11-07 8:55 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman
Hi Christoph,
On Tue, Nov 06, 2007 at 05:11:37PM -0800, Christoph Lameter wrote:
> --- linux-2.6.orig/include/linux/slub_def.h 2007-11-06 12:36:28.000000000 -0800
> +++ linux-2.6/include/linux/slub_def.h 2007-11-06 12:37:44.000000000 -0800
> @@ -53,6 +53,13 @@ struct kmem_cache {
> void (*ctor)(struct kmem_cache *, void *);
> int inuse; /* Offset to metadata */
> int align; /* Alignment */
> + int defrag_ratio; /*
> + * objects/possible-objects limit. If we have
> + * less that the specified percentage of
That should be `less than', I guess.
Hannes
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 12/23] SLUB: Trigger defragmentation from memory reclaim
2007-11-07 1:11 ` [patch 12/23] SLUB: Trigger defragmentation from memory reclaim Christoph Lameter
@ 2007-11-07 9:28 ` Johannes Weiner
2007-11-07 18:34 ` Christoph Lameter
2007-11-08 15:12 ` Mel Gorman
1 sibling, 1 reply; 77+ messages in thread
From: Johannes Weiner @ 2007-11-07 9:28 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman
Hi Christoph,
On Tue, Nov 06, 2007 at 05:11:42PM -0800, Christoph Lameter wrote:
> Index: linux-2.6/include/linux/slab.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slab.h 2007-11-06 12:37:51.000000000 -0800
> +++ linux-2.6/include/linux/slab.h 2007-11-06 12:53:40.000000000 -0800
> @@ -63,6 +63,7 @@ void kmem_cache_free(struct kmem_cache *
> unsigned int kmem_cache_size(struct kmem_cache *);
> const char *kmem_cache_name(struct kmem_cache *);
> int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
> +int kmem_cache_defrag(int node);
The definition in slab.c always returns 0. Wouldn't a static inline
function in the header be better?
> * Returns the number of slab objects which we shrunk.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> */
> unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> - unsigned long lru_pages)
> + unsigned long lru_pages, struct zone *zone)
> {
> struct shrinker *shrinker;
> unsigned long ret = 0;
> @@ -210,6 +218,8 @@ unsigned long shrink_slab(unsigned long
> shrinker->nr += total_scan;
> }
> up_read(&shrinker_rwsem);
> + if (gfp_mask & __GFP_FS)
> + kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
> return ret;
> }
What about the objects that kmem_cache_defrag() releases? Shouldn't
they be counted too?
ret += kmem_cache_defrag(...)
Or am I overseeing something here?
Hannes
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 04/23] dentries: Extract common code to remove dentry from lru
2007-11-07 8:50 ` Johannes Weiner
@ 2007-11-07 9:43 ` Jörn Engel
2007-11-07 18:55 ` Christoph Lameter
2007-11-07 18:28 ` Christoph Lameter
1 sibling, 1 reply; 77+ messages in thread
From: Jörn Engel @ 2007-11-07 9:43 UTC (permalink / raw)
To: Johannes Weiner
Cc: Christoph Lameter, akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 November 2007 09:50:27 +0100, Johannes Weiner wrote:
> On Tue, Nov 06, 2007 at 05:11:34PM -0800, Christoph Lameter wrote:
> > @@ -613,11 +606,7 @@ static void shrink_dcache_for_umount_sub
> > spin_lock(&dcache_lock);
> > list_for_each_entry(loop, &dentry->d_subdirs,
> > d_u.d_child) {
> > - if (!list_empty(&loop->d_lru)) {
> > - dentry_stat.nr_unused--;
> > - list_del_init(&loop->d_lru);
> > - }
> > -
> > + dentry_lru_remove(dentry);
>
> Shouldn't this be dentry_lru_remove(loop)?
Looks like it. Once this is fixed, feel free to add
Acked-by: Joern Engel <joern@logfs.org>
Jörn
--
It does not require a majority to prevail, but rather an irate,
tireless minority keen to set brush fires in people's minds.
-- Samuel Adams
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 14/23] inodes: Support generic defragmentation
2007-11-07 1:11 ` [patch 14/23] inodes: Support generic defragmentation Christoph Lameter
@ 2007-11-07 10:17 ` Jörn Engel
2007-11-07 10:31 ` Jörn Engel
` (2 more replies)
0 siblings, 3 replies; 77+ messages in thread
From: Jörn Engel @ 2007-11-07 10:17 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman
On Tue, 6 November 2007 17:11:44 -0800, Christoph Lameter wrote:
>
> +void *get_inodes(struct kmem_cache *s, int nr, void **v)
> +{
> + int i;
> +
> + spin_lock(&inode_lock);
> + for (i = 0; i < nr; i++) {
> + struct inode *inode = v[i];
> +
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> + v[i] = NULL;
> + else
> + __iget(inode);
> + }
> + spin_unlock(&inode_lock);
> + return NULL;
> +}
> +EXPORT_SYMBOL(get_inodes);
What purpose does the return type have?
> +/*
> + * Function for filesystems that embedd struct inode into their own
> + * structures. The offset is the offset of the struct inode in the fs inode.
> + */
> +void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
> + unsigned long offset)
> +{
> + int i;
> +
> + for (i = 0; i < nr; i++)
> + v[i] += offset;
> +
> + return get_inodes(s, nr, v);
> +}
> +EXPORT_SYMBOL(fs_get_inodes);
The fact that all pointers get changed makes me a bit uneasy:
struct foo_inode v[20];
...
fs_get_inodes(..., v, ...);
...
v[0].foo_field = bar;
No warning, but spectacular fireworks.
> +void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
> +{
> + struct inode *inode;
> + int i;
> + int abort = 0;
> + LIST_HEAD(freeable);
> + struct super_block *sb;
> +
> + for (i = 0; i < nr; i++) {
> + inode = v[i];
> + if (!inode)
> + continue;
NULL is legal here? Then fs_get_inodes should check for NULL as well
and not add the offset to NULL pointers, I guess.
> + if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> + if (remove_inode_buffers(inode))
> + invalidate_mapping_pages(&inode->i_data,
> + 0, -1);
This linebreak can be removed.
> + }
> +
> + /* Invalidate children and dentry */
> + if (S_ISDIR(inode->i_mode)) {
> + struct dentry *d = d_find_alias(inode);
> +
> + if (d) {
> + d_invalidate(d);
> + dput(d);
> + }
> + }
> +
> + if (inode->i_state & I_DIRTY)
> + write_inode_now(inode, 1);
Once more the three-bit I_DIRTY is used like a boolean value. I don't
hold it against you, specifically. A general review/cleanup is
necessary for that.
Jörn
--
"[One] doesn't need to know [...] how to cause a headache in order
to take an aspirin."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
--
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 14/23] inodes: Support generic defragmentation
2007-11-07 10:17 ` Jörn Engel
@ 2007-11-07 10:31 ` Jörn Engel
2007-11-07 10:35 ` Andreas Schwab
2007-11-07 18:40 ` Christoph Lameter
2 siblings, 0 replies; 77+ messages in thread
From: Jörn Engel @ 2007-11-07 10:31 UTC (permalink / raw)
To: Jörn Engel
Cc: Christoph Lameter, akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 November 2007 11:17:48 +0100, Jörn Engel wrote:
> > +/*
> > + * Function for filesystems that embedd struct inode into their own
> > + * structures. The offset is the offset of the struct inode in the fs inode.
> > + */
> > +void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
> > + unsigned long offset)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr; i++)
> > + v[i] += offset;
> > +
> > + return get_inodes(s, nr, v);
> > +}
> > +EXPORT_SYMBOL(fs_get_inodes);
>
> The fact that all pointers get changed makes me a bit uneasy:
> struct foo_inode v[20];
> ...
> fs_get_inodes(..., v, ...);
> ...
> v[0].foo_field = bar;
>
> No warning, but spectacular fireworks.
>
> > +void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
> > +{
> > + struct inode *inode;
> > + int i;
> > + int abort = 0;
> > + LIST_HEAD(freeable);
> > + struct super_block *sb;
> > +
> > + for (i = 0; i < nr; i++) {
> > + inode = v[i];
> > + if (!inode)
> > + continue;
>
> NULL is legal here? Then fs_get_inodes should check for NULL as well
> and not add the offset to NULL pointers, I guess.
Ignore these two comments. Reading further before making them would
have helped. ;)
Jörn
--
Fancy algorithms are slow when n is small, and n is usually small.
Fancy algorithms have big constants. Until you know that n is
frequently going to be big, don't get fancy.
-- Rob Pike
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 14/23] inodes: Support generic defragmentation
2007-11-07 10:17 ` Jörn Engel
2007-11-07 10:31 ` Jörn Engel
@ 2007-11-07 10:35 ` Andreas Schwab
2007-11-07 10:35 ` Jörn Engel
2007-11-07 18:40 ` Christoph Lameter
2 siblings, 1 reply; 77+ messages in thread
From: Andreas Schwab @ 2007-11-07 10:35 UTC (permalink / raw)
To: Jörn Engel
Cc: Christoph Lameter, akpm, linux-kernel, linux-mm, Mel Gorman
Jörn Engel <joern@logfs.org> writes:
> On Tue, 6 November 2007 17:11:44 -0800, Christoph Lameter wrote:
>>
>> +/*
>> + * Function for filesystems that embedd struct inode into their own
>> + * structures. The offset is the offset of the struct inode in the fs inode.
>> + */
>> +void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
>> + unsigned long offset)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < nr; i++)
>> + v[i] += offset;
>> +
>> + return get_inodes(s, nr, v);
>> +}
>> +EXPORT_SYMBOL(fs_get_inodes);
>
> The fact that all pointers get changed makes me a bit uneasy:
> struct foo_inode v[20];
> ...
> fs_get_inodes(..., v, ...);
> ...
> v[0].foo_field = bar;
>
> No warning, but spectacular fireworks.
You'll get a warning that struct foo_inode * is incompatible with void **.
Andreas.
--
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstrasse 5, 90409 Nurnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 14/23] inodes: Support generic defragmentation
2007-11-07 10:35 ` Andreas Schwab
@ 2007-11-07 10:35 ` Jörn Engel
0 siblings, 0 replies; 77+ messages in thread
From: Jörn Engel @ 2007-11-07 10:35 UTC (permalink / raw)
To: Andreas Schwab
Cc: Jörn Engel, Christoph Lameter, akpm, linux-kernel, linux-mm,
Mel Gorman
On Wed, 7 November 2007 11:35:13 +0100, Andreas Schwab wrote:
> >
> > The fact that all pointers get changed makes me a bit uneasy:
> > struct foo_inode v[20];
> > ...
> > fs_get_inodes(..., v, ...);
> > ...
> > v[0].foo_field = bar;
> >
> > No warning, but spectacular fireworks.
>
> You'll get a warning that struct foo_inode * is incompatible with void **.
- struct foo_inode v[20];
+ struct foo_inode *v[20];
Looks like my example needs a patch as well. Anyway, the function is
used in a way that makes this a non-issue.
Jörn
--
You cannot suppose that Moliere ever troubled himself to be original in the
matter of ideas. You cannot suppose that the stories he tells in his plays
have never been told before. They were culled, as you very well know.
-- Andre-Louis Moreau in Scarabouche
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 04/23] dentries: Extract common code to remove dentry from lru
2007-11-07 8:50 ` Johannes Weiner
2007-11-07 9:43 ` Jörn Engel
@ 2007-11-07 18:28 ` Christoph Lameter
1 sibling, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 18:28 UTC (permalink / raw)
To: akpm; +Cc: Johannes Weiner, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 Nov 2007, Johannes Weiner wrote:
> Hi Christoph,
>
> On Tue, Nov 06, 2007 at 05:11:34PM -0800, Christoph Lameter wrote:
> > @@ -613,11 +606,7 @@ static void shrink_dcache_for_umount_sub
> > spin_lock(&dcache_lock);
> > list_for_each_entry(loop, &dentry->d_subdirs,
> > d_u.d_child) {
> > - if (!list_empty(&loop->d_lru)) {
> > - dentry_stat.nr_unused--;
> > - list_del_init(&loop->d_lru);
> > - }
> > -
> > + dentry_lru_remove(dentry);
>
> Shouldn't this be dentry_lru_remove(loop)?
Correct. Andrew: This needs to go into your tree to fix the patch that is
already there:
[PATCH] dcache: use the correct variable.
We need to use "loop" instead of "dentry"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2007-11-07 10:26:20.000000000 -0800
+++ linux-2.6/fs/dcache.c 2007-11-07 10:26:27.000000000 -0800
@@ -610,7 +610,7 @@ static void shrink_dcache_for_umount_sub
spin_lock(&dcache_lock);
list_for_each_entry(loop, &dentry->d_subdirs,
d_u.d_child) {
- dentry_lru_remove(dentry);
+ dentry_lru_remove(loop);
__d_drop(loop);
cond_resched_lock(&dcache_lock);
}
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 07/23] SLUB: Add defrag_ratio field and sysfs support.
2007-11-07 8:55 ` Johannes Weiner
@ 2007-11-07 18:30 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 18:30 UTC (permalink / raw)
To: Johannes Weiner; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 Nov 2007, Johannes Weiner wrote:
> Hi Christoph,
>
> On Tue, Nov 06, 2007 at 05:11:37PM -0800, Christoph Lameter wrote:
> > --- linux-2.6.orig/include/linux/slub_def.h 2007-11-06 12:36:28.000000000 -0800
> > +++ linux-2.6/include/linux/slub_def.h 2007-11-06 12:37:44.000000000 -0800
> > @@ -53,6 +53,13 @@ struct kmem_cache {
> > void (*ctor)(struct kmem_cache *, void *);
> > int inuse; /* Offset to metadata */
> > int align; /* Alignment */
> > + int defrag_ratio; /*
> > + * objects/possible-objects limit. If we have
> > + * less that the specified percentage of
>
> That should be `less than', I guess.
Correct.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 12/23] SLUB: Trigger defragmentation from memory reclaim
2007-11-07 9:28 ` Johannes Weiner
@ 2007-11-07 18:34 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 18:34 UTC (permalink / raw)
To: Johannes Weiner; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 Nov 2007, Johannes Weiner wrote:
> > @@ -210,6 +218,8 @@ unsigned long shrink_slab(unsigned long
> > shrinker->nr += total_scan;
> > }
> > up_read(&shrinker_rwsem);
> > + if (gfp_mask & __GFP_FS)
> > + kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
> > return ret;
> > }
>
> What about the objects that kmem_cache_defrag() releases? Shouldn't
> they be counted too?
>
> ret += kmem_cache_defrag(...)
>
> Or am I overseeing something here?
kmem_cache_defrag returns the number of pages that were released by defrag
actions.
shrink_slab returns the number of objects released by the shrinkers.
kmem_cache_defrag has no way of knowing how many objects were released by
the kick methods. The kick method may have chosen to reallocate the
object.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 14/23] inodes: Support generic defragmentation
2007-11-07 10:17 ` Jörn Engel
2007-11-07 10:31 ` Jörn Engel
2007-11-07 10:35 ` Andreas Schwab
@ 2007-11-07 18:40 ` Christoph Lameter
2007-11-07 18:51 ` Jörn Engel
2 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 18:40 UTC (permalink / raw)
To: Jörn Engel; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 Nov 2007, Jörn Engel wrote:
> On Tue, 6 November 2007 17:11:44 -0800, Christoph Lameter wrote:
> >
> > +void *get_inodes(struct kmem_cache *s, int nr, void **v)
> > +{
> > + int i;
> > +
> > + spin_lock(&inode_lock);
> > + for (i = 0; i < nr; i++) {
> > + struct inode *inode = v[i];
> > +
> > + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> > + v[i] = NULL;
> > + else
> > + __iget(inode);
> > + }
> > + spin_unlock(&inode_lock);
> > + return NULL;
> > +}
> > +EXPORT_SYMBOL(get_inodes);
>
> What purpose does the return type have?
The pointer is for communication between the get and kick methods. get()
can modify kick() behavior by returning a pointer to a data structure or
using the pointer to set a flag. F.e. get() may discover that there is an
unreclaimable object and set a flag that causes kick to simply undo the
refcount increment. get() may build a map for the objects and indicate in
the map special treatment.
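As a rough sketch only (nothing below exists in the patch set; the struct, the
helpers try_to_pin_object() and reclaim_or_move_object(), and the assumption
that nr <= BITS_PER_LONG are all invented for illustration), a get()/kick()
pair could use the private pointer like this:

#include <linux/slab.h>

struct defrag_map {
	unsigned long unreclaimable;	/* bit i set: leave v[i] alone */
};

static void *example_get(struct kmem_cache *s, int nr, void **v)
{
	struct defrag_map *map = kzalloc(sizeof(*map), GFP_NOFS);
	int i;

	if (!map)
		return NULL;
	for (i = 0; i < nr; i++)
		if (!try_to_pin_object(v[i]))
			map->unreclaimable |= 1UL << i;
	return map;			/* handed to kick() as private */
}

static void example_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	struct defrag_map *map = private;
	int i;

	for (i = 0; i < nr; i++) {
		if (map && (map->unreclaimable & (1UL << i)))
			continue;	/* get() could not pin it, skip */
		reclaim_or_move_object(v[i]);
	}
	kfree(map);
}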
> > +void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
> > + unsigned long offset)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr; i++)
> > + v[i] += offset;
> > +
> > + return get_inodes(s, nr, v);
> > +}
> > +EXPORT_SYMBOL(fs_get_inodes);
>
> The fact that all pointers get changed makes me a bit uneasy:
> struct foo_inode v[20];
> ...
> fs_get_inodes(..., v, ...);
> ...
> v[0].foo_field = bar;
>
> No warning, but spectacular fireworks.
As far as I can remember, the core code always passes pointers to struct
inode to the filesystems. The filesystems will then recalculate the
pointers to point to the fs side of an inode.
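For illustration, a minimal sketch of how a filesystem with an embedded
struct inode might hook this up; "foofs" and struct foofs_inode_info are
invented names, and only the fs_get_inodes() signature is taken from the
patch quoted above:

#include <linux/fs.h>
#include <linux/stddef.h>

/* Hypothetical filesystem inode with the VFS inode embedded in it. */
struct foofs_inode_info {
	unsigned long	foo_flags;
	struct inode	vfs_inode;
};

/*
 * The slab objects handed in via v[] are struct foofs_inode_info;
 * fs_get_inodes() shifts each pointer forward by the offset so that
 * the generic inode code only ever sees struct inode pointers.
 */
static void *foofs_get_inodes(struct kmem_cache *s, int nr, void **v)
{
	return fs_get_inodes(s, nr, v,
			offsetof(struct foofs_inode_info, vfs_inode));
}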
> > +void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
> > +{
> > + struct inode *inode;
> > + int i;
> > + int abort = 0;
> > + LIST_HEAD(freeable);
> > + struct super_block *sb;
> > +
> > + for (i = 0; i < nr; i++) {
> > + inode = v[i];
> > + if (!inode)
> > + continue;
>
> NULL is legal here? Then fs_get_inodes should check for NULL as well
> and not add the offset to NULL pointers, I guess.
The get() method may have set a pointer to NULL. The fs_get_inodes() is
run at a time when all pointers are valid.
> > + }
> > +
> > + /* Invalidate children and dentry */
> > + if (S_ISDIR(inode->i_mode)) {
> > + struct dentry *d = d_find_alias(inode);
> > +
> > + if (d) {
> > + d_invalidate(d);
> > + dput(d);
> > + }
> > + }
> > +
> > + if (inode->i_state & I_DIRTY)
> > + write_inode_now(inode, 1);
>
> Once more the three-bit I_DIRTY is used like a boolean value. I don't
> hold it against you, specifically. A general review/cleanup is
> necessary for that.
Yeah. I'd be glad if someone could take this piece off my hands.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 14/23] inodes: Support generic defragmentation
2007-11-07 18:40 ` Christoph Lameter
@ 2007-11-07 18:51 ` Jörn Engel
2007-11-07 19:00 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Jörn Engel @ 2007-11-07 18:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jörn Engel, akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 November 2007 10:40:55 -0800, Christoph Lameter wrote:
On Wed, 7 Nov 2007, Jörn Engel wrote:
> > On Tue, 6 November 2007 17:11:44 -0800, Christoph Lameter wrote:
> > >
> > > +void *get_inodes(struct kmem_cache *s, int nr, void **v)
> > > +{
> > > + int i;
> > > +
> > > + spin_lock(&inode_lock);
> > > + for (i = 0; i < nr; i++) {
> > > + struct inode *inode = v[i];
> > > +
> > > + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> > > + v[i] = NULL;
> > > + else
> > > + __iget(inode);
> > > + }
> > > + spin_unlock(&inode_lock);
> > > + return NULL;
> > > +}
> > > +EXPORT_SYMBOL(get_inodes);
> >
> > What purpose does the return type have?
>
> The pointer is for communication between the get and kick methods. get()
> can modify kick() behavior by returning a pointer to a data structure or
> using the pointer to set a flag. F.e. get() may discover that there is an
> unreclaimable object and set a flag that causes kick to simply undo the
> refcount increment. get() may build a map for the objects and indicate in
> the map special treatment.
Is there a get/kick pair that actually does this? So far I haven't
found anything like it.
Also, something vaguely matching that paragraph might make sense in a
kerneldoc header to the function. ;)
Jörn
--
There is no worse hell than that provided by the regrets
for wasted opportunities.
-- Andre-Louis Moreau in Scarabouche
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 04/23] dentries: Extract common code to remove dentry from lru
2007-11-07 18:55 ` Christoph Lameter
@ 2007-11-07 18:54 ` Jörn Engel
2007-11-07 19:00 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Jörn Engel @ 2007-11-07 18:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jörn Engel, Johannes Weiner, akpm, linux-kernel, linux-mm,
Mel Gorman
On Wed, 7 November 2007 10:55:09 -0800, Christoph Lameter wrote:
>
> From: Christoph Lameter <clameter@sgi.com>
> Subject: dcache: use the correct variable.
>
> We need to use "loop" instead of "dentry"
>
> Acked-by: Joern Engel <joern@logfs.org>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> Index: linux-2.6/fs/dcache.c
> ===================================================================
> --- linux-2.6.orig/fs/dcache.c 2007-11-07 10:26:20.000000000 -0800
> +++ linux-2.6/fs/dcache.c 2007-11-07 10:26:27.000000000 -0800
> @@ -610,7 +610,7 @@ static void shrink_dcache_for_umount_sub
> spin_lock(&dcache_lock);
> list_for_each_entry(loop, &dentry->d_subdirs,
> d_u.d_child) {
> - dentry_lru_remove(dentry);
> + dentry_lru_remove(loop);
> __d_drop(loop);
> cond_resched_lock(&dcache_lock);
> }
Erm - wouldn't this break git-bisect?
Jörn
--
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 04/23] dentries: Extract common code to remove dentry from lru
2007-11-07 9:43 ` Jörn Engel
@ 2007-11-07 18:55 ` Christoph Lameter
2007-11-07 18:54 ` Jörn Engel
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 18:55 UTC (permalink / raw)
To: Jörn Engel; +Cc: Johannes Weiner, akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 Nov 2007, Jörn Engel wrote:
> Looks like it. Once this is fixed, feel free to add
> Acked-by: Joern Engel <joern@logfs.org>
It's in the slab defrag git now. I added the spelling fix and this one as a
result of the discussions today.
From: Christoph Lameter <clameter@sgi.com>
Subject: dcache: use the correct variable.
We need to use "loop" instead of "dentry"
Acked-by: Joern Engel <joern@logfs.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2007-11-07 10:26:20.000000000 -0800
+++ linux-2.6/fs/dcache.c 2007-11-07 10:26:27.000000000 -0800
@@ -610,7 +610,7 @@ static void shrink_dcache_for_umount_sub
spin_lock(&dcache_lock);
list_for_each_entry(loop, &dentry->d_subdirs,
d_u.d_child) {
- dentry_lru_remove(dentry);
+ dentry_lru_remove(loop);
__d_drop(loop);
cond_resched_lock(&dcache_lock);
}
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 14/23] inodes: Support generic defragmentation
2007-11-07 18:51 ` Jörn Engel
@ 2007-11-07 19:00 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 19:00 UTC (permalink / raw)
To: Jörn Engel; +Cc: akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 Nov 2007, Jörn Engel wrote:
> > The pointer is for communication between the get and kick methods. get()
> > can modify kick() behavior by returning a pointer to a data structure or
> > using the pointer to set a flag. F.e. get() may discover that there is an
> > unreclaimable object and set a flag that causes kick to simply undo the
> > refcount increment. get() may build a map for the objects and indicate in
> > the map special treatment.
>
> Is there a get/kick pair that actually does this? So far I haven't
> found anything like it.
Hmmm.. Nothing uses it at this point. I went through a series of get/kicks
during development. Some needed it. I suspect that we will need it when we
implement reallocation instead of simply reclaiming. It is also necessary
if we get into the situation where we want to optimize the reclaim. At
that point the kick method needs to know how far get() got before the
action was aborted in order to fix up only certain refcounts.
> Also, something vaguely matching that paragraph might make sense in a
> kerneldoc header to the function. ;)
It's described in slab.h.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 04/23] dentries: Extract common code to remove dentry from lru
2007-11-07 18:54 ` Jörn Engel
@ 2007-11-07 19:00 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 19:00 UTC (permalink / raw)
To: Jörn Engel; +Cc: Johannes Weiner, akpm, linux-kernel, linux-mm, Mel Gorman
On Wed, 7 Nov 2007, Jörn Engel wrote:
> > Acked-by: Joern Engel <joern@logfs.org>
> > Signed-off-by: Christoph Lameter <clameter@sgi.com>
> >
> > Index: linux-2.6/fs/dcache.c
> > ===================================================================
> > --- linux-2.6.orig/fs/dcache.c 2007-11-07 10:26:20.000000000 -0800
> > +++ linux-2.6/fs/dcache.c 2007-11-07 10:26:27.000000000 -0800
> > @@ -610,7 +610,7 @@ static void shrink_dcache_for_umount_sub
> > spin_lock(&dcache_lock);
> > list_for_each_entry(loop, &dentry->d_subdirs,
> > d_u.d_child) {
> > - dentry_lru_remove(dentry);
> > + dentry_lru_remove(loop);
> > __d_drop(loop);
> > cond_resched_lock(&dcache_lock);
> > }
>
> Erm - wouldn't this break git-bisect?
Well, Andrew will merge it into the earlier patch.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 11/23] SLUB: Slab defrag core
2007-11-07 1:11 ` [patch 11/23] SLUB: Slab defrag core Christoph Lameter
@ 2007-11-07 22:13 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-07 22:13 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, Mel Gorman
This was so far only in the slab defrag git tree.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-07 1:11 ` [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio Christoph Lameter
@ 2007-11-08 14:50 ` Mel Gorman
2007-11-08 17:25 ` Matt Mackall
2007-11-08 18:56 ` Christoph Lameter
0 siblings, 2 replies; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 14:50 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
On (06/11/07 17:11), Christoph Lameter didst pronounce:
> We need the defrag ratio for the non NUMA situation now. The NUMA defrag works
> by allocating objects from partial slabs on remote nodes. Rename it to
>
> remote_node_defrag_ratio
>
I'm not too keen on the defrag name here, largely because I cannot tell what
it has to do with defragmentation or ratios. It's really about working out
when it is better to pack objects into a remote slab than reclaim objects
from a local slab, right? It's also not clear what is a ratio of what.
I thought it might be clock cycles but that isn't very clear either.
If we are renaming this, can it be something like remote_packing_cost_limit?
> to be clear about this.
>
> [This patch is already in mm]
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
> include/linux/slub_def.h | 5 ++++-
> mm/slub.c | 17 +++++++++--------
> 2 files changed, 13 insertions(+), 9 deletions(-)
>
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h 2007-11-06 12:34:13.000000000 -0800
> +++ linux-2.6/include/linux/slub_def.h 2007-11-06 12:36:28.000000000 -0800
> @@ -60,7 +60,10 @@ struct kmem_cache {
> #endif
>
> #ifdef CONFIG_NUMA
> - int defrag_ratio;
> + /*
> + * Defragmentation by allocating from a remote node.
> + */
> + int remote_node_defrag_ratio;
How about
/*
* When packing objects into slabs, it may become necessary to
* reclaim objects on a local slab or allocate from a remote node.
* The remote_packing_cost_limit is the maximum cost of remote
* accesses that should be paid before it becomes worthwhile to
* reclaim instead
*/
int remote_packing_cost_limit;
?
I still don't see what get_cycles() has to do with anything but this
could be because my understanding of SLUB sucks.
> struct kmem_cache_node *node[MAX_NUMNODES];
> #endif
> #ifdef CONFIG_SMP
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2007-11-06 12:36:16.000000000 -0800
> +++ linux-2.6/mm/slub.c 2007-11-06 12:37:25.000000000 -0800
> @@ -1345,7 +1345,8 @@ static unsigned long get_any_partial(str
> * expensive if we do it every time we are trying to find a slab
> * with available objects.
> */
> - if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
> + if (!s->remote_node_defrag_ratio ||
> + get_cycles() % 1024 > s->remote_node_defrag_ratio)
I cannot figure out what the number of cycles currently showing on the TSC
has to do with a ratio :(. I could semi-understand if we were counting up
how many cycles were being spent trying to pack objects, but that does not
appear to be the case. The comment didn't help a whole lot either. It felt
like a cost for packing, not a ratio.
> return 0;
>
> zonelist = &NODE_DATA(slab_node(current->mempolicy))
> @@ -2363,7 +2364,7 @@ static int kmem_cache_open(struct kmem_c
>
> s->refcount = 1;
> #ifdef CONFIG_NUMA
> - s->defrag_ratio = 100;
> + s->remote_node_defrag_ratio = 100;
> #endif
> if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
> goto error;
> @@ -4005,21 +4006,21 @@ static ssize_t free_calls_show(struct km
> SLAB_ATTR_RO(free_calls);
>
> #ifdef CONFIG_NUMA
> -static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
> +static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
> {
> - return sprintf(buf, "%d\n", s->defrag_ratio / 10);
> + return sprintf(buf, "%d\n", s->remote_node_defrag_ratio / 10);
> }
>
> -static ssize_t defrag_ratio_store(struct kmem_cache *s,
> +static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
> const char *buf, size_t length)
> {
> int n = simple_strtoul(buf, NULL, 10);
>
> if (n < 100)
> - s->defrag_ratio = n * 10;
> + s->remote_node_defrag_ratio = n * 10;
> return length;
> }
> -SLAB_ATTR(defrag_ratio);
> +SLAB_ATTR(remote_node_defrag_ratio);
> #endif
>
> static struct attribute * slab_attrs[] = {
> @@ -4050,7 +4051,7 @@ static struct attribute * slab_attrs[] =
> &cache_dma_attr.attr,
> #endif
> #ifdef CONFIG_NUMA
> - &defrag_ratio_attr.attr,
> + &remote_node_defrag_ratio_attr.attr,
> #endif
> NULL
> };
>
> --
>
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 06/23] SLUB: Extend slabinfo to support -D and -C options
2007-11-07 1:11 ` [patch 06/23] SLUB: Extend slabinfo to support -D and -C options Christoph Lameter
@ 2007-11-08 15:00 ` Mel Gorman
0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 15:00 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
On (06/11/07 17:11), Christoph Lameter didst pronounce:
> -D lists caches that support defragmentation
>
> -C lists caches that use a ctor.
>
> Change field names for defrag_ratio and remote_node_defrag_ratio.
>
> Add determination of the allocation ratio for a slab. The allocation ratio
> is the percentage of available slots for objects in use.
>
Total aside, is there any plan to merge slabinfo or something similar into
procps? Alternatively, is a compatibility /proc/slabinfo in the works
anywhere? Latest git doesn't appear to have anything on it, but I could easily
have missed it in -mm or somewhere else.
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
> Documentation/vm/slabinfo.c | 52 +++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 44 insertions(+), 8 deletions(-)
>
> Index: linux-2.6.23-mm1/Documentation/vm/slabinfo.c
> ===================================================================
> --- linux-2.6.23-mm1.orig/Documentation/vm/slabinfo.c 2007-10-12 16:25:54.000000000 -0700
> +++ linux-2.6.23-mm1/Documentation/vm/slabinfo.c 2007-10-12 17:58:14.000000000 -0700
> @@ -31,6 +31,8 @@ struct slabinfo {
> int hwcache_align, object_size, objs_per_slab;
> int sanity_checks, slab_size, store_user, trace;
> int order, poison, reclaim_account, red_zone;
> + int defrag, ctor;
> + int defrag_ratio, remote_node_defrag_ratio;
> unsigned long partial, objects, slabs;
> int numa[MAX_NODES];
> int numa_partial[MAX_NODES];
> @@ -57,6 +59,8 @@ int show_slab = 0;
> int skip_zero = 1;
> int show_numa = 0;
> int show_track = 0;
> +int show_defrag = 0;
> +int show_ctor = 0;
> int show_first_alias = 0;
> int validate = 0;
> int shrink = 0;
> @@ -91,18 +95,20 @@ void fatal(const char *x, ...)
> void usage(void)
> {
> printf("slabinfo 5/7/2007. (c) 2007 sgi. clameter@sgi.com\n\n"
> - "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
> + "slabinfo [-aCDefhilnosSrtTvz1] [-d debugopts] [slab-regexp]\n"
> "-a|--aliases Show aliases\n"
> + "-C|--ctor Show slabs with ctors\n"
> "-d<options>|--debug=<options> Set/Clear Debug options\n"
> - "-e|--empty Show empty slabs\n"
> + "-D|--defrag Show defragmentable caches\n"
> + "-e|--empty Show empty slabs\n"
> "-f|--first-alias Show first alias\n"
> "-h|--help Show usage information\n"
> "-i|--inverted Inverted list\n"
> "-l|--slabs Show slabs\n"
> "-n|--numa Show NUMA information\n"
> - "-o|--ops Show kmem_cache_ops\n"
> + "-o|--ops Show kmem_cache_ops\n"
> "-s|--shrink Shrink slabs\n"
> - "-r|--report Detailed report on single slabs\n"
> + "-r|--report Detailed report on single slabs\n"
> "-S|--Size Sort by size\n"
> "-t|--tracking Show alloc/free information\n"
> "-T|--Totals Show summary information\n"
> @@ -282,7 +288,7 @@ int line = 0;
> void first_line(void)
> {
> printf("Name Objects Objsize Space "
> - "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
> + "Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n");
> }
>
> /*
> @@ -325,7 +331,7 @@ void slab_numa(struct slabinfo *s, int m
> return;
>
> if (!line) {
> - printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
> + printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
> for(node = 0; node <= highest_node; node++)
> printf(" %4d", node);
> printf("\n----------------------");
> @@ -334,6 +340,7 @@ void slab_numa(struct slabinfo *s, int m
> printf("\n");
> }
> printf("%-21s ", mode ? "All slabs" : s->name);
> + printf("%3d ", s->remote_node_defrag_ratio);
> for(node = 0; node <= highest_node; node++) {
> char b[20];
>
> @@ -407,6 +414,8 @@ void report(struct slabinfo *s)
> printf("** Slabs are destroyed via RCU\n");
> if (s->reclaim_account)
> printf("** Reclaim accounting active\n");
> + if (s->defrag)
> + printf("** Defragmentation at %d%%\n", s->defrag_ratio);
>
> printf("\nSizes (bytes) Slabs Debug Memory\n");
> printf("------------------------------------------------------------------------\n");
> @@ -453,6 +462,12 @@ void slabcache(struct slabinfo *s)
> if (show_empty && s->slabs)
> return;
>
> + if (show_defrag && !s->defrag)
> + return;
> +
> + if (show_ctor && !s->ctor)
> + return;
> +
> store_size(size_str, slab_size(s));
> snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs, s->partial, s->cpu_slabs);
>
> @@ -463,6 +478,10 @@ void slabcache(struct slabinfo *s)
> *p++ = '*';
> if (s->cache_dma)
> *p++ = 'd';
> + if (s->defrag)
> + *p++ = 'D';
> + if (s->ctor)
> + *p++ = 'C';
> if (s->hwcache_align)
> *p++ = 'A';
> if (s->poison)
> @@ -482,7 +501,7 @@ void slabcache(struct slabinfo *s)
> printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
> s->name, s->objects, s->object_size, size_str, dist_str,
> s->objs_per_slab, s->order,
> - s->slabs ? (s->partial * 100) / s->slabs : 100,
> + s->slabs ? (s->objects * 100) / (s->slabs * s->objs_per_slab) : 100,
> s->slabs ? (s->objects * s->object_size * 100) /
> (s->slabs * (page_size << s->order)) : 100,
> flags);
> @@ -1074,7 +1093,16 @@ void read_slab_dir(void)
> free(t);
> slab->store_user = get_obj("store_user");
> slab->trace = get_obj("trace");
> + slab->defrag_ratio = get_obj("defrag_ratio");
> + slab->remote_node_defrag_ratio =
> + get_obj("remote_node_defrag_ratio");
> chdir("..");
> + if (read_slab_obj(slab, "ops")) {
> + if (strstr(buffer, "ctor :"))
> + slab->ctor = 1;
> + if (strstr(buffer, "kick :"))
> + slab->defrag = 1;
> + }
> if (slab->name[0] == ':')
> alias_targets++;
> slab++;
> @@ -1124,7 +1152,9 @@ void output_slabs(void)
>
> struct option opts[] = {
> { "aliases", 0, NULL, 'a' },
> + { "ctor", 0, NULL, 'C' },
> { "debug", 2, NULL, 'd' },
> + { "defrag", 0, NULL, 'D' },
> { "empty", 0, NULL, 'e' },
> { "first-alias", 0, NULL, 'f' },
> { "help", 0, NULL, 'h' },
> @@ -1149,7 +1179,7 @@ int main(int argc, char *argv[])
>
> page_size = getpagesize();
>
> - while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzTS",
> + while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzCDTS",
> opts, NULL)) != -1)
> switch (c) {
> case '1':
> @@ -1199,6 +1229,12 @@ int main(int argc, char *argv[])
> case 'z':
> skip_zero = 0;
> break;
> + case 'C':
> + show_ctor = 1;
> + break;
> + case 'D':
> + show_defrag = 1;
> + break;
> case 'T':
> show_totals = 1;
> break;
>
> --
>
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 07/23] SLUB: Add defrag_ratio field and sysfs support.
2007-11-07 1:11 ` [patch 07/23] SLUB: Add defrag_ratio field and sysfs support Christoph Lameter
2007-11-07 8:55 ` Johannes Weiner
@ 2007-11-08 15:07 ` Mel Gorman
2007-11-08 18:59 ` Christoph Lameter
1 sibling, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 15:07 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
On (06/11/07 17:11), Christoph Lameter didst pronounce:
> The defrag_ratio is used to set the threshold at which defragmentation
> should be run on a slabcache.
>
I'm thick; I would like to see a quick note here on what defragmentation
means. Also, this defrag_ratio seems to have a significantly different
meaning from the other defrag_ratio, which isn't helping my poor head at
all.
"The defrag_ratio sets a threshold at which a slab will be vacated of all
its objects and the pages freed during memory reclaim."
?
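(A worked illustration, not taken from the patch itself: with the default of
30 mentioned in the changelog quoted below, a cache whose slabs hold 20
objects each and which currently has 30 slabs but only 150 live objects uses
150 of 600 available slots, i.e. 25%. That is below the 30% threshold, so its
sparsely populated slabs would become candidates for reclaim the next time
defragmentation runs.)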
> The allocation ratio is measured in a percentage of the available slots.
> The percentage will be lower for slabs that are more fragmented.
>
> Add a defrag ratio field and set it to 30% by default. A limit of 30% specified
> that less than 3 out of 10 available slots for objects are in use before
> reclaim occurs.
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
> include/linux/slub_def.h | 7 +++++++
> mm/slub.c | 18 ++++++++++++++++++
> 2 files changed, 25 insertions(+)
>
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h 2007-11-06 12:36:28.000000000 -0800
> +++ linux-2.6/include/linux/slub_def.h 2007-11-06 12:37:44.000000000 -0800
> @@ -53,6 +53,13 @@ struct kmem_cache {
> void (*ctor)(struct kmem_cache *, void *);
> int inuse; /* Offset to metadata */
> int align; /* Alignment */
> + int defrag_ratio; /*
> + * objects/possible-objects limit. If we have
> + * less that the specified percentage of
> + * objects allocated then defrag passes
> + * will start to occur during reclaim.
> + */
> +
> const char *name; /* Name (only for display!) */
> struct list_head list; /* List of slab caches */
> #ifdef CONFIG_SLUB_DEBUG
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2007-11-06 12:37:25.000000000 -0800
> +++ linux-2.6/mm/slub.c 2007-11-06 12:37:44.000000000 -0800
> @@ -2363,6 +2363,7 @@ static int kmem_cache_open(struct kmem_c
> goto error;
>
> s->refcount = 1;
> + s->defrag_ratio = 30;
> #ifdef CONFIG_NUMA
> s->remote_node_defrag_ratio = 100;
> #endif
> @@ -4005,6 +4006,22 @@ static ssize_t free_calls_show(struct km
> }
> SLAB_ATTR_RO(free_calls);
>
> +static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
> +{
> + return sprintf(buf, "%d\n", s->defrag_ratio);
> +}
> +
> +static ssize_t defrag_ratio_store(struct kmem_cache *s,
> + const char *buf, size_t length)
> +{
> + int n = simple_strtoul(buf, NULL, 10);
> +
> + if (n < 100)
> + s->defrag_ratio = n;
> + return length;
> +}
> +SLAB_ATTR(defrag_ratio);
> +
> #ifdef CONFIG_NUMA
> static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
> {
> @@ -4047,6 +4064,7 @@ static struct attribute * slab_attrs[] =
> &shrink_attr.attr,
> &alloc_calls_attr.attr,
> &free_calls_attr.attr,
> + &defrag_ratio_attr.attr,
> #ifdef CONFIG_ZONE_DMA
> &cache_dma_attr.attr,
> #endif
>
> --
>
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 12/23] SLUB: Trigger defragmentation from memory reclaim
2007-11-07 1:11 ` [patch 12/23] SLUB: Trigger defragmentation from memory reclaim Christoph Lameter
2007-11-07 9:28 ` Johannes Weiner
@ 2007-11-08 15:12 ` Mel Gorman
2007-11-08 19:00 ` Christoph Lameter
1 sibling, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 15:12 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
On (06/11/07 17:11), Christoph Lameter didst pronounce:
> This patch triggers slab defragmentation from memory reclaim.
> The logical point for this is after slab shrinking was performed in
> vmscan.c. At that point the fragmentation ratio of a slab was increased
> because objects were freed via the LRUs. So we call kmem_cache_defrag from
> there.
>
> slab_shrink() from vmscan.c is called in some contexts to do
> global shrinking of slabs and in others to do shrinking for
> a particular zone. Pass the zone to slab_shrink, so that slab_shrink
> can call kmem_cache_defrag() and restrict the defragmentation to
> the node that is under memory pressure.
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
> fs/drop_caches.c | 2 +-
> include/linux/mm.h | 2 +-
> include/linux/slab.h | 1 +
> mm/vmscan.c | 26 +++++++++++++++++++-------
> 4 files changed, 22 insertions(+), 9 deletions(-)
>
> Index: linux-2.6/fs/drop_caches.c
> ===================================================================
> --- linux-2.6.orig/fs/drop_caches.c 2007-08-29 19:30:53.000000000 -0700
> +++ linux-2.6/fs/drop_caches.c 2007-11-06 12:53:40.000000000 -0800
> @@ -50,7 +50,7 @@ void drop_slab(void)
> int nr_objects;
>
> do {
> - nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
> + nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
> } while (nr_objects > 10);
> }
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h 2007-11-06 12:33:55.000000000 -0800
> +++ linux-2.6/include/linux/mm.h 2007-11-06 12:54:11.000000000 -0800
> @@ -1118,7 +1118,7 @@ int in_gate_area_no_task(unsigned long a
> int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
> void __user *, size_t *, loff_t *);
> unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> - unsigned long lru_pages);
> + unsigned long lru_pages, struct zone *z);
> void drop_pagecache(void);
> void drop_slab(void);
>
> Index: linux-2.6/include/linux/slab.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slab.h 2007-11-06 12:37:51.000000000 -0800
> +++ linux-2.6/include/linux/slab.h 2007-11-06 12:53:40.000000000 -0800
> @@ -63,6 +63,7 @@ void kmem_cache_free(struct kmem_cache *
> unsigned int kmem_cache_size(struct kmem_cache *);
> const char *kmem_cache_name(struct kmem_cache *);
> int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
> +int kmem_cache_defrag(int node);
>
> /*
> * Please use this macro to create slab caches. Simply specify the
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c 2007-10-25 18:28:41.000000000 -0700
> +++ linux-2.6/mm/vmscan.c 2007-11-06 12:55:25.000000000 -0800
> @@ -150,10 +150,18 @@ EXPORT_SYMBOL(unregister_shrinker);
> * are eligible for the caller's allocation attempt. It is used for balancing
> * slab reclaim versus page reclaim.
> *
> + * zone is the zone for which we are shrinking the slabs. If the intent
> + * is to do a global shrink then zone may be NULL. Specification of a
> + * zone is currently only used to limit slab defragmentation to a NUMA node.
> + * The performace of shrink_slab would be better (in particular under NUMA)
> + * if it could be targeted as a whole to the zone that is under memory
> + * pressure but the VFS infrastructure does not allow that at the present
> + * time.
> + *
> * Returns the number of slab objects which we shrunk.
> */
> unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> - unsigned long lru_pages)
> + unsigned long lru_pages, struct zone *zone)
> {
> struct shrinker *shrinker;
> unsigned long ret = 0;
> @@ -210,6 +218,8 @@ unsigned long shrink_slab(unsigned long
> shrinker->nr += total_scan;
> }
> up_read(&shrinker_rwsem);
> + if (gfp_mask & __GFP_FS)
> + kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
Does this make an assumption that only filesystem-related slabs may be
targeted for reclaim? What if there is a slab that can free its objects
without ever caring about a filesystem?
> return ret;
> }
>
> @@ -1241,7 +1251,7 @@ unsigned long try_to_free_pages(struct z
> if (!priority)
> disable_swap_token();
> nr_reclaimed += shrink_zones(priority, zones, &sc);
> - shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
> + shrink_slab(sc.nr_scanned, gfp_mask, lru_pages, NULL);
> if (reclaim_state) {
> nr_reclaimed += reclaim_state->reclaimed_slab;
> reclaim_state->reclaimed_slab = 0;
> @@ -1419,7 +1429,7 @@ loop_again:
> nr_reclaimed += shrink_zone(priority, zone, &sc);
> reclaim_state->reclaimed_slab = 0;
> nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
> - lru_pages);
> + lru_pages, zone);
> nr_reclaimed += reclaim_state->reclaimed_slab;
> total_scanned += sc.nr_scanned;
> if (zone_is_all_unreclaimable(zone))
> @@ -1658,7 +1668,7 @@ unsigned long shrink_all_memory(unsigned
> /* If slab caches are huge, it's better to hit them first */
> while (nr_slab >= lru_pages) {
> reclaim_state.reclaimed_slab = 0;
> - shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
> + shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL);
> if (!reclaim_state.reclaimed_slab)
> break;
>
> @@ -1696,7 +1706,7 @@ unsigned long shrink_all_memory(unsigned
>
> reclaim_state.reclaimed_slab = 0;
> shrink_slab(sc.nr_scanned, sc.gfp_mask,
> - count_lru_pages());
> + count_lru_pages(), NULL);
> ret += reclaim_state.reclaimed_slab;
> if (ret >= nr_pages)
> goto out;
> @@ -1713,7 +1723,8 @@ unsigned long shrink_all_memory(unsigned
> if (!ret) {
> do {
> reclaim_state.reclaimed_slab = 0;
> - shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
> + shrink_slab(nr_pages, sc.gfp_mask,
> + count_lru_pages(), NULL);
> ret += reclaim_state.reclaimed_slab;
> } while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
> }
> @@ -1875,7 +1886,8 @@ static int __zone_reclaim(struct zone *z
> * Note that shrink_slab will free memory on all zones and may
> * take a long time.
> */
> - while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
> + while (shrink_slab(sc.nr_scanned, gfp_mask, order,
> + zone) &&
> zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
> slab_reclaimable - nr_pages)
> ;
>
> --
>
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 20/23] dentries: Add constructor
2007-11-07 1:11 ` [patch 20/23] dentries: Add constructor Christoph Lameter
@ 2007-11-08 15:23 ` Mel Gorman
2007-11-08 19:03 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 15:23 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
On (06/11/07 17:11), Christoph Lameter didst pronounce:
> In order to support defragmentation on the dentry cache we need to have
> a determined object state at all times. Without a constructor the object
> would have a random state after allocation.
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> So provide a constructor.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
Seems to be some garbling there in the signed-off lines.
> ---
> fs/dcache.c | 26 ++++++++++++++------------
> 1 file changed, 14 insertions(+), 12 deletions(-)
>
> Index: linux-2.6/fs/dcache.c
> ===================================================================
> --- linux-2.6.orig/fs/dcache.c 2007-11-06 12:56:56.000000000 -0800
> +++ linux-2.6/fs/dcache.c 2007-11-06 12:57:01.000000000 -0800
> @@ -870,6 +870,16 @@ static struct shrinker dcache_shrinker =
> .seeks = DEFAULT_SEEKS,
> };
>
> +void dcache_ctor(struct kmem_cache *s, void *p)
> +{
> + struct dentry *dentry = p;
> +
> + spin_lock_init(&dentry->d_lock);
> + dentry->d_inode = NULL;
> + INIT_LIST_HEAD(&dentry->d_lru);
> + INIT_LIST_HEAD(&dentry->d_alias);
> +}
> +
Is there any noticeable overhead to the constructor?
> /**
> * d_alloc - allocate a dcache entry
> * @parent: parent of entry to allocate
> @@ -907,8 +917,6 @@ struct dentry *d_alloc(struct dentry * p
>
> atomic_set(&dentry->d_count, 1);
> dentry->d_flags = DCACHE_UNHASHED;
> - spin_lock_init(&dentry->d_lock);
> - dentry->d_inode = NULL;
> dentry->d_parent = NULL;
> dentry->d_sb = NULL;
> dentry->d_op = NULL;
> @@ -918,9 +926,7 @@ struct dentry *d_alloc(struct dentry * p
> dentry->d_cookie = NULL;
> #endif
> INIT_HLIST_NODE(&dentry->d_hash);
> - INIT_LIST_HEAD(&dentry->d_lru);
> INIT_LIST_HEAD(&dentry->d_subdirs);
> - INIT_LIST_HEAD(&dentry->d_alias);
>
> if (parent) {
> dentry->d_parent = dget(parent);
> @@ -2096,14 +2102,10 @@ static void __init dcache_init(void)
> {
> int loop;
>
> - /*
> - * A constructor could be added for stable state like the lists,
> - * but it is probably not worth it because of the cache nature
> - * of the dcache.
> - */
> - dentry_cache = KMEM_CACHE(dentry,
> - SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
> -
> + dentry_cache = kmem_cache_create("dentry_cache", sizeof(struct dentry),
> + 0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
> + dcache_ctor);
> +
> register_shrinker(&dcache_shrinker);
>
> /* Hash may have been set up in dcache_init_early */
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 00/23] Slab defragmentation V6
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
` (22 preceding siblings ...)
2007-11-07 1:11 ` [patch 23/23] SLUB: Add SlabReclaimable() to avoid repeated reclaim attempts Christoph Lameter
@ 2007-11-08 15:26 ` Mel Gorman
2007-11-08 16:01 ` Plans for Onezonelist patch series ??? Lee Schermerhorn
2007-11-08 19:12 ` [patch 00/23] Slab defragmentation V6 Christoph Lameter
23 siblings, 2 replies; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 15:26 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
On Tue, 2007-11-06 at 17:11 -0800, Christoph Lameter wrote:
> Slab defragmentation is mainly an issue if Linux is used as a fileserver
Was hoping this would get renamed to SLUB Targeted Reclaim from the
discussions at the VM Summit. As no copying is taking place, calling it
defragmentation is confusing, to me anyway. Not a major deal, but it made
reading the patches a little confusing.
> and large amounts of dentries, inodes and buffer heads accumulate. In some
> load situations the slabs become very sparsely populated so that a lot of
> memory is wasted by slabs that only contain one or a few objects. In
> extreme cases the performance of a machine will become sluggish since
> we are continually running reclaim. Slab defragmentation adds the
> capability to recover wasted memory.
>
When first reading this, I expected to find how slab objects get copied
around and packed, which is my problem with the defragmentation name.
Again, not really that relevant to the code.
> With lumpy reclaim slab defragmentation can be used to enhance the
> ability to recover larger contiguous areas of memory. Lumpy reclaim currently
> cannot do anything if a slab page is encountered. With slab defragmentation
> that slab page can be removed and a large contiguous page freed. It may
> be possible to have slab pages also part of ZONE_MOVABLE (Mel's defrag
> scheme in 2.6.23)
More terminology nitpicking: ZONE_MOVABLE is not defragmenting anything.
It's just partitioning memory. The slab pages need to be 100%
reclaimable or movable for that to happen, but even with targeted
reclaim, some dentries such as the root directory one cannot be
reclaimed, right?
>
> or the MOVABLE areas (antifrag patches in mm).
>
It'd still be valid to leave them as MIGRATE_RECLAIMABLE because that is
what they are. Arguably, MIGRATE_RECLAIMABLE could be dropped in its
entirety, but I'd rather not, as reclaimable blocks have significantly
different reclaim costs to pages that are currently marked movable.
> The patchset is also available via git
>
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git defrag
>
>
> Currently memory reclaim from the following slab caches is possible:
>
> 1. dentry cache
> 2. inode cache (with a generic interface to allow easy setup of more
> filesystems than the currently supported ext2/3/4 reiserfs, XFS
> and proc)
> 3. buffer_heads
>
> One typical mechanism that triggers slab defragmentation on my systems
> is the daily run of
>
> updatedb
>
> Updatedb scans all files on the system which causes a high inode and dentry
> use. After updatedb is complete we need to go back to the regular use
> patterns (typical on my machine: kernel compiles). Those need the memory now
> for different purposes. The inodes and dentries used for updatedb will
> gradually be aged by the dentry/inode reclaim algorithm which will free
> up the dentries and inode entries randomly through the slabs that were
> allocated. As a result the slabs will become sparsely populated. If they
> become empty then they can be freed but a lot of them will remain sparsely
> populated. That is where slab defrag comes in: It removes the slabs with
> just a few entries reclaiming more memory for other uses.
>
> V5->V6
> - Rediff against 2.6.24-rc2 + mm slub patches.
> - Add reviewed by lines.
> - Take out the experimental code to make slab pages movable. That
> has to wait until this has been considered by Mel.
>
I still haven't considered them properly. I've been backlogged for I
don't know how long at this point, and this is on the increasingly large
todo list :(. I don't believe it is massively urgent at the moment,
though, and reclaiming to start with is perfectly adequate, just as lumpy
reclaim is fine at the moment.
> V4->V5:
> - Support lumpy reclaim for slabs
> - Support reclaim via slab_shrink()
> - Add constructors to insure a consistent object state at all times.
>
> V3->V4:
> - Optimize scan for slabs that need defragmentation
> - Add /sys/slab/*/defrag_ratio to allow setting defrag limits
> per slab.
> - Add support for buffer heads.
> - Describe how the cleanup after the daily updatedb can be
> improved by slab defragmentation.
>
> V2->V3
> - Support directory reclaim
> - Add infrastructure to trigger defragmentation after slab shrinking if we
> have slabs with a high degree of fragmentation.
>
> V1->V2
> - Clean up control flow using a state variable. Simplify API. Back to 2
> functions that now take arrays of objects.
> - Inode defrag support for a set of filesystems
> - Fix up dentry defrag support to work on negative dentries by adding
> a new dentry flag that indicates that a dentry is not in the process
> of being freed or allocated.
>
^ permalink raw reply [flat|nested] 77+ messages in thread
* Plans for Onezonelist patch series ???
2007-11-08 15:26 ` [patch 00/23] Slab defragmentation V6 Mel Gorman
@ 2007-11-08 16:01 ` Lee Schermerhorn
2007-11-08 18:34 ` Christoph Lameter
2007-11-08 18:39 ` Mel Gorman
2007-11-08 19:12 ` [patch 00/23] Slab defragmentation V6 Christoph Lameter
1 sibling, 2 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-11-08 16:01 UTC (permalink / raw)
To: Mel Gorman; +Cc: Christoph Lameter, akpm, linux-kernel, linux-mm
Mel [anyone?]
Do you know what the plans are for your "onezonelist" patch series?
Are they going into -mm for, maybe, .25? Or have they been dropped?
I carry the last posting in my mempolicy tree--sometimes below my
patches; sometimes above. Our patches touch some of the same places in
mempolicy.c and require reject resolution when changing the order. I
could save Andrew some work by holding off and doing the rebase myself
if I knew that your patches were going to be in the next -mm.
Regards,
Lee
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 14:50 ` Mel Gorman
@ 2007-11-08 17:25 ` Matt Mackall
2007-11-08 19:16 ` Christoph Lameter
2007-11-08 18:56 ` Christoph Lameter
1 sibling, 1 reply; 77+ messages in thread
From: Matt Mackall @ 2007-11-08 17:25 UTC (permalink / raw)
To: Mel Gorman; +Cc: Christoph Lameter, akpm, linux-kernel, linux-mm
On Thu, Nov 08, 2007 at 02:50:44PM +0000, Mel Gorman wrote:
> > - if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
> > + if (!s->remote_node_defrag_ratio ||
> > + get_cycles() % 1024 > s->remote_node_defrag_ratio)
>
> I cannot figure out what the number of cycles currently showing on the TSC
> have to do with a ratio :(. I could semi-understand if we were counting up
> how many cycles were being spent trying to pack objects but that does not
> appear to be the case. The comment didn't help a whole lot either. It felt
> like a cost for packing, not a ratio
It's just a random number generator. And a bad one: lots of arches
return 0. And I believe at least one of them has some NUMA support.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: Plans for Onezonelist patch series ???
2007-11-08 16:01 ` Plans for Onezonelist patch series ??? Lee Schermerhorn
@ 2007-11-08 18:34 ` Christoph Lameter
2007-11-08 18:40 ` Mel Gorman
2007-11-08 18:39 ` Mel Gorman
1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 18:34 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Lee Schermerhorn wrote:
> Do you know what the plans are for your "onezonelist" patch series?
I wonder too what's going on. I thought they were ready for merging, but I
did not see a repost after the last round of comments.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: Plans for Onezonelist patch series ???
2007-11-08 16:01 ` Plans for Onezonelist patch series ??? Lee Schermerhorn
2007-11-08 18:34 ` Christoph Lameter
@ 2007-11-08 18:39 ` Mel Gorman
2007-11-08 19:39 ` Christoph Lameter
1 sibling, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 18:39 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Christoph Lameter, akpm, linux-kernel, linux-mm
On (08/11/07 11:01), Lee Schermerhorn didst pronounce:
> Mel [anyone?]
>
> Do you know what the plans are for your "onezonelist" patch series?
>
I was holding off trying to add new features to current mainline or -mm as
there were a number of stability issues and one-zonelist touches a number
of areas. Minimally, I was waiting for another -mm to come out and rebase
to that. I'll rebase to latest git tomorrow, see how that looks, and post
it if it passes regression tests on Monday.
> Are they going into -mm for, maybe, .25? Or have they been dropped.
>
> I carry the last posting in my mempolicy tree--sometimes below my
> patches; sometimes above. Our patches touch some of the same places in
> mempolicy.c and require reject resolution when changing the order. I
> can save Andrew some work if I knew that your patches were going to be
> in the next -mm by holding off and doing the rebase myself.
>
The one-zonelist stuff is likely to be more controversial than what you
are doing. It may be best if the one-zonelist patches are based on top
of yours rather than the other way around.
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: Plans for Onezonelist patch series ???
2007-11-08 18:34 ` Christoph Lameter
@ 2007-11-08 18:40 ` Mel Gorman
2007-11-08 18:43 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 18:40 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, akpm, linux-kernel, linux-mm
On (08/11/07 10:34), Christoph Lameter didst pronounce:
> On Thu, 8 Nov 2007, Lee Schermerhorn wrote:
>
> > Do you know what the plans are for your "onezonelist" patch series?
>
> I wonder too what's going on. I thought they were ready for merging, but I
> did not see a repost after the last round of comments.
>
There were two bugs that were resolved, but I didn't repost after that as
mainline + -mm had gone to hell in a hand-basket and I didn't want to
add to the mess.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: Plans for Onezonelist patch series ???
2007-11-08 18:40 ` Mel Gorman
@ 2007-11-08 18:43 ` Christoph Lameter
2007-11-08 20:06 ` Mel Gorman
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 18:43 UTC (permalink / raw)
To: Mel Gorman; +Cc: Lee Schermerhorn, akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> There were two bugs that were resolved, but I didn't repost after that as
> mainline + -mm had gone to hell in a hand-basket and I didn't want to
> add to the mess.
Hell? I must have missed it.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 14:50 ` Mel Gorman
2007-11-08 17:25 ` Matt Mackall
@ 2007-11-08 18:56 ` Christoph Lameter
2007-11-08 20:10 ` Mel Gorman
1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 18:56 UTC (permalink / raw)
To: Mel Gorman; +Cc: akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> On (06/11/07 17:11), Christoph Lameter didst pronounce:
> > We need the defrag ratio for the non NUMA situation now. The NUMA defrag works
> > by allocating objects from partial slabs on remote nodes. Rename it to
> >
> > remote_node_defrag_ratio
> >
>
> I'm not too keen on the defrag name here largely because I cannot tell what
> it has to do with defragmentation or ratios. It's really about working out
> when it is better to pack objects into a remote slab than reclaim objects
> from a local slab, right? It's also not clear what it is a ratio of what to
> what. I thought it might be clock cycles but that isn't very clear either.
> If we are renaming this can it be something like remote_packing_cost_limit ?
In a NUMA situation we have a choice between
1. Allocating a page from the local node (which consumes more memory and
is advantageous performance-wise).
2. Not allocating from the local node and instead checking whether any other
node has partially allocated slabs available. If we allocate from them then
we save memory and reduce the number of partial slabs on the remote
node. Thus the fragmentation ratio is reduced (see the sketch below).
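To make that concrete, here is a minimal sketch of how such a knob can gate
the decision. This is illustrative only, not the SLUB source: the helper
name, the 0..1024 scale and the choice of entropy source are assumptions for
the example (get_cycles() is the cheap source discussed elsewhere in this
thread).

/*
 * Illustrative sketch, not the actual allocator code: a cheap
 * pseudo-random value is compared against the per-cache ratio so that,
 * on average, only a configurable fraction of allocations go hunting
 * for remote partial slabs instead of taking a fresh local page.
 */
static int should_try_remote_partial(unsigned long cheap_random,
				     unsigned int remote_node_defrag_ratio)
{
	if (!remote_node_defrag_ratio)
		return 0;	/* never leave the local node */

	return (cheap_random % 1024) <= remote_node_defrag_ratio;
}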
> How about
>
> /*
> * When packing objects into slabs, it may become necessary to
> * reclaim objects on a local slab or allocate from a remote node.
> * The remote_packing_cost_limit is the maximum cost of remote
> * accesses that should be paid before it becomes worthwhile to
> * reclaim instead
> */
> int remote_packing_cost_limit;
>
> ?
That is not what this is about. And the functionality has been in SLUB
since the beginning.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 07/23] SLUB: Add defrag_ratio field and sysfs support.
2007-11-08 15:07 ` Mel Gorman
@ 2007-11-08 18:59 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 18:59 UTC (permalink / raw)
To: Mel Gorman; +Cc: akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> On (06/11/07 17:11), Christoph Lameter didst pronounce:
> > The defrag_ratio is used to set the threshold at which defragmentation
> > should be run on a slabcache.
> >
>
> I'm thick, I would like to see a quick note here on what defragmentation
> means. Also, this defrag_ratio seems to have a significantly different
> meaning to the other defrag_ratio which isn't helping my poor head at
> all.
Yes, that is why they have different names. The remote_node_defrag_ratio
controls the number of remote allocations we do to reduce fragmentation.
> "The defrag_ratio sets a threshold at which a slab will be vacated of all
> it's objects and the pages freed during memory reclaim."
Sort of. If a slab is beyond the threshold during reclaim then reclaim will
attempt to free the remaining objects in the slab so that the whole slab
can be reclaimed.
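A rough sketch of that check, with assumed semantics and invented names
(this is not a quote from the patch), would compare the slab's usage against
the per-cache percentage:

/*
 * Assumed semantics for illustration only: a partially used slab page
 * becomes a defrag candidate when its usage drops to or below
 * defrag_ratio, expressed as a percentage of the slab's object capacity.
 */
static int slab_is_defrag_candidate(unsigned int objects_in_use,
				    unsigned int objects_per_slab,
				    unsigned int defrag_ratio)
{
	return objects_in_use * 100 <= objects_per_slab * defrag_ratio;
}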
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 12/23] SLUB: Trigger defragmentation from memory reclaim
2007-11-08 15:12 ` Mel Gorman
@ 2007-11-08 19:00 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 19:00 UTC (permalink / raw)
To: Mel Gorman; +Cc: akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> > up_read(&shrinker_rwsem);
> > + if (gfp_mask & __GFP_FS)
> > + kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
>
> Does this make an assumption that only filesystem-related slabs may be
> targetted for reclaim? What if there is a slab that can free its objects
> without ever caring about a filesystem?
Correct. Currently only filesystem-related slabs support slab defrag.
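A hedged paraphrase of the hook quoted above (the wrapper name is invented;
the guard and node selection follow the posted hunk) shows why the __GFP_FS
test matters: evicting dentries, inodes or buffer heads can call back into
filesystem code, so defragmentation is only triggered when the allocation
context allows that.

/*
 * Sketch of the reclaim hook quoted above. maybe_defrag_slabs() is an
 * invented name; -1 means "any node", otherwise defrag is restricted to
 * the node the zone belongs to.
 */
static void maybe_defrag_slabs(gfp_t gfp_mask, struct zone *zone)
{
	if (!(gfp_mask & __GFP_FS))
		return;		/* caller may not re-enter filesystem code */

	kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
}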
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 20/23] dentries: Add constructor
2007-11-08 15:23 ` Mel Gorman
@ 2007-11-08 19:03 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 19:03 UTC (permalink / raw)
To: Mel Gorman; +Cc: akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> > Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> Seems to be some garbling in there in the signed-off lines.
Yes, that needs to be fixed.
> > +void dcache_ctor(struct kmem_cache *s, void *p)
> > +{
> > + struct dentry *dentry = p;
> > +
> > + spin_lock_init(&dentry->d_lock);
> > + dentry->d_inode = NULL;
> > + INIT_LIST_HEAD(&dentry->d_lru);
> > + INIT_LIST_HEAD(&dentry->d_alias);
> > +}
> > +
>
> Is there any noticable overhead to the constructor?
It's a minor performance win since we can avoid reinitializing these
values and zeroing the object on alloc when we hand out an object that
was allocated before.
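For context, a cache with such a constructor would be registered roughly as
below. This uses the 2.6.24-era five-argument kmem_cache_create(); the flags
and the init function name are illustrative, not a claim about the actual
dcache setup.

/*
 * Illustrative registration only. Because every object is returned to
 * the constructed state before it is freed, a later allocation can hand
 * the object out again without re-running this initialization.
 */
static struct kmem_cache *dentry_cache;

static void __init example_dcache_init(void)
{
	dentry_cache = kmem_cache_create("dentry", sizeof(struct dentry), 0,
					 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC,
					 dcache_ctor);
}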
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 00/23] Slab defragmentation V6
2007-11-08 15:26 ` [patch 00/23] Slab defragmentation V6 Mel Gorman
2007-11-08 16:01 ` Plans for Onezonelist patch series ??? Lee Schermerhorn
@ 2007-11-08 19:12 ` Christoph Lameter
2007-11-08 20:24 ` Mel Gorman
2007-11-08 20:58 ` Lee Schermerhorn
1 sibling, 2 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 19:12 UTC (permalink / raw)
To: Mel Gorman; +Cc: akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> On Tue, 2007-11-06 at 17:11 -0800, Christoph Lameter wrote:
> > Slab defragmentation is mainly an issue if Linux is used as a fileserver
>
> Was hoping this would get renamed to SLUB Targetted Reclaim from
> discussions at VM Summit. As no copying is taking place, it's confusing
> to call it defragmentation to me anyway. Not a major deal but it made
> reading the patches a little confusing.
The problem is that people are focusing on one feature here and forgetting
about the rest. Targetted reclaim is one feature that was added later when
lumpy reclaim was added to the kernel. The primary intent of this patchset
was always to reduce fragmentation. The name is appropriate and the
patchset will support copying of objects as soon as support for that is
added to kick(). In that case the copying you are looking for will be
there. The simple implementation of the kick() methods is to copy
pieces of the reclaim code. That is what is included here.
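To make the two-phase model concrete, here is a rough sketch of the callback
shape being discussed. The signatures follow the descriptions in this series,
but pin_object() and evict_object() are hypothetical stand-ins, not real
kernel functions, and the exact interface is the one defined by patch 09.

/*
 * Rough sketch only. get() pins the candidate objects so they cannot be
 * freed underneath the defrag pass; kick() then tries to get rid of each
 * one. A later kick() implementation could relocate objects instead of
 * evicting them, which is the "copying" case mentioned above.
 */
static void *example_get(struct kmem_cache *s, int nr, void **v)
{
	int i;

	for (i = 0; i < nr; i++)
		pin_object(v[i]);

	return NULL;		/* no private state needed in this sketch */
}

static void example_kick(struct kmem_cache *s, int nr, void **v,
			 void *private)
{
	int i;

	for (i = 0; i < nr; i++)
		evict_object(v[i]);	/* or, in the future, move it */
}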
> > With lumpy reclaim slab defragmentation can be used to enhance the
> > ability to recover larger contiguous areas of memory. Lumpy reclaim currently
> > cannot do anything if a slab page is encountered. With slab defragmentation
> > that slab page can be removed and a large contiguous page freed. It may
> > be possible to have slab pages also part of ZONE_MOVABLE (Mel's defrag
> > scheme in 2.6.23)
>
> More terminology nit-pick - ZONE_MOVABLE is not defragmenting anything.
> It's just partitioning memory. The slab pages need to be 100%
> reclaimable or movable for that to happen but even with targetted
> reclaim, some dentries such as the root directory one cannot be
> reclaimed, right?
100%? I am so fond of these categorical statements ....
ZONE_MOVABLE also contains mlocked pages that are also not reclaimable.
The question is at what level would it be possible to make them MOVABLE?
It may take some improvements to the kick() methods to make eviction more
reliable. Allowing the moving of objects in the kick() methods will
likely get us there.
> It'd still be valid to leave them as MIGRATE_RECLAIMABLE because that is
> what they are. Arguably, MIGRATE_RECLAIMABLE could be dropped in its
> entirety but I'd rather not as reclaimable blocks have significantly
> different reclaim costs to pages that are currently marked movable.
Right. That would simplify the antifrag methods. Is there any way to
measure the reclaim costs?
> > V5->V6
> > - Rediff against 2.6.24-rc2 + mm slub patches.
> > - Add reviewed by lines.
> > - Take out the experimental code to make slab pages movable. That
> > has to wait until this has been considered by Mel.
> >
>
> I still haven't considered them properly. I've been backlogged for I
> don't know how long at this point and this is on the increasingly large
> todo list :( . I don't believe it is massively urgent at the moment
> though and reclaiming to start with is perfectly adequate just as lumpy
> reclaim is fine at the moment.
Right. We can defer this for now.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 17:25 ` Matt Mackall
@ 2007-11-08 19:16 ` Christoph Lameter
2007-11-08 19:47 ` Matt Mackall
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 19:16 UTC (permalink / raw)
To: Matt Mackall; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Matt Mackall wrote:
> > I cannot figure out what the number of cycles currently showing on the TSC
> > have to do with a ratio :(. I could semi-understand if we were counting up
> > how many cycles were being spent trying to pack objects but that does not
> > appear to be the case. The comment didn't help a whole lot either. It felt
> > like a cost for packing, not a ratio
>
> It's just a random number generator. And a bad one: lots of arches
> return 0. And I believe at least one of them has some NUMA support.
Do we have a better one? Something with minimal processing overhead? I'd
be glad to switch it.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: Plans for Onezonelist patch series ???
2007-11-08 18:39 ` Mel Gorman
@ 2007-11-08 19:39 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 19:39 UTC (permalink / raw)
To: Mel Gorman; +Cc: Lee Schermerhorn, akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> I was holding off trying to add new features to current mainline or -mm as
> there were a number of stability issues and one-zonelist touches a number
> of areas. Minimally, I was waiting for another -mm to come out and rebase
> to that. I'll rebase to latest git tomorrow, see how that looks and post
> it if passes regression tests on Monday.
Ahh. Great. I am also impatiently waiting for that patchset. I was tempted
several times this week to just pick up where you left off...
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 19:16 ` Christoph Lameter
@ 2007-11-08 19:47 ` Matt Mackall
2007-11-08 20:01 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Matt Mackall @ 2007-11-08 19:47 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, Nov 08, 2007 at 11:16:33AM -0800, Christoph Lameter wrote:
> On Thu, 8 Nov 2007, Matt Mackall wrote:
>
> > > I cannot figure out what the number of cycles currently showing on the TSC
> > > have to do with a ratio :(. I could semi-understand if we were counting up
> > > how many cycles were being spent trying to pack objects but that does not
> > > appear to be the case. The comment didn't help a whole lot either. It felt
> > > like a cost for packing, not a ratio
> >
> > It's just a random number generator. And a bad one: lots of arches
> > return 0. And I believe at least one of them has some NUMA support.
>
> Do we have a better one? Something with minimal processing overhead? I'd
> be glad to switch it.
Not really. drivers/char/random.c does:
__get_cpu_var(trickle_count)++ & 0xfff
for a similar purpose.
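In other words, something along these lines; the per-CPU variable name is
made up and this is only an illustration of the trickle_count style of cheap
randomness, not a proposal for random.c.

/*
 * Per-CPU "cheap randomness" in the style of the trickle_count trick:
 * low cost and low quality, but good enough for spreading decisions
 * around without reading the TSC. Assumes the caller deals with
 * preemption around the increment.
 */
static DEFINE_PER_CPU(unsigned int, cheap_rand_count);

static inline unsigned int cheap_rand(void)
{
	return __get_cpu_var(cheap_rand_count)++ & 0xfff;
}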
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 19:47 ` Matt Mackall
@ 2007-11-08 20:01 ` Christoph Lameter
2007-11-08 21:03 ` Matt Mackall
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 20:01 UTC (permalink / raw)
To: Matt Mackall; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Matt Mackall wrote:
> Not really. drivers/char/random.c does:
>
> __get_cpu_var(trickle_count)++ & 0xfff
That is incremented on each call to add_timer_randomness. Not a high
enough resolution there. I guess I am stuck with get_cycles().
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: Plans for Onezonelist patch series ???
2007-11-08 18:43 ` Christoph Lameter
@ 2007-11-08 20:06 ` Mel Gorman
2007-11-08 20:20 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 20:06 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, akpm, linux-kernel, linux-mm
On (08/11/07 10:43), Christoph Lameter didst pronounce:
> On Thu, 8 Nov 2007, Mel Gorman wrote:
>
> > There were two bugs that were resolved, but I didn't repost after that as
> > mainline + -mm had gone to hell in a hand-basket and I didn't want to
> > add to the mess.
>
> Hell? I must have missed it.
>
Some time after rc1, things appeared to be in a mess - at least I didn't have
much luck figuring out what was going on when I looked. Admittedly, being
very ill at the time, I didn't spend much effort on it. Either way, things
were churning enough that there seemed to be plenty going on without adding
one-zonelist to the mix.
I've rebased the patches to mm-broken-out-2007-11-06-02-32. However, the
vanilla -mm and the one with onezonelist applied are locking up in the
same manner. I'm way too behind at the moment to guess if it is a new bug
or reported already. At best, I can say the patches are not making things
any worse :) I'll go through the archives in the morning and do a bit more
testing to see what happens.
In case this is familiar to people, the lockup I see is:
[ 115.548908] BUG: spinlock bad magic on CPU#0, sshd/2752
[ 115.611371] lock: c20029c8, .magic: ffffffff, .owner: <none>/-1, .owner_cpu: -1066669496
[ 115.709027] [<c010526a>] show_trace_log_lvl+0x1a/0x30
[ 115.770560] [<c0105c02>] show_trace+0x12/0x20
[ 115.823787] [<c0105d16>] dump_stack+0x16/0x20
[ 115.877011] [<c022c226>] spin_bug+0x96/0xf0
[ 115.928172] [<c022c429>] _raw_spin_lock+0x69/0x140
[ 115.986580] [<c033f05f>] _spin_lock+0x4f/0x60
[ 116.039809] [<c0224bae>] kobject_add+0x4e/0x1a0
[ 116.095112] [<c01314b4>] uids_user_create+0x54/0x80
[ 116.154555] [<c01318e2>] alloc_uid+0xd2/0x150
[ 116.207784] [<c01356db>] set_user+0x2b/0xb0
[ 116.258951] [<c01373c1>] sys_setreuid+0x141/0x150
[ 116.316305] [<c010429e>] syscall_call+0x7/0xb
[ 116.369544] =======================
[ 127.680346] BUG: soft lockup - CPU#0 stuck for 11s! [sshd:2752]
[ 127.750987]
[ 127.768781] Pid: 2752, comm: sshd Not tainted (2.6.24-rc1-mm1 #1)
[ 127.841498] EIP: 0060:[<c02298c1>] EFLAGS: 00000246 CPU: 0
[ 127.906948] EIP is at delay_tsc+0x1/0x20
[ 127.953754] EAX: 00000001 EBX: c20029c8 ECX: b0953e83 EDX: 2b35cb6d
[ 128.028533] ESI: 04b81d83 EDI: 00000000 EBP: c30f1ec4 ESP: c30f1ebc
[ 128.103305] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 128.167717] CR0: 80050033 CR2: b7e45544 CR3: 02153000 CR4: 00000690
[ 128.242490] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 128.317253] DR6: ffff0ff0 DR7: 00000400
[ 128.363010] [<c010526a>] show_trace_log_lvl+0x1a/0x30
[ 128.424541] [<c0105c02>] show_trace+0x12/0x20
[ 128.477767] [<c010250c>] show_regs+0x1c/0x20
[ 128.529948] [<c015ab6b>] softlockup_tick+0x11b/0x150
[ 128.590446] [<c0130fb2>] run_local_timers+0x12/0x20
[ 128.649888] [<c0131182>] update_process_times+0x42/0x90
[ 128.713461] [<c01440f5>] tick_periodic+0x25/0x80
[ 128.769812] [<c0144169>] tick_handle_periodic+0x19/0x80
[ 128.833397] [<c0107519>] timer_interrupt+0x49/0x50
[ 128.891802] [<c015aeb8>] handle_IRQ_event+0x28/0x60
[ 128.951246] [<c015c7f8>] handle_level_irq+0x78/0xe0
[ 129.010693] [<c01065f0>] do_IRQ+0x40/0x80
[ 129.059782] [<c0104c5f>] common_interrupt+0x23/0x28
[ 129.119229] [<c022c472>] _raw_spin_lock+0xb2/0x140
[ 129.177635] [<c033f05f>] _spin_lock+0x4f/0x60
[ 129.230851] [<c0224bae>] kobject_add+0x4e/0x1a0
[ 129.286175] [<c01314b4>] uids_user_create+0x54/0x80
[ 129.345598] [<c01318e2>] alloc_uid+0xd2/0x150
[ 129.398830] [<c01356db>] set_user+0x2b/0xb0
[ 129.450005] [<c01373c1>] sys_setreuid+0x141/0x150
[ 129.507380] [<c010429e>] syscall_call+0x7/0xb
[ 129.560605] =======================
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 18:56 ` Christoph Lameter
@ 2007-11-08 20:10 ` Mel Gorman
0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 20:10 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
On (08/11/07 10:56), Christoph Lameter didst pronounce:
> On Thu, 8 Nov 2007, Mel Gorman wrote:
>
> > On (06/11/07 17:11), Christoph Lameter didst pronounce:
> > > We need the defrag ratio for the non NUMA situation now. The NUMA defrag works
> > > by allocating objects from partial slabs on remote nodes. Rename it to
> > >
> > > remote_node_defrag_ratio
> > >
> >
> > I'm not too keen on the defrag name here largely because I cannot tell what
> > it has to do with defragmentation or ratios. It's really about working out
> > when it is better to pack objects into a remote slab than reclaim objects
> > from a local slab, right? It's also not clear what it is a ratio of what to
> > what. I thought it might be clock cycles but that isn't very clear either.
> > If we are renaming this can it be something like remote_packing_cost_limit ?
>
> In a NUMA situation we have a choice between
>
> 1. Allocating a page from the local node (which consumes more memory and
> is advantageous performance-wise).
>
> 2. Not allocating from the local node and instead checking whether any other
> node has partially allocated slabs available. If we allocate from them then
> we save memory and reduce the number of partial slabs on the remote
> node. Thus the fragmentation ratio is reduced.
>
Ok, I get the logic somewhat now, thanks.
> > How about
> >
> > /*
> > * When packing objects into slabs, it may become necessary to
> > * reclaim objects on a local slab or allocate from a remote node.
> > * The remote_packing_cost_limit is the maximum cost of remote
> > * accesses that should be paid before it becomes worthwhile to
> > * reclaim instead
> > */
> > int remote_packing_cost_limit;
> >
> > ?
>
> That is not what this is about. And the functionality has been in SLUB
> since the beginning.
>
Yeah, my understanding of SLUB is crap. Sorry for the noise.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: Plans for Onezonelist patch series ???
2007-11-08 20:06 ` Mel Gorman
@ 2007-11-08 20:20 ` Christoph Lameter
2007-11-08 20:29 ` Mel Gorman
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 20:20 UTC (permalink / raw)
To: Mel Gorman; +Cc: Lee Schermerhorn, akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> I've rebased the patches to mm-broken-out-2007-11-06-02-32. However, the
> vanilla -mm and the one with onezonelist applied are locking up in the
> same manner. I'm way too behind at the moment to guess if it is a new bug
> or reported already. At best, I can say the patches are not making things
> any worse :) I'll go through the archives in the morning and do a bit more
> testing to see what happens.
I usually base my patches on Linus' tree as long as there is no tree
available from Andrew. But that means that I may have to
approximate what is in there by adding this and that.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 00/23] Slab defragmentation V6
2007-11-08 19:12 ` [patch 00/23] Slab defragmentation V6 Christoph Lameter
@ 2007-11-08 20:24 ` Mel Gorman
2007-11-08 20:28 ` Christoph Lameter
2007-11-08 20:58 ` Lee Schermerhorn
1 sibling, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 20:24 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm
On (08/11/07 11:12), Christoph Lameter didst pronounce:
> On Thu, 8 Nov 2007, Mel Gorman wrote:
>
> > On Tue, 2007-11-06 at 17:11 -0800, Christoph Lameter wrote:
> > > Slab defragmentation is mainly an issue if Linux is used as a fileserver
> >
> > Was hoping this would get renamed to SLUB Targetted Reclaim from
> > discussions at VM Summit. As no copying is taking place, it's confusing
> > to call it defragmentation to me anyway. Not a major deal but it made
> > reading the patches a little confusing.
>
> The problem is that people are focusing on one feature here and forgetting
> about the rest. Targetted reclaim is one feature that was added later when
> lumpy reclaim was added to the kernel. The primary intent of this patchset
> was always to reduce fragmentation. The name is appropriate and the
> patchset will support copying of objects as soon as support for that is
> added to kick(). In that case the copying you are looking for will be
> there. The simple implementation of the kick() methods is to copy
> pieces of the reclaim code. That is what is included here.
>
Ok, fair enough logic, and it's a bit clearer in my head now how to separate
them out. Thanks
> > > With lumpy reclaim slab defragmentation can be used to enhance the
> > > ability to recover larger contiguous areas of memory. Lumpy reclaim currently
> > > cannot do anything if a slab page is encountered. With slab defragmentation
> > > that slab page can be removed and a large contiguous page freed. It may
> > > be possible to have slab pages also part of ZONE_MOVABLE (Mel's defrag
> > > scheme in 2.6.23)
> >
> > More terminology nit-pick - ZONE_MOVABLE is not defragmenting anything.
> > It's just partitioning memory. The slab pages need to be 100%
> > reclaimable or movable for that to happen but even with targetted
> > reclaim, some dentries such as the root directory one cannot be
> > reclaimed, right?
>
> 100%? I am so fond of these categorical statements ....
>
Yeah, they are great for all occasions.
In fairness, when the time comes, I can do a few tests using the hugepage
allocation tests with ZONE_MOVABLE, and Badari might do a few tests with
memory hot-remove. Currently, the success rates for these tests are 100%
within ZONE_MOVABLE, although that is without locked pages. Hot-remove
should be able to deal with locked pages, but hugepage allocation wouldn't,
as lumpy reclaim would fail. If we allow slab pages to use the zone and the
success rates drop, it'll be obvious, which is a plus at least.
> ZONE_MOVABLE also contains mlocked pages that are also not reclaimable.
True, but they are movable, so for example memory hot-remove is able to
deal with them, and the memory compaction patches should have been able
to deal with them too.
> The question is at what level would it be possible to make them MOVABLE?
> It may take some improvements to the kick() methods to make eviction more
> reliable. Allowing the moving of objects in the kick() methods will
> likely get us there.
>
It certainly can be tried out. However, this is a future problem and
independent of the current patchset. I don't want to drag us down a blind
alley about a problem that isn't even at hand.
Right now, I think the set looks in good shape for wider testing and appears
to solve a major part of the slab fragmentation problem. Assuming I don't
fall down a hole testing one-zonelist and the mm-broken-out patches, I'll
get to testing these patches as well.
> > It'd still be valid to leave them as MIGRATE_RECLAIMABLE because that is
> > what they are. Arguably, MIGRATE_RECLAIMABLE could be dropped in its
> > entirety but I'd rather not as reclaimable blocks have significantly
> > different reclaim costs to pages that are currently marked movable.
>
> Right. That would simplify the antifrag methods. Is there any way to
> measure the reclaim costs?
>
Regrettably, no.
> > > V5->V6
> > > - Rediff against 2.6.24-rc2 + mm slub patches.
> > > - Add reviewed by lines.
> > > - Take out the experimental code to make slab pages movable. That
> > > has to wait until this has been considered by Mel.
> > >
> >
> > I still haven't considered them properly. I've been backlogged for I
> > don't know how long at this point and this is on the increasingly large
> > todo list :( . I don't believe it is massively urgent at the moment
> > though and reclaiming to start with is perfectly adequate just as lumpy
> > reclaim is fine at the moment.
>
> Right. We can defer this for now.
>
Agreed.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 00/23] Slab defragmentation V6
2007-11-08 20:24 ` Mel Gorman
@ 2007-11-08 20:28 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 20:28 UTC (permalink / raw)
To: Mel Gorman; +Cc: akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Mel Gorman wrote:
> It certainly can be tried out. However, this is a future problem and
> independent of the current patchset. I don't want to drag us down a blind
> alley about a problem that isn't even at hand.
Right. That is why I took it out.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: Plans for Onezonelist patch series ???
2007-11-08 20:20 ` Christoph Lameter
@ 2007-11-08 20:29 ` Mel Gorman
0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2007-11-08 20:29 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, akpm, linux-kernel, linux-mm
On (08/11/07 12:20), Christoph Lameter didst pronounce:
> On Thu, 8 Nov 2007, Mel Gorman wrote:
>
> > I've rebased the patches to mm-broken-out-2007-11-06-02-32. However, the
> > vanilla -mm and the one with onezonelist applied are locking up in the
> > same manner. I'm way too behind at the moment to guess if it is a new bug
> > or reported already. At best, I can say the patches are not making things
> > any worse :) I'll go through the archives in the morning and do a bit more
> > testing to see what happens.
>
> I usually base my patches on Linus' tree as long as there is no tree
> available from Andrew. But that means that may have to
> approximate what is in there by adding this and that.
>
Unfortunately for me, there are several collisions with the patches when
applied against -mm if the patches are based on latest git. They are mainly in
mm/vmscan.c due to the memory controller work. For the purposes of testing and
merging, it makes more sense for me to work against -mm as much as possible.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 00/23] Slab defragmentation V6
2007-11-08 19:12 ` [patch 00/23] Slab defragmentation V6 Christoph Lameter
2007-11-08 20:24 ` Mel Gorman
@ 2007-11-08 20:58 ` Lee Schermerhorn
2007-11-08 21:27 ` Christoph Lameter
1 sibling, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-11-08 20:58 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, 2007-11-08 at 11:12 -0800, Christoph Lameter wrote:
> On Thu, 8 Nov 2007, Mel Gorman wrote:
>
> > On Tue, 2007-11-06 at 17:11 -0800, Christoph Lameter wrote:
> > > Slab defragmentation is mainly an issue if Linux is used as a fileserver
> >
> > Was hoping this would get renamed to SLUB Targetted Reclaim from
> > discussions at VM Summit. As no copying is taking place, it's confusing
> > to call it defragmentation to me anyway. Not a major deal but it made
> > reading the patches a little confusing.
>
> The problem is that people are focusing on one feature here and forgetting
> about the rest. Targetted reclaim is one feature that was added later when
> lumpy reclaim was added to the kernel. The primary intent of this patchset
> was always to reduce fragmentation. The name is appropriate and the
> patchset will support copying of objects as soon as support for that is
> added to kick(). In that case the copying you are looking for will be
> there. The simple implementation of the kick() methods is to copy
> pieces of the reclaim code. That is what is included here.
>
> > > With lumpy reclaim slab defragmentation can be used to enhance the
> > > ability to recover larger contiguous areas of memory. Lumpy reclaim currently
> > > cannot do anything if a slab page is encountered. With slab defragmentation
> > > that slab page can be removed and a large contiguous page freed. It may
> > > be possible to have slab pages also part of ZONE_MOVABLE (Mel's defrag
> > > scheme in 2.6.23)
> >
> > More terminology nit-pick - ZONE_MOVABLE is not defragmenting anything.
> > It's just partitioning memory. The slab pages need to be 100%
> > reclaimable or movable for that to happen but even with targetted
> > reclaim, some dentries such as the root directory one cannot be
> > reclaimed, right?
>
> 100%? I am so fond of these categorical statements ....
>
> ZONE_MOVABLE also contains mlocked pages that are also not reclaimable.
> The question is at what level would it be possible to make them MOVABLE?
> It may take some improvements to the kick() methods to make eviction more
> reliable. Allowing the moving of objects in the kick() methods will
> > likely get us there.
Christoph: Although mlocked pages are not reclaimable, they ARE
migratable. You fixed that a long time ago. [And I just verified with
memtoy.] Doesn't this make them "movable"?
Lee
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 20:01 ` Christoph Lameter
@ 2007-11-08 21:03 ` Matt Mackall
2007-11-08 21:28 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Matt Mackall @ 2007-11-08 21:03 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, Nov 08, 2007 at 12:01:24PM -0800, Christoph Lameter wrote:
> On Thu, 8 Nov 2007, Matt Mackall wrote:
>
> > Not really. drivers/char/random.c does:
> >
> > __get_cpu_var(trickle_count)++ & 0xfff
>
> That is incremented on each call to add_timer_randomness. Not a high
> enough resolution there. I guess I am stuck with get_cycles().
I'm not suggesting you use trickle_count, silly. I'm suggesting you
use a similar approach.
But perhaps I should just add a lightweight RNG to random.c and be
done with it.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 00/23] Slab defragmentation V6
2007-11-08 20:58 ` Lee Schermerhorn
@ 2007-11-08 21:27 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 21:27 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Lee Schermerhorn wrote:
> > ZONE_MOVABLE also contains mlocked pages that are also not reclaimable.
> > The question is at what level would it be possible to make them MOVABLE?
> > It may take some improvements to the kick() methods to make eviction more
> > reliable. Allowing the moving of objects in the kick() methods will
> > likely get us there.
>
> Christoph: Although mlocked pages are not reclaimable, they ARE
> migratable. You fixed that a long time ago. [And I just verified with
> memtoy.] Doesn't this make them "movable"?
I know. They are movable but not reclaimable.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 21:03 ` Matt Mackall
@ 2007-11-08 21:28 ` Christoph Lameter
2007-11-08 23:08 ` Matt Mackall
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-11-08 21:28 UTC (permalink / raw)
To: Matt Mackall; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, 8 Nov 2007, Matt Mackall wrote:
> But perhaps I should just add a lightweight RNG to random.c and be
> done with it.
It would be appreciated.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio
2007-11-08 21:28 ` Christoph Lameter
@ 2007-11-08 23:08 ` Matt Mackall
0 siblings, 0 replies; 77+ messages in thread
From: Matt Mackall @ 2007-11-08 23:08 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Mel Gorman, akpm, linux-kernel, linux-mm
On Thu, Nov 08, 2007 at 01:28:31PM -0800, Christoph Lameter wrote:
> On Thu, 8 Nov 2007, Matt Mackall wrote:
>
> > But perhaps I should just add a lightweight RNG to random.c and be
> > done with it.
>
> It would be appreciated.
As someone pointed out privately, there's a random32() in lib/random32.c.
Unfortunately, this function is too heavy for many fast-path uses and
too weak for most other uses. I'll see if I can come up with something
better...
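For reference, the generator in question is the one exported from
lib/random32.c; using it is as simple as the sketch below (the helper is
invented, only the random32() call itself is the real interface).

#include <linux/random.h>

/*
 * random32() does a full generator step per call, which is the "too
 * heavy for a fast path" concern, and its output is not suitable where
 * cryptographic strength is needed.
 */
static unsigned int pick_random_bucket(unsigned int nr_buckets)
{
	return random32() % nr_buckets;
}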
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 77+ messages in thread
end of thread, other threads:[~2007-11-08 23:08 UTC | newest]
Thread overview: 77+ messages
2007-11-07 1:11 [patch 00/23] Slab defragmentation V6 Christoph Lameter
2007-11-07 1:11 ` [patch 01/23] SLUB: Move count_partial() Christoph Lameter
2007-11-07 1:11 ` [patch 02/23] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio Christoph Lameter
2007-11-08 14:50 ` Mel Gorman
2007-11-08 17:25 ` Matt Mackall
2007-11-08 19:16 ` Christoph Lameter
2007-11-08 19:47 ` Matt Mackall
2007-11-08 20:01 ` Christoph Lameter
2007-11-08 21:03 ` Matt Mackall
2007-11-08 21:28 ` Christoph Lameter
2007-11-08 23:08 ` Matt Mackall
2007-11-08 18:56 ` Christoph Lameter
2007-11-08 20:10 ` Mel Gorman
2007-11-07 1:11 ` [patch 03/23] bufferhead: Revert constructor removal Christoph Lameter
2007-11-07 1:11 ` [patch 04/23] dentries: Extract common code to remove dentry from lru Christoph Lameter
2007-11-07 8:50 ` Johannes Weiner
2007-11-07 9:43 ` Jörn Engel
2007-11-07 18:55 ` Christoph Lameter
2007-11-07 18:54 ` Jörn Engel
2007-11-07 19:00 ` Christoph Lameter
2007-11-07 18:28 ` Christoph Lameter
2007-11-07 1:11 ` [patch 05/23] VM: Allow get_page_unless_zero on compound pages Christoph Lameter
2007-11-07 1:11 ` [patch 06/23] SLUB: Extend slabinfo to support -D and -C options Christoph Lameter
2007-11-08 15:00 ` Mel Gorman
2007-11-07 1:11 ` [patch 07/23] SLUB: Add defrag_ratio field and sysfs support Christoph Lameter
2007-11-07 8:55 ` Johannes Weiner
2007-11-07 18:30 ` Christoph Lameter
2007-11-08 15:07 ` Mel Gorman
2007-11-08 18:59 ` Christoph Lameter
2007-11-07 1:11 ` [patch 08/23] SLUB: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
2007-11-07 1:11 ` [patch 09/23] SLUB: Add get() and kick() methods Christoph Lameter
2007-11-07 2:37 ` Adrian Bunk
2007-11-07 3:07 ` Christoph Lameter
2007-11-07 3:26 ` Adrian Bunk
2007-11-07 1:11 ` [patch 10/23] SLUB: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
2007-11-07 1:11 ` [patch 11/23] SLUB: Slab defrag core Christoph Lameter
2007-11-07 22:13 ` Christoph Lameter
2007-11-07 1:11 ` [patch 12/23] SLUB: Trigger defragmentation from memory reclaim Christoph Lameter
2007-11-07 9:28 ` Johannes Weiner
2007-11-07 18:34 ` Christoph Lameter
2007-11-08 15:12 ` Mel Gorman
2007-11-08 19:00 ` Christoph Lameter
2007-11-07 1:11 ` [patch 13/23] Buffer heads: Support slab defrag Christoph Lameter
2007-11-07 1:11 ` [patch 14/23] inodes: Support generic defragmentation Christoph Lameter
2007-11-07 10:17 ` Jörn Engel
2007-11-07 10:31 ` Jörn Engel
2007-11-07 10:35 ` Andreas Schwab
2007-11-07 10:35 ` Jörn Engel
2007-11-07 18:40 ` Christoph Lameter
2007-11-07 18:51 ` Jörn Engel
2007-11-07 19:00 ` Christoph Lameter
2007-11-07 1:11 ` [patch 15/23] FS: ExtX filesystem defrag Christoph Lameter
2007-11-07 1:11 ` [patch 16/23] FS: XFS slab defragmentation Christoph Lameter
2007-11-07 1:11 ` [patch 17/23] FS: Proc filesystem support for slab defrag Christoph Lameter
2007-11-07 1:11 ` [patch 18/23] FS: Slab defrag: Reiserfs support Christoph Lameter
2007-11-07 1:11 ` [patch 19/23] FS: Socket inode defragmentation Christoph Lameter
2007-11-07 1:11 ` [patch 20/23] dentries: Add constructor Christoph Lameter
2007-11-08 15:23 ` Mel Gorman
2007-11-08 19:03 ` Christoph Lameter
2007-11-07 1:11 ` [patch 21/23] dentries: dentry defragmentation Christoph Lameter
2007-11-07 1:11 ` [patch 22/23] SLUB: Slab reclaim through Lumpy reclaim Christoph Lameter
2007-11-07 1:11 ` [patch 23/23] SLUB: Add SlabReclaimable() to avoid repeated reclaim attempts Christoph Lameter
2007-11-08 15:26 ` [patch 00/23] Slab defragmentation V6 Mel Gorman
2007-11-08 16:01 ` Plans for Onezonelist patch series ??? Lee Schermerhorn
2007-11-08 18:34 ` Christoph Lameter
2007-11-08 18:40 ` Mel Gorman
2007-11-08 18:43 ` Christoph Lameter
2007-11-08 20:06 ` Mel Gorman
2007-11-08 20:20 ` Christoph Lameter
2007-11-08 20:29 ` Mel Gorman
2007-11-08 18:39 ` Mel Gorman
2007-11-08 19:39 ` Christoph Lameter
2007-11-08 19:12 ` [patch 00/23] Slab defragmentation V6 Christoph Lameter
2007-11-08 20:24 ` Mel Gorman
2007-11-08 20:28 ` Christoph Lameter
2007-11-08 20:58 ` Lee Schermerhorn
2007-11-08 21:27 ` Christoph Lameter