linux-mm.kvack.org archive mirror
* [patch 00/12] Slab defragmentation V4
@ 2007-07-08  3:05 Christoph Lameter
  2007-07-08  3:05 ` [patch 01/12] Slab defragmentation: Add support for kmem_cache_ops Christoph Lameter
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
  per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
  improved by slab defragmentation.
- Rediff against 2.6.22-rc6-mm1 (+ slub patches in mm but
  there should be minimal overlap--if any--with those)

V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
  have slabs with a high degree of fragmentation.

V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
  functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
  a new dentry flag that indicates that a dentry is not in the process
  of being freed or allocated.



Slab defragmentation is useful to increase the object density in slab caches.
On reclaim the fragmentation ratio of each defragmentable slab cache is
checked. If the object density (the ratio of the objects actually in use to
the maximum number of objects that the allocated slabs could hold) is below
a certain percentage (default: 30%) then the slabs with the fewest objects
in them are vacated and freed, which increases the object density of the
remaining slabs.
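
The check is roughly the following (a simplified, illustrative sketch with
made-up names; the real per-node implementation is part of patch 2 of this
series):

/*
 * Sketch of the density check: nr_slabs and objects_per_slab describe
 * the allocated slabs of one cache, defrag_ratio is the percentage
 * below which defragmentation is triggered (default 30).
 */
static int cache_needs_defrag(unsigned long objects_in_use,
			      unsigned long nr_slabs,
			      unsigned long objects_per_slab,
			      int defrag_ratio)
{
	unsigned long capacity = nr_slabs * objects_per_slab;

	if (!capacity || !defrag_ratio)
		return 0;

	/* Object density in percent */
	return objects_in_use * 100 / capacity <= defrag_ratio;
}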

Currently supported are

1. dentry defrag
2. inode defrag (with a generic interface to allow easy setup of more
   filesystems than the currently supported ext2/3/4, reiserfs, XFS
   and proc)
3. buffer_heads

One typical mechanism that triggers slab defragmentation on my systems
is the daily run of

	updatedb

Updatedb scans all files on the system, which causes heavy inode and dentry
cache use. After updatedb is complete we go back to the regular use patterns
(typically on my machine: kernel compiles), which need the memory for
different purposes. The inodes and dentries used by updatedb are gradually
aged by the dentry/inode reclaim algorithm, which frees dentries and inodes
at random positions throughout the slabs that were allocated. As a result
the slabs become sparsely populated. Slabs that become completely empty can
be freed, but many of them remain sparsely populated. That is where slab
defrag comes in: it vacates the slabs that hold only a few objects,
reclaiming more memory for other uses.

Currently slab reclaim writes messages like the following to the syslog when
slab defragmentation occurs:





Test results (see appended scripts / user space code for more data)

(3-level tree with 10 entries at the first level, 20 at the second and 30 files
at the third level. Files at the lowest level were removed to create inode
fragmentation.)

%Ra is the allocation ratio (the slabinfo patch from this series is needed to get these numbers)

inode reclaim in reiserfs

Name                   Objects Objsize    Space Slabs/Part/Cpu  O/S O %Ra %Ef Flg
dentry                   14660     200     3.0M        733/0/1   20 0 100  97 Da
reiser_inode_cache        1596     640     4.1M      256/201/1   25 2  24  24 DCa

Status after defrag

Name                   Objects Objsize    Space Slabs/Part/Cpu  O/S O %Ra %Ef Flg
dentry                    8849     200     1.8M       454/17/1   20 0  97  95 Da
reiser_inode_cache        1381     640     1.0M        65/11/0   25 2  84  82 DCa



Slab defragmentation can be triggered in two ways:

1. Manually by running

slabinfo -s <slabs-to-shrink>

or by the kernel calling

kmem_cache_shrink(slab)

(Currently only ACPI makes such a call, on a slab cache that has no
defragmentation support. In that case we simply do what SLAB does:
drop the per-cpu caches and sift through the partial lists for free slabs.)

2. Automatically if defragmentable slabs reach a certain degree of
   fragmentation.

The point at which slab defragmentation kicks in can be set via

/proc/sys/vm/slab_defrag_ratio

Slab fragmentation is measured by the fraction of possible object slots in a
slab cache that are actually in use. The default setting for slab_defrag_ratio
is 30%, which means that defragmentation is triggered once less than 30% of
the available object slots are in use, i.e. when there are on average more
than about two free object slots for each allocated object. For example, a
cache whose slabs could hold 1000 objects but that has only 250 objects
allocated is 25% populated and would be defragmented.

Setting slab_defrag_ratio higher causes defragmentation runs to occur more
often. Setting slab_defrag_ratio to 0 disables slab defragmentation.

Slab caches are checked for their fragmentation levels after the slab
shrinkers in mm/vmscan.c have run during memory reclaim. This means that slab
defragmentation is only triggered if we are under memory pressure and there is
significant slab fragmentation.
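
Conceptually the hook (as added by patch 4 of this series, names taken from
that patch; this is a simplified restatement, not the exact kernel code)
amounts to:

#include <linux/gfp.h>
#include <linux/mmzone.h>
#include <linux/slab.h>		/* kmem_cache_defrag() from this series */

/*
 * After the slab shrinkers have run, defragment the slab caches on the
 * node under memory pressure, or on all nodes (-1) for a global shrink.
 * Only done for allocations that are allowed to touch the filesystem.
 */
static void defrag_after_shrink(gfp_t gfp_mask, struct zone *zone)
{
	if (gfp_mask & __GFP_FS)
		kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
}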



Test script:

#!/bin/sh

echo 30 >/proc/sys/vm/slab_defrag_ratio

./gazfiles c 3 10 20 30
echo "Status before"
slabinfo -D
./gazfiles d 2
echo "Status after removing files"
slabinfo -D
slabinfo -s
echo "Status after defrag"
slabinfo -D
./gazfiles d 0


gazfiles.c :

/*
 * Create a gazillion files to be able to create slab fragmentation
 *
 * (C) 2007 sgi, Christoph Lameter <clameter@sgi.com>
 *
 * Create an n-level hierarchy of empty files
 *
 * gazfiles <action> <levels> <n1> <n2> ...
 *
 * gazfiles c[reate] 3 50 50 50
 *
 * gazfiles s[hrink] <levels>
 *
 * gazfiles d[elete] <levels>
 *
 * gazfiles r[andomkill] <nr to kill>
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <string.h>
#include <unistd.h>
#include <stdarg.h>
#include <getopt.h>
#include <regex.h>
#include <errno.h>

#define MAXIMUM_LEVELS 10

int level;
int sizes[MAXIMUM_LEVELS];

void fatal(const char *x, ...)
{
        va_list ap;

        va_start(ap, x);
        vfprintf(stderr, x, ap);
        va_end(ap);
        exit(1);
}

int read_gaz(void)
{
	FILE *f = fopen(".gazinfo", "r");
	int rc = 0;
	int i;

	if (!f)
		return 0;

	if (!fscanf(f, "%d", &level))
		goto out;

	if (level >= MAXIMUM_LEVELS)
		goto out;

	for (i = 0; i < level; i++)
		if (!fscanf(f, " %d", &sizes[i]))
			goto out;
	rc = 1;
out:
	fclose(f);
	return rc;
}

void write_gaz(void)
{
	FILE *f = fopen(".gazinfo","w");
	int i;

	fprintf(f, "%d",level);
	for (i = 0; i < level; i++)
		fprintf(f," %d", sizes[i]);
	fprintf(f, "\n");
	fclose(f);
}

void cre(int l)
{
	int i;

	for (i = 0; i < sizes[l - 1]; i++) {
		char name[20];

		sprintf(name, "%03d", i);

		if (l < level) {
			mkdir(name, 0775);
			chdir(name);
			cre(l + 1);
			chdir("..");
		} else {
			FILE *f;

			f = fopen(name,"w");
			fprintf(f, "Test");
			fclose(f);
		}
	}
}

void create(int l, char **sz)
{
	int i;

	level = l;
	for (i = 0; i < level; i++)
		sizes[i] = atoi(sz[i]);

	if (mkdir("gazf", 0775))
		fatal("Cannot create gazf here\n");
	chdir("gazf");
	write_gaz();
	cre(1);
	chdir("..");
}

void shrink(int level)
{
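	/*
	 * Placeholder: the shrink operation is not implemented yet.
	 * This only checks that a gazfiles tree exists and reads its
	 * metadata.
	 */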
	if (chdir("gazf"))
		fatal("No gazfiles in this directory");
	read_gaz();
	chdir("..");
}

void scand(int l, void (*func)(int, int, char *, unsigned long),
			unsigned long level)
{
	DIR *dir;
	struct dirent *de;

	dir = opendir(".");
	if (!dir)
		fatal("Cannot open directory");
	while ((de = readdir(dir))) {
		struct stat s;

		if (de->d_name[0] == '.')
			continue;

		/*
		 * Some idiot broke the glibc library or made it impossible
		 * to figure out how to make readdir work right
		 */

		stat(de->d_name, &s);
		if (S_ISDIR(s.st_mode))
			de->d_type = DT_DIR;

		if (de->d_type == DT_DIR) {
			if (chdir(de->d_name))
				fatal("Cannot enter %s", de->d_name);
			scand(l + 1, func, level);
			chdir("..");
			func(l, 1, de->d_name, level);
		} else {
			func(l, 0, de->d_name, level);
		}
	}
	closedir(dir);
}

void traverse(void (*func)(int, int, char *, unsigned long),
		unsigned long level)
{
	if (chdir("gazf"))
		fatal("No gazfiles in this directory");
	scand(1, func, level);
	chdir("..");
}

void randomkill(int nr)
{
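	/*
	 * Placeholder: random removal of files is not implemented yet.
	 * This only checks that a gazfiles tree exists and reads its
	 * metadata.
	 */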
	if (chdir("gazf"))
		fatal("No gazfiles in this directory");
	read_gaz();
	chdir("..");
}

void del_func(int l, int dir, char *name, unsigned long level)
{
	if (l <= level)
		return;
	if (dir) {
		if (rmdir(name))
			fatal("Cannot remove directory %s\n", name);
	} else {
		if (unlink(name))
			fatal("Cannot unlink file %s\n", name);
	}
}

void delete(int l)
{
	if (l == 0) {
		system("rm -rf gazf");
		return;
	}
	traverse(del_func, l);
}

void usage(void)
{
	printf("gazfiles: Tool to manage gazillions of files\n\n");
	printf("gazfiles create <levels> <#l1> <#l2> ...\n");
	printf("gazfiles delete <levels>\n");
	printf("gazfiles shrink <levels>\n");
	printf("gazfiles randomkill <nr>\n\n");
	printf("(C) 2007 sgi, Christoph Lameter <clameter@sgi.com>\n");
	exit(0);
}

int main(int argc, char *argv[])
{
	if (argc  <  2)
		usage();

	switch (argv[1][0]) {
		case 'c' :
			create(atoi(argv[2]), argv + 3);
			break;
		case 's' :
			if (argc != 3)
				usage();

			shrink(atoi(argv[2]));
			break;
		case 'r' :
			if (argc != 3)
				usage();

			randomkill(atoi(argv[2]));
			break;
		case 'd':
			if (argc != 3)
				usage();
			delete(atoi(argv[2]));
			break;

		default:
			usage();
	}
	return 0;
}

-- 


* [patch 01/12] Slab defragmentation: Add support for kmem_cache_ops
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 02/12] Slab defragmentation: Core piece Christoph Lameter
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slab_defrag_kmem_cache_ops --]
[-- Type: text/plain, Size: 8569 bytes --]

We use the parameter formerly used by the destructor to pass an optional
pointer to a kmem_cache_ops structure to kmem_cache_create.

kmem_cache_ops is created as empty. Later patches populate kmem_cache_ops.

Create a KMEM_CACHE_OPS macro that allows the specification of the
kmem_cache_ops parameter.

Code to handle kmem_cache_ops is added to SLUB. SLAB and SLOB are updated
to be able to accept a kmem_cache_ops structure but will ignore it.
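
For illustration, a cache that wants to pass operations would then be set up
as follows (hypothetical structure and cache names, not part of this patch;
the ops structure is still empty at this point in the series):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

struct my_object {
	int value;
};

static const struct kmem_cache_ops my_object_ops = {
	/* get()/kick() callbacks are introduced by later patches */
};

static struct kmem_cache *my_object_cache;

static int __init my_object_cache_init(void)
{
	/* Like KMEM_CACHE() but with a kmem_cache_ops pointer */
	my_object_cache = KMEM_CACHE_OPS(my_object, 0, &my_object_ops);
	if (!my_object_cache)
		return -ENOMEM;
	return 0;
}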

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h     |   13 +++++++++----
 include/linux/slub_def.h |    1 +
 mm/slab.c                |    6 +++---
 mm/slob.c                |    2 +-
 mm/slub.c                |   44 ++++++++++++++++++++++++++++++--------------
 5 files changed, 44 insertions(+), 22 deletions(-)

Index: linux-2.6.22-rc6-mm1/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/slab.h	2007-07-04 09:09:55.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/slab.h	2007-07-04 09:14:59.000000000 -0700
@@ -51,10 +51,13 @@
 void __init kmem_cache_init(void);
 int slab_is_available(void);
 
+struct kmem_cache_ops {
+};
+
 struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
 			unsigned long,
 			void (*)(void *, struct kmem_cache *, unsigned long),
-			void (*)(void *, struct kmem_cache *, unsigned long));
+			const struct kmem_cache_ops *s);
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
@@ -70,9 +73,11 @@ int kmem_ptr_validate(struct kmem_cache 
  * f.e. add ____cacheline_aligned_in_smp to the struct declaration
  * then the objects will be properly aligned in SMP configurations.
  */
-#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
-		sizeof(struct __struct), __alignof__(struct __struct),\
-		(__flags), NULL, NULL)
+#define KMEM_CACHE_OPS(__struct, __flags, __ops) \
+	kmem_cache_create(#__struct, sizeof(struct __struct), \
+	__alignof__(struct __struct), (__flags), NULL, (__ops))
+
+#define KMEM_CACHE(__struct, __flags) KMEM_CACHE_OPS(__struct, __flags, NULL)
 
 /*
  * The largest kmalloc size supported by the slab allocators is
Index: linux-2.6.22-rc6-mm1/mm/slub.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slub.c	2007-07-04 09:14:35.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slub.c	2007-07-04 09:15:21.000000000 -0700
@@ -300,6 +300,9 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
+struct kmem_cache_ops slub_default_ops = {
+};
+
 /*
  * Slow version of get and set free pointer.
  *
@@ -2082,11 +2085,13 @@ static int calculate_sizes(struct kmem_c
 static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
 		const char *name, size_t size,
 		size_t align, unsigned long flags,
-		void (*ctor)(void *, struct kmem_cache *, unsigned long))
+		void (*ctor)(void *, struct kmem_cache *, unsigned long),
+		const struct kmem_cache_ops *ops)
 {
 	memset(s, 0, kmem_size);
 	s->name = name;
 	s->ctor = ctor;
+	s->ops = ops;
 	s->objsize = size;
 	s->flags = flags;
 	s->align = align;
@@ -2270,7 +2275,7 @@ static struct kmem_cache *create_kmalloc
 
 	down_write(&slub_lock);
 	if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN,
-			flags, NULL))
+			flags, NULL, &slub_default_ops))
 		goto panic;
 
 	list_add(&s->list, &slab_caches);
@@ -2639,12 +2644,16 @@ static int slab_unmergeable(struct kmem_
 	if (s->refcount < 0)
 		return 1;
 
+	if (s->ops != &slub_default_ops)
+		return 1;
+
 	return 0;
 }
 
 static struct kmem_cache *find_mergeable(size_t size,
 		size_t align, unsigned long flags,
-		void (*ctor)(void *, struct kmem_cache *, unsigned long))
+		void (*ctor)(void *, struct kmem_cache *, unsigned long),
+		const struct kmem_cache_ops *ops)
 {
 	struct kmem_cache *s;
 
@@ -2654,6 +2663,9 @@ static struct kmem_cache *find_mergeable
 	if (ctor)
 		return NULL;
 
+	if (ops != &slub_default_ops)
+		return NULL;
+
 	size = ALIGN(size, sizeof(void *));
 	align = calculate_alignment(flags, align, size);
 	size = ALIGN(size, align);
@@ -2686,13 +2698,15 @@ static struct kmem_cache *find_mergeable
 struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 		size_t align, unsigned long flags,
 		void (*ctor)(void *, struct kmem_cache *, unsigned long),
-		void (*dtor)(void *, struct kmem_cache *, unsigned long))
+		const struct kmem_cache_ops *ops)
 {
 	struct kmem_cache *s;
 
-	BUG_ON(dtor);
+	if (!ops)
+		ops = &slub_default_ops;
+
 	down_write(&slub_lock);
-	s = find_mergeable(size, align, flags, ctor);
+	s = find_mergeable(size, align, flags, ctor, ops);
 	if (s) {
 		s->refcount++;
 		/*
@@ -2712,7 +2726,7 @@ struct kmem_cache *kmem_cache_create(con
 	s = kmalloc(kmem_size, GFP_KERNEL);
 	if (s) {
 		if (kmem_cache_open(s, GFP_KERNEL, name,
-				size, align, flags, ctor)) {
+				size, align, flags, ctor, ops)) {
 			list_add(&s->list, &slab_caches);
 			up_write(&slub_lock);
 			raise_kswapd_order(s->order);
@@ -3332,16 +3346,18 @@ static ssize_t order_show(struct kmem_ca
 }
 SLAB_ATTR_RO(order);
 
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
 {
-	if (s->ctor) {
-		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+	int x = 0;
 
-		return n + sprintf(buf + n, "\n");
+	if (s->ctor) {
+		x += sprintf(buf + x, "ctor : ");
+		x += sprint_symbol(buf + x, (unsigned long)s->ctor);
+		x += sprintf(buf + x, "\n");
 	}
-	return 0;
+	return x;
 }
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
 
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
@@ -3576,7 +3592,7 @@ static struct attribute * slab_attrs[] =
 	&slabs_attr.attr,
 	&partial_attr.attr,
 	&cpu_slabs_attr.attr,
-	&ctor_attr.attr,
+	&ops_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
 	&sanity_checks_attr.attr,
Index: linux-2.6.22-rc6-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/slub_def.h	2007-07-04 09:09:55.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/slub_def.h	2007-07-04 09:14:59.000000000 -0700
@@ -42,6 +42,7 @@ struct kmem_cache {
 	int objects;		/* Number of objects in slab */
 	int refcount;		/* Refcount for slab cache destroy */
 	void (*ctor)(void *, struct kmem_cache *, unsigned long);
+	const struct kmem_cache_ops *ops;
 	int inuse;		/* Offset to metadata */
 	int align;		/* Alignment */
 	const char *name;	/* Name (only for display!) */
Index: linux-2.6.22-rc6-mm1/mm/slab.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slab.c	2007-07-04 09:09:55.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slab.c	2007-07-04 09:14:59.000000000 -0700
@@ -2102,7 +2102,7 @@ static int __init_refok setup_cpu_cache(
  * @align: The required alignment for the objects.
  * @flags: SLAB flags
  * @ctor: A constructor for the objects.
- * @dtor: A destructor for the objects (not implemented anymore).
+ * @ops: A kmem_cache_ops structure (ignored).
  *
  * Returns a ptr to the cache on success, NULL on failure.
  * Cannot be called within a int, but can be interrupted.
@@ -2128,7 +2128,7 @@ struct kmem_cache *
 kmem_cache_create (const char *name, size_t size, size_t align,
 	unsigned long flags,
 	void (*ctor)(void*, struct kmem_cache *, unsigned long),
-	void (*dtor)(void*, struct kmem_cache *, unsigned long))
+	const struct kmem_cache_ops *ops)
 {
 	size_t left_over, slab_size, ralign;
 	struct kmem_cache *cachep = NULL, *pc;
@@ -2137,7 +2137,7 @@ kmem_cache_create (const char *name, siz
 	 * Sanity checks... these are all serious usage bugs.
 	 */
 	if (!name || in_interrupt() || (size < BYTES_PER_WORD) ||
-	    size > KMALLOC_MAX_SIZE || dtor) {
+	    size > KMALLOC_MAX_SIZE) {
 		printk(KERN_ERR "%s: Early error in slab %s\n", __FUNCTION__,
 				name);
 		BUG();
Index: linux-2.6.22-rc6-mm1/mm/slob.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slob.c	2007-07-04 09:09:55.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slob.c	2007-07-04 09:14:59.000000000 -0700
@@ -493,7 +493,7 @@ struct kmem_cache {
 struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 	size_t align, unsigned long flags,
 	void (*ctor)(void*, struct kmem_cache *, unsigned long),
-	void (*dtor)(void*, struct kmem_cache *, unsigned long))
+	const struct kmem_cache_ops *o)
 {
 	struct kmem_cache *c;
 

-- 


* [patch 02/12] Slab defragmentation: Core piece
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
  2007-07-08  3:05 ` [patch 01/12] Slab defragmentation: Add support for kmem_cache_ops Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 03/12] Slab defragmentation: Updates to slabinfo.c Christoph Lameter
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slab_defrag_core --]
[-- Type: text/plain, Size: 22259 bytes --]

V3->V4:
- Reduce impact of checking for slab caches in need of defragmentation
  by
  	A. Ordering the slab list. Defragmentable slabs come first, so
	   we only have to scan the slab list until we hit the first
	   cache that cannot be defragmented.
        B. Avoid determining the number of objects in partial slabs
	   when we can deduce from the number of partial slabs that
	   the fragmentation ratio will not be below the boundary.
	C. Allocate a scratch area once for multiple defragmentable
	   slab caches. In order to do that we need to determine the
	   maximum number of objects in defragmentable slabs in
	   kmem_cache_create().

- The slab defragmentation ratio can now be setup for each slab cache
  separately.
- Rename the existing "defrag_ratio" in /sys/slab/* to
  remote_node_defrag_ratio to free up the defrag_ratio field for
  defragmentation.

Slab defragmentation occurs either

1. Unconditionally when kmem_cache_shrink() is called on a slab cache, either
   directly by the kernel or by slabinfo triggering slab shrinking. This
   form performs defragmentation on all nodes of a NUMA system.

2. Conditionally when kmem_cache_defrag(<node>) is called.

   The defragmentation is only performed on a slab cache if its usage falls
   below the percentage configured in the cache's defrag_ratio. Usage ratios
   are measured by calculating the percentage of objects in use compared to
   the total number of objects that the slab cache could hold.

   kmem_cache_defrag takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.
   If a node number is specified then defragmentation is only performed
   on that node.

   Slab defragmentation is a memory intensive operation that can be
   sped up on a NUMA system if mostly node-local memory is accessed. That
   is the case if we have just performed reclaim on a node.

For defragmentation SLUB first generates a sorted list of partial slabs.
Sorting is performed according to the number of objects allocated.
Thus the slabs with the least objects will be at the end.

We take slabs off the tail of that list until either only a minimum number
of partial slabs remains or we encounter a slab that has more than a
quarter of its objects allocated. Then we attempt to remove the objects
from each of the slabs taken.

In order for a slab cache to support defragmentation, two functions
must be defined via kmem_cache_ops. These are

void *get(struct kmem_cache *s, int nr, void **objects)

	Must obtain references to the listed objects. SLUB guarantees that
	the objects are still allocated. However, other threads may be blocked
	in slab_free attempting to free objects in the slab. These frees may
	succeed as soon as get() returns to the slab allocator. The function
	must be able to detect this situation and cancel the handling of such
	objects (for example by clearing the corresponding entry in the
	objects array).

	No slab operations may be performed in get(). Interrupts
	are disabled. What can be done is very limited. The slab lock
	for the page with the object is taken. Any attempt to perform a slab
	operation may lead to a deadlock.

	get() returns a private pointer that is passed to kick(). Should we
	be unable to obtain all references then that pointer may indicate
	to the kick() function that it should not attempt to remove or move
	any objects but simply drop the references that were obtained.

void kick(struct kmem_cache *, int nr, void **objects, void *get_result)

	After SLUB has established references to the objects in a
	slab it will drop all locks and then use kick() to move objects out
	of the slab. The existence of the object is guaranteed by virtue of
	the earlier obtained references via get(). The callback may perform
	any slab operation since no locks are held at the time of call.

	The callback should remove the object from the slab in some way. This
	may be accomplished by reclaiming the object and freeing it via
	kmem_cache_free(), or by moving its contents to a newly allocated
	object and then freeing the old one via kmem_cache_free(). Reallocation
	is advantageous because the partial slabs were just sorted so that the
	slabs with the most objects come first. Reallocation is therefore
	likely to fill up another partial slab, in addition to freeing the
	slab being vacated, so that the filled slab can also be removed from
	the partial list.

	Kick() does not return a result. SLUB will check the number of
	remaining objects in the slab. If all objects were removed then
	we know that the operation was successful.
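
To make the contract concrete, here is a sketch of what a cache user could
provide. The object type, its reference counting and the evict helper are
hypothetical; see the dentry, inode and buffer_head patches later in this
series for real implementations.

#include <linux/slab.h>
#include <asm/atomic.h>

struct my_object {
	atomic_t refcount;
	/* ... payload ... */
};

/* Hypothetical helper that detaches the object from its users */
extern void my_object_evict(struct my_object *obj);

/*
 * Called with the slab lock held and interrupts disabled: only take
 * references here, never call back into the slab allocator.
 */
static void *my_get(struct kmem_cache *s, int nr, void **v)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct my_object *obj = v[i];

		/* Object already on its way to being freed? Skip it. */
		if (!atomic_inc_not_zero(&obj->refcount))
			v[i] = NULL;
	}
	return NULL;	/* no private state needed for this cache */
}

/*
 * Called with no locks held: evict the objects we obtained references
 * to, then drop those references, freeing the objects if possible.
 */
static void my_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct my_object *obj = v[i];

		if (!obj)
			continue;
		my_object_evict(obj);
		if (atomic_dec_and_test(&obj->refcount))
			kmem_cache_free(s, obj);
	}
}

static const struct kmem_cache_ops my_object_ops = {
	.get	= my_get,
	.kick	= my_kick,
};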

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h     |   32 +++
 include/linux/slub_def.h |   12 +
 mm/slab.c                |    5 
 mm/slob.c                |    5 
 mm/slub.c                |  415 ++++++++++++++++++++++++++++++++++++++---------
 5 files changed, 395 insertions(+), 74 deletions(-)

Index: linux-2.6.22-rc6-mm1/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/slab.h	2007-07-06 13:29:06.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/slab.h	2007-07-06 20:03:20.000000000 -0700
@@ -51,7 +51,39 @@
 void __init kmem_cache_init(void);
 int slab_is_available(void);
 
+struct kmem_cache;
+
 struct kmem_cache_ops {
+	/*
+	 * Called with slab lock held and interrupts disabled.
+	 * No slab operation may be performed.
+	 *
+	 * Parameters passed are the number of objects to process
+	 * and an array of pointers to objects for which we
+	 * need references.
+	 *
+	 * Returns a pointer that is passed to the kick function.
+	 * If all objects cannot be moved then the pointer may
+	 * indicate that this wont work and then kick can simply
+	 * remove the references that were already obtained.
+	 *
+	 * The array passed to get() is also passed to kick(). The
+	 * function may remove objects by setting array elements to NULL.
+	 */
+	void *(*get)(struct kmem_cache *, int nr, void **);
+
+	/*
+	 * Called with no locks held and interrupts enabled.
+	 * Any operation may be performed in kick().
+	 *
+	 * Parameters passed are the number of objects in the array,
+	 * the array of pointers to the objects and the pointer
+	 * returned by get().
+	 *
+	 * Success is checked by examining the number of remaining
+	 * objects in the slab.
+	 */
+	void (*kick)(struct kmem_cache *, int nr, void **, void *private);
 };
 
 struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
Index: linux-2.6.22-rc6-mm1/mm/slub.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slub.c	2007-07-06 13:29:06.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slub.c	2007-07-06 20:04:54.000000000 -0700
@@ -242,6 +242,9 @@ static enum {
 static DECLARE_RWSEM(slub_lock);
 LIST_HEAD(slab_caches);
 
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects = 0;
+
 /*
  * Tracking user of a slab.
  */
@@ -1308,7 +1311,8 @@ static struct page *get_any_partial(stru
 	 * expensive if we do it every time we are trying to find a slab
 	 * with available objects.
 	 */
-	if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
+	if (!s->remote_node_defrag_ratio ||
+			get_cycles() % 1024 > s->remote_node_defrag_ratio)
 		return NULL;
 
 	zonelist = &NODE_DATA(slab_node(current->mempolicy))
@@ -2103,8 +2107,9 @@ static int kmem_cache_open(struct kmem_c
 		goto error;
 
 	s->refcount = 1;
+	s->defrag_ratio = 30;
 #ifdef CONFIG_NUMA
-	s->defrag_ratio = 100;
+	s->remote_node_defrag_ratio = 100;
 #endif
 	if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
 		return 1;
@@ -2280,7 +2285,7 @@ static struct kmem_cache *create_kmalloc
 			flags, NULL, &slub_default_ops))
 		goto panic;
 
-	list_add(&s->list, &slab_caches);
+	list_add_tail(&s->list, &slab_caches);
 	up_write(&slub_lock);
 	if (sysfs_slab_add(s))
 		goto panic;
@@ -2460,6 +2465,203 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+static unsigned long count_partial(struct kmem_cache_node *n)
+{
+	unsigned long flags;
+	unsigned long x = 0;
+	struct page *page;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry(page, &n->partial, lru)
+		x += page->inuse;
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return x;
+}
+
+/*
+ * Vacate all objects in the given slab.
+ *
+ * Slab must be locked and frozen. Interrupts are disabled (flags must
+ * be passed).
+ *
+ * Will drop and regain and drop the slab lock. At the end the slab will
+ * either be freed or returned to the partial lists.
+ *
+ * Returns the number of remaining objects
+ *
+ * The scratch area passed to this function is sufficient to hold
+ * struct list_head times objects per slab. We use it to hold void ** times
+ * objects per slab plus a bitmap for each object.
+ */
+static int __kmem_cache_vacate(struct kmem_cache *s,
+		struct page *page, unsigned long flags, void *scratch)
+{
+	void **vector = scratch;
+	void *p;
+	void *addr = page_address(page);
+	unsigned long *map = scratch + s->objects * sizeof(void **);
+	int leftover;
+	int objects;
+	void *private;
+
+	if (!page->inuse)
+		goto out;
+
+	/* Determine used objects */
+	bitmap_fill(map, s->objects);
+	for_each_free_object(p, s, page->freelist)
+		__clear_bit(slab_index(p, s, addr), map);
+
+	objects = 0;
+	memset(vector, 0, s->objects * sizeof(void **));
+	for_each_object(p, s, addr) {
+		if (test_bit(slab_index(p, s, addr), map))
+			vector[objects++] = p;
+	}
+
+	private = s->ops->get(s, objects, vector);
+
+	/*
+	 * Got references. Now we can drop the slab lock. The slab
+	 * is frozen so it cannot vanish from under us nor will
+	 * allocations be performed on the slab. However, unlocking the
+	 * slab will allow concurrent slab_frees to proceed.
+	 */
+	slab_unlock(page);
+	local_irq_restore(flags);
+
+	/*
+	 * Perform the KICK callbacks to remove the objects.
+	 */
+	s->ops->kick(s, objects, vector, private);
+
+	local_irq_save(flags);
+	slab_lock(page);
+out:
+	/*
+	 * Check the result and unfreeze the slab
+	 */
+	leftover = page->inuse;
+	unfreeze_slab(s, page);
+	local_irq_restore(flags);
+	return leftover;
+}
+
+/*
+ * Sort the partial slabs by the number of items allocated.
+ * The slabs with the least objects come last.
+ */
+static unsigned long sort_partial_list(struct kmem_cache *s,
+	struct kmem_cache_node *n, void *scratch)
+{
+	struct list_head *slabs_by_inuse = scratch;
+	int i;
+	struct page *page;
+	struct page *t;
+	unsigned long freed = 0;
+
+	for (i = 0; i < s->objects; i++)
+		INIT_LIST_HEAD(slabs_by_inuse + i);
+
+	/*
+	 * Build lists indexed by the items in use in each slab.
+	 *
+	 * Note that concurrent frees may occur while we hold the
+	 * list_lock. page->inuse here is the upper limit.
+	 */
+	list_for_each_entry_safe(page, t, &n->partial, lru) {
+		int inuse = page->inuse;
+
+		if (!inuse && slab_trylock(page)) {
+			/*
+			 * Must hold slab lock here because slab_free
+			 * may have freed the last object and be
+			 * waiting to release the slab.
+			 */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+			freed++;
+		} else {
+			list_move(&page->lru,
+				slabs_by_inuse + inuse);
+		}
+	}
+
+	/*
+	 * Rebuild the partial list with the slabs filled up most
+	 * first and the least used slabs at the end.
+	 */
+	for (i = s->objects - 1; i >= 0; i--)
+		list_splice(slabs_by_inuse + i, n->partial.prev);
+
+	return freed;
+}
+
+/*
+ * Shrink the slab cache on a particular node of the cache
+ */
+static unsigned long __kmem_cache_shrink(struct kmem_cache *s,
+	struct kmem_cache_node *n, void *scratch)
+{
+	unsigned long flags;
+	struct page *page, *page2;
+	LIST_HEAD(zaplist);
+	int freed;
+	int inuse;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	freed = sort_partial_list(s, n, scratch);
+
+	/*
+	 * If we have no functions available to defragment the slabs
+	 * then we are done.
+	*/
+	if (!s->ops->get || !s->ops->kick) {
+		spin_unlock_irqrestore(&n->list_lock, flags);
+		return freed;
+	}
+
+	/*
+	 * Take slabs with just a few objects off the tail of the now
+	 * ordered list. These are the slabs with the least objects
+	 * and those are likely easy to reclaim.
+	 */
+	while (n->nr_partial > MAX_PARTIAL) {
+		page = container_of(n->partial.prev, struct page, lru);
+		inuse = page->inuse;
+
+		/*
+		 * We are holding the list_lock so we can only
+		 * trylock the slab
+		 */
+		if (inuse > s->objects / 4)
+			break;
+
+		if (!slab_trylock(page))
+			break;
+
+		list_move_tail(&page->lru, &zaplist);
+		n->nr_partial--;
+		SetSlabFrozen(page);
+		slab_unlock(page);
+	}
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	/* Now we can free objects in the slabs on the zaplist */
+	list_for_each_entry_safe(page, page2, &zaplist, lru) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		slab_lock(page);
+		if (__kmem_cache_vacate(s, page, flags, scratch) == 0)
+			freed++;
+	}
+	return freed;
+}
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
@@ -2473,71 +2675,111 @@ EXPORT_SYMBOL(kfree);
 int kmem_cache_shrink(struct kmem_cache *s)
 {
 	int node;
-	int i;
-	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
-	struct list_head *slabs_by_inuse =
-		kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
-	unsigned long flags;
+	void *scratch;
+
+	flush_all(s);
 
-	if (!slabs_by_inuse)
+	scratch = kmalloc(sizeof(struct list_head) * s->objects,
+							GFP_KERNEL);
+	if (!scratch)
 		return -ENOMEM;
 
-	flush_all(s);
-	for_each_online_node(node) {
-		n = get_node(s, node);
+	for_each_online_node(node)
+		__kmem_cache_shrink(s, get_node(s, node), scratch);
 
-		if (!n->nr_partial)
-			continue;
+	kfree(scratch);
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
 
-		for (i = 0; i < s->objects; i++)
-			INIT_LIST_HEAD(slabs_by_inuse + i);
+static unsigned long __kmem_cache_defrag(struct kmem_cache *s, int node,
+								void *scratch)
+{
+	unsigned long capacity;
+	unsigned long objects_in_full_slabs;
+	unsigned long ratio;
+	struct kmem_cache_node *n = get_node(s, node);
 
-		spin_lock_irqsave(&n->list_lock, flags);
+	/*
+	 * An insignificant number of partial slabs makes
+	 * the slab not interesting.
+	 */
+	if (n->nr_partial <= MAX_PARTIAL)
+		return 0;
 
-		/*
-		 * Build lists indexed by the items in use in each slab.
-		 *
-		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
-		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
-				/*
-				 * Must hold slab lock here because slab_free
-				 * may have freed the last object and be
-				 * waiting to release the slab.
-				 */
-				list_del(&page->lru);
-				n->nr_partial--;
-				slab_unlock(page);
-				discard_slab(s, page);
-			} else {
-				if (n->nr_partial > MAX_PARTIAL)
-					list_move(&page->lru,
-					slabs_by_inuse + page->inuse);
-			}
-		}
+	capacity = atomic_long_read(&n->nr_slabs) * s->objects;
+	objects_in_full_slabs =
+			(atomic_long_read(&n->nr_slabs) - n->nr_partial)
+							* s->objects;
+	/*
+	 * Worst case calculation: If we would be over the ratio
+	 * even if all partial slabs would only have one object
+	 * then we can skip the further test that would require a scan
+	 * through all the partial page structs to sum up the actual
+	 * number of objects in the partial slabs.
+	 */
+	ratio = (objects_in_full_slabs + 1 * n->nr_partial) * 100 / capacity;
+	if (ratio > s->defrag_ratio)
+		return 0;
 
-		if (n->nr_partial <= MAX_PARTIAL)
-			goto out;
+	/*
+	 * Now for the real calculation. If usage ratio is more than required
+	 * then no defragmentation
+	 */
+	ratio = (objects_in_full_slabs + count_partial(n)) * 100 / capacity;
+	if (ratio > s->defrag_ratio)
+		return 0;
+
+	return __kmem_cache_shrink(s, n, scratch) << s->order;
+}
+
+/*
+ * Defrag slabs on the local node. This is called from the memory reclaim
+ * path.
+ */
+int kmem_cache_defrag(int node)
+{
+	struct kmem_cache *s;
+	unsigned long pages = 0;
+	void *scratch;
+
+	scratch = kmalloc(sizeof(struct list_head) * max_defrag_slab_objects,
+								GFP_KERNEL);
+	if (!scratch)
+		return 0;
+
+	/*
+	 * kmem_cache_defrag may be called from the reclaim path which may be
+	 * called for any page allocator alloc. So there is the danger that we
+	 * get called in a situation where slub already acquired the slub_lock
+	 * for other purposes.
+	 */
+	if (!down_read_trylock(&slub_lock))
+		return 0;
+
+	list_for_each_entry(s, &slab_caches, list) {
 
 		/*
-		 * Rebuild the partial list with the slabs filled up most
-		 * first and the least used slabs at the end.
+		 * The list of slab caches is sorted so that we only
+		 * have to scan the ones capable of defragmentation.
 		 */
-		for (i = s->objects - 1; i >= 0; i--)
-			list_splice(slabs_by_inuse + i, n->partial.prev);
+		if (!s->ops->kick)
+			break;
 
-	out:
-		spin_unlock_irqrestore(&n->list_lock, flags);
-	}
 
-	kfree(slabs_by_inuse);
-	return 0;
+		if (node == -1) {
+			int nid;
+
+			for_each_online_node(nid)
+				pages += __kmem_cache_defrag(s, nid, scratch);
+		} else
+			pages += __kmem_cache_defrag(s, node, scratch);
+	}
+	up_read(&slub_lock);
+	kfree(scratch);
+	return pages;
 }
-EXPORT_SYMBOL(kmem_cache_shrink);
+EXPORT_SYMBOL(kmem_cache_defrag);
 
 /********************************************************************
  *			Basic setup of slabs
@@ -2729,7 +2971,18 @@ struct kmem_cache *kmem_cache_create(con
 	if (s) {
 		if (kmem_cache_open(s, GFP_KERNEL, name,
 				size, align, flags, ctor, ops)) {
-			list_add(&s->list, &slab_caches);
+
+			/*
+			 * Reclaimable slabs first because we may have
+			 * to scan them repeatedly.
+			 */
+			if (ops->kick) {
+				list_add(&s->list, &slab_caches);
+				if (s->objects > max_defrag_slab_objects)
+					max_defrag_slab_objects = s->objects;
+			} else
+				list_add_tail(&s->list, &slab_caches);
+
 			up_write(&slub_lock);
 			raise_kswapd_order(s->order);
 
@@ -3189,19 +3442,6 @@ static int list_locations(struct kmem_ca
 	return n;
 }
 
-static unsigned long count_partial(struct kmem_cache_node *n)
-{
-	unsigned long flags;
-	unsigned long x = 0;
-	struct page *page;
-
-	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, lru)
-		x += page->inuse;
-	spin_unlock_irqrestore(&n->list_lock, flags);
-	return x;
-}
-
 enum slab_stat_type {
 	SL_FULL,
 	SL_PARTIAL,
@@ -3357,6 +3597,20 @@ static ssize_t ops_show(struct kmem_cach
 		x += sprint_symbol(buf + x, (unsigned long)s->ctor);
 		x += sprintf(buf + x, "\n");
 	}
+
+	if (s->ops->get) {
+		x += sprintf(buf + x, "get : ");
+		x += sprint_symbol(buf + x,
+				(unsigned long)s->ops->get);
+		x += sprintf(buf + x, "\n");
+	}
+
+	if (s->ops->kick) {
+		x += sprintf(buf + x, "kick : ");
+		x += sprint_symbol(buf + x,
+				(unsigned long)s->ops->kick);
+		x += sprintf(buf + x, "\n");
+	}
 	return x;
 }
 SLAB_ATTR_RO(ops);
@@ -3567,10 +3821,9 @@ static ssize_t free_calls_show(struct km
 }
 SLAB_ATTR_RO(free_calls);
 
-#ifdef CONFIG_NUMA
 static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
 {
-	return sprintf(buf, "%d\n", s->defrag_ratio / 10);
+	return sprintf(buf, "%d\n", s->defrag_ratio);
 }
 
 static ssize_t defrag_ratio_store(struct kmem_cache *s,
@@ -3579,10 +3832,27 @@ static ssize_t defrag_ratio_store(struct
 	int n = simple_strtoul(buf, NULL, 10);
 
 	if (n < 100)
-		s->defrag_ratio = n * 10;
+		s->defrag_ratio = n;
 	return length;
 }
 SLAB_ATTR(defrag_ratio);
+
+#ifdef CONFIG_NUMA
+static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->remote_node_defrag_ratio / 10);
+}
+
+static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	int n = simple_strtoul(buf, NULL, 10);
+
+	if (n < 100)
+		s->remote_node_defrag_ratio = n * 10;
+	return length;
+}
+SLAB_ATTR(remote_node_defrag_ratio);
 #endif
 
 static struct attribute * slab_attrs[] = {
@@ -3609,11 +3879,12 @@ static struct attribute * slab_attrs[] =
 	&shrink_attr.attr,
 	&alloc_calls_attr.attr,
 	&free_calls_attr.attr,
+	&defrag_ratio_attr.attr,
 #ifdef CONFIG_ZONE_DMA
 	&cache_dma_attr.attr,
 #endif
 #ifdef CONFIG_NUMA
-	&defrag_ratio_attr.attr,
+	&remote_node_defrag_ratio_attr.attr,
 #endif
 	NULL
 };
Index: linux-2.6.22-rc6-mm1/mm/slab.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slab.c	2007-07-06 13:29:06.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slab.c	2007-07-06 20:00:31.000000000 -0700
@@ -2518,6 +2518,11 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+int kmem_cache_defrag(int node)
+{
+	return 0;
+}
+
 /**
  * kmem_cache_destroy - delete a cache
  * @cachep: the cache to destroy
Index: linux-2.6.22-rc6-mm1/mm/slob.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slob.c	2007-07-06 13:29:06.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slob.c	2007-07-06 20:00:31.000000000 -0700
@@ -591,6 +591,11 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+int kmem_cache_defrag(int node)
+{
+	return 0;
+}
+
 int kmem_ptr_validate(struct kmem_cache *a, const void *b)
 {
 	return 0;
Index: linux-2.6.22-rc6-mm1/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/slub_def.h	2007-07-06 13:29:06.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/slub_def.h	2007-07-06 20:00:29.000000000 -0700
@@ -50,9 +50,17 @@ struct kmem_cache {
 #ifdef CONFIG_SLUB_DEBUG
 	struct kobject kobj;	/* For sysfs */
 #endif
-
+	int defrag_ratio;	/*
+				 * objects/possible objects limit. If we have
+				 * less than the specified percentage of
+				 * objects allocated then defrag passes
+				 * will start to occur during reclaim.
+				 */
 #ifdef CONFIG_NUMA
-	int defrag_ratio;
+	int remote_node_defrag_ratio;/*
+				 * Defragmentation through allocation from a
+				 * remote node
+				 */
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
 	struct page *cpu_slab[NR_CPUS];

-- 


* [patch 03/12] Slab defragmentation: Updates to slabinfo.c
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
  2007-07-08  3:05 ` [patch 01/12] Slab defragmentation: Add support for kmem_cache_ops Christoph Lameter
  2007-07-08  3:05 ` [patch 02/12] Slab defragmentation: Core piece Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 04/12] Slab defragmentation: Logic to trigger defragmentation from memory reclaim Christoph Lameter
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slab_defrag_slabinfo_updates --]
[-- Type: text/plain, Size: 5866 bytes --]

-D lists caches that support defragmentation

-C lists caches that use a ctor.

Change field names for defrag_ratio and remote_node_defrag_ratio.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 Documentation/vm/slabinfo.c |   52 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 8 deletions(-)

Index: linux-2.6.22-rc6-mm1/Documentation/vm/slabinfo.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/Documentation/vm/slabinfo.c	2007-07-05 19:00:27.000000000 -0700
+++ linux-2.6.22-rc6-mm1/Documentation/vm/slabinfo.c	2007-07-05 19:04:14.000000000 -0700
@@ -30,6 +30,8 @@ struct slabinfo {
 	int hwcache_align, object_size, objs_per_slab;
 	int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
+	int defrag, ctor;
+	int defrag_ratio, remote_node_defrag_ratio;
 	unsigned long partial, objects, slabs;
 	int numa[MAX_NODES];
 	int numa_partial[MAX_NODES];
@@ -56,6 +58,8 @@ int show_slab = 0;
 int skip_zero = 1;
 int show_numa = 0;
 int show_track = 0;
+int show_defrag = 0;
+int show_ctor = 0;
 int show_first_alias = 0;
 int validate = 0;
 int shrink = 0;
@@ -90,18 +94,20 @@ void fatal(const char *x, ...)
 void usage(void)
 {
 	printf("slabinfo 5/7/2007. (c) 2007 sgi. clameter@sgi.com\n\n"
-		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"slabinfo [-aCDefhilnosSrtTvz1] [-d debugopts] [slab-regexp]\n"
 		"-a|--aliases           Show aliases\n"
+		"-C|--ctor              Show slabs with ctors\n"
 		"-d<options>|--debug=<options> Set/Clear Debug options\n"
-		"-e|--empty		Show empty slabs\n"
+		"-D|--defrag            Show defragmentable caches\n"
+		"-e|--empty             Show empty slabs\n"
 		"-f|--first-alias       Show first alias\n"
 		"-h|--help              Show usage information\n"
 		"-i|--inverted          Inverted list\n"
 		"-l|--slabs             Show slabs\n"
 		"-n|--numa              Show NUMA information\n"
-		"-o|--ops		Show kmem_cache_ops\n"
+		"-o|--ops               Show kmem_cache_ops\n"
 		"-s|--shrink            Shrink slabs\n"
-		"-r|--report		Detailed report on single slabs\n"
+		"-r|--report            Detailed report on single slabs\n"
 		"-S|--Size              Sort by size\n"
 		"-t|--tracking          Show alloc/free information\n"
 		"-T|--Totals            Show summary information\n"
@@ -281,7 +287,7 @@ int line = 0;
 void first_line(void)
 {
 	printf("Name                   Objects Objsize    Space "
-		"Slabs/Part/Cpu  O/S O %%Fr %%Ef Flg\n");
+		"Slabs/Part/Cpu  O/S O %%Ra %%Ef Flg\n");
 }
 
 /*
@@ -324,7 +330,7 @@ void slab_numa(struct slabinfo *s, int m
 		return;
 
 	if (!line) {
-		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
 		for(node = 0; node <= highest_node; node++)
 			printf(" %4d", node);
 		printf("\n----------------------");
@@ -333,6 +339,7 @@ void slab_numa(struct slabinfo *s, int m
 		printf("\n");
 	}
 	printf("%-21s ", mode ? "All slabs" : s->name);
+	printf("%3d ", s->remote_node_defrag_ratio);
 	for(node = 0; node <= highest_node; node++) {
 		char b[20];
 
@@ -406,6 +413,8 @@ void report(struct slabinfo *s)
 		printf("** Slabs are destroyed via RCU\n");
 	if (s->reclaim_account)
 		printf("** Reclaim accounting active\n");
+	if (s->defrag)
+		printf("** Defragmentation at %d%%\n", s->defrag_ratio);
 
 	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
 	printf("------------------------------------------------------------------------\n");
@@ -452,6 +461,12 @@ void slabcache(struct slabinfo *s)
 	if (show_empty && s->slabs)
 		return;
 
+	if (show_defrag && !s->defrag)
+		return;
+
+	if (show_ctor && !s->ctor)
+		return;
+
 	store_size(size_str, slab_size(s));
 	sprintf(dist_str,"%lu/%lu/%d", s->slabs, s->partial, s->cpu_slabs);
 
@@ -462,6 +477,10 @@ void slabcache(struct slabinfo *s)
 		*p++ = '*';
 	if (s->cache_dma)
 		*p++ = 'd';
+	if (s->defrag)
+		*p++ = 'D';
+	if (s->ctor)
+		*p++ = 'C';
 	if (s->hwcache_align)
 		*p++ = 'A';
 	if (s->poison)
@@ -481,7 +500,7 @@ void slabcache(struct slabinfo *s)
 	printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
 		s->name, s->objects, s->object_size, size_str, dist_str,
 		s->objs_per_slab, s->order,
-		s->slabs ? (s->partial * 100) / s->slabs : 100,
+		s->slabs ? (s->objects * 100) / (s->slabs * s->objs_per_slab) : 100,
 		s->slabs ? (s->objects * s->object_size * 100) /
 			(s->slabs * (page_size << s->order)) : 100,
 		flags);
@@ -1071,7 +1090,16 @@ void read_slab_dir(void)
 			decode_numa_list(slab->numa, t);
 			slab->store_user = get_obj("store_user");
 			slab->trace = get_obj("trace");
+			slab->defrag_ratio = get_obj("defrag_ratio");
+			slab->remote_node_defrag_ratio =
+					get_obj("remote_node_defrag_ratio");
 			chdir("..");
+			if (read_slab_obj(slab, "ops")) {
+				if (strstr(buffer, "ctor :"))
+					slab->ctor = 1;
+				if (strstr(buffer, "kick :"))
+					slab->defrag = 1;
+			}
 			if (slab->name[0] == ':')
 				alias_targets++;
 			slab++;
@@ -1121,7 +1149,9 @@ void output_slabs(void)
 
 struct option opts[] = {
 	{ "aliases", 0, NULL, 'a' },
+	{ "ctor", 0, NULL, 'C' },
 	{ "debug", 2, NULL, 'd' },
+	{ "defrag", 0, NULL, 'D' },
 	{ "empty", 0, NULL, 'e' },
 	{ "first-alias", 0, NULL, 'f' },
 	{ "help", 0, NULL, 'h' },
@@ -1146,7 +1176,7 @@ int main(int argc, char *argv[])
 
 	page_size = getpagesize();
 
-	while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzTS",
+	while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzCDTS",
 						opts, NULL)) != -1)
 	switch(c) {
 		case '1':
@@ -1196,6 +1226,12 @@ int main(int argc, char *argv[])
 		case 'z':
 			skip_zero = 0;
 			break;
+		case 'C':
+			show_ctor = 1;
+			break;
+		case 'D':
+			show_defrag = 1;
+			break;
 		case 'T':
 			show_totals = 1;
 			break;

-- 


* [patch 04/12] Slab defragmentation: Logic to trigger defragmentation from memory reclaim
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-07-08  3:05 ` [patch 03/12] Slab defragmentation: Updates to slabinfo.c Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 05/12] Slab defragmentation: Log information to the syslog to show defrag operations Christoph Lameter
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slab_defrag_trigger --]
[-- Type: text/plain, Size: 6141 bytes --]

At some point slab defragmentation needs to be triggered. The logical
point for this is after slab shrinking has been performed in vmscan.c. At
that point the fragmentation ratio of a slab cache may have increased because
objects were freed. So we call kmem_cache_defrag from there.

shrink_slab() in vmscan.c is called in some contexts to do global shrinking
of slabs and in others to do shrinking for a particular zone. Pass the zone
to shrink_slab() so that it can call kmem_cache_defrag() and restrict the
defragmentation to the node that is under memory pressure.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/drop_caches.c     |    2 +-
 include/linux/mm.h   |    2 +-
 include/linux/slab.h |    1 +
 mm/vmscan.c          |   27 ++++++++++++++++++++-------
 4 files changed, 23 insertions(+), 9 deletions(-)

Index: linux-2.6.22-rc6-mm1/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/slab.h	2007-07-04 09:53:59.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/slab.h	2007-07-04 09:56:22.000000000 -0700
@@ -96,6 +96,7 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+int kmem_cache_defrag(int node);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
Index: linux-2.6.22-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/vmscan.c	2007-07-04 09:53:59.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/vmscan.c	2007-07-04 09:59:08.000000000 -0700
@@ -152,10 +152,18 @@ EXPORT_SYMBOL(unregister_shrinker);
  * are eligible for the caller's allocation attempt.  It is used for balancing
  * slab reclaim versus page reclaim.
  *
+ * zone is the zone for which we are shrinking the slabs. If the intent
+ * is to do a global shrink then zone may be NULL. Specification of a
+ * zone is currently only used to limit slab defragmentation to a NUMA node.
+ * The performance of shrink_slab would be better (in particular under NUMA)
+ * if it could be targeted as a whole to the zone that is under memory
+ * pressure but the VFS infrastructure does not allow that at the present
+ * time.
+ *
  * Returns the number of slab objects which we shrunk.
  */
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages)
+			unsigned long lru_pages, struct zone *zone)
 {
 	struct shrinker *shrinker;
 	unsigned long ret = 0;
@@ -218,6 +226,8 @@ unsigned long shrink_slab(unsigned long 
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
+	if (gfp_mask & __GFP_FS)
+		kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
 	return ret;
 }
 
@@ -1163,7 +1173,8 @@ unsigned long try_to_free_pages(struct z
 		if (!priority)
 			disable_swap_token();
 		nr_reclaimed += shrink_zones(priority, zones, &sc);
-		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
+		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages,
+						NULL);
 		if (reclaim_state) {
 			nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -1333,7 +1344,7 @@ loop_again:
 			nr_reclaimed += shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-						lru_pages);
+						lru_pages, zone);
 			nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
 			if (zone->all_unreclaimable)
@@ -1601,7 +1612,7 @@ unsigned long shrink_all_memory(unsigned
 	/* If slab caches are huge, it's better to hit them first */
 	while (nr_slab >= lru_pages) {
 		reclaim_state.reclaimed_slab = 0;
-		shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+		shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL);
 		if (!reclaim_state.reclaimed_slab)
 			break;
 
@@ -1639,7 +1650,7 @@ unsigned long shrink_all_memory(unsigned
 
 			reclaim_state.reclaimed_slab = 0;
 			shrink_slab(sc.nr_scanned, sc.gfp_mask,
-					count_lru_pages());
+					count_lru_pages(), NULL);
 			ret += reclaim_state.reclaimed_slab;
 			if (ret >= nr_pages)
 				goto out;
@@ -1656,7 +1667,8 @@ unsigned long shrink_all_memory(unsigned
 	if (!ret) {
 		do {
 			reclaim_state.reclaimed_slab = 0;
-			shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+			shrink_slab(nr_pages, sc.gfp_mask,
+					count_lru_pages(), NULL);
 			ret += reclaim_state.reclaimed_slab;
 		} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
 	}
@@ -1816,7 +1828,8 @@ static int __zone_reclaim(struct zone *z
 		 * Note that shrink_slab will free memory on all zones and may
 		 * take a long time.
 		 */
-		while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
+		while (shrink_slab(sc.nr_scanned, gfp_mask, order,
+						zone) &&
 			zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
 				slab_reclaimable - nr_pages)
 			;
Index: linux-2.6.22-rc6-mm1/fs/drop_caches.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/fs/drop_caches.c	2007-07-04 09:53:59.000000000 -0700
+++ linux-2.6.22-rc6-mm1/fs/drop_caches.c	2007-07-04 09:56:22.000000000 -0700
@@ -52,7 +52,7 @@ void drop_slab(void)
 	int nr_objects;
 
 	do {
-		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
 	} while (nr_objects > 10);
 }
 
Index: linux-2.6.22-rc6-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/mm.h	2007-07-04 09:53:59.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/mm.h	2007-07-04 09:56:22.000000000 -0700
@@ -1248,7 +1248,7 @@ int in_gate_area_no_task(unsigned long a
 int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages);
+			unsigned long lru_pages, struct zone *zone);
 extern void drop_pagecache_sb(struct super_block *);
 void drop_pagecache(void);
 void drop_slab(void);

-- 


* [patch 05/12] Slab defragmentation: Log information to the syslog to show defrag operations
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-07-08  3:05 ` [patch 04/12] Slab defragmentation: Logic to trigger defragmentation from memory reclaim Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 06/12] Slab defragmentation: Support dentry defragmentation Christoph Lameter
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slub_defrag_tracing --]
[-- Type: text/plain, Size: 1571 bytes --]

Dump information into the syslog during defragmentation actions to show that
something is occurring and what effect it has.

This is likely only useful in mm for testing and verification.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: linux-2.6.22-rc6-mm1/mm/slub.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slub.c	2007-07-06 20:04:54.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slub.c	2007-07-06 20:06:45.000000000 -0700
@@ -2599,6 +2599,8 @@ static unsigned long sort_partial_list(s
 	return freed;
 }
 
+#define NR_INUSE 40
+
 /*
  * Shrink the slab cache on a particular node of the cache
  */
@@ -2610,6 +2612,8 @@ static unsigned long __kmem_cache_shrink
 	LIST_HEAD(zaplist);
 	int freed;
 	int inuse;
+	int nr[NR_INUSE] = { 0, };
+	int i;
 
 	spin_lock_irqsave(&n->list_lock, flags);
 	freed = sort_partial_list(s, n, scratch);
@@ -2646,6 +2650,8 @@ static unsigned long __kmem_cache_shrink
 		n->nr_partial--;
 		SetSlabFrozen(page);
 		slab_unlock(page);
+		if (inuse < NR_INUSE)
+			nr[inuse]++;
 	}
 
 	spin_unlock_irqrestore(&n->list_lock, flags);
@@ -2659,6 +2665,13 @@ static unsigned long __kmem_cache_shrink
 		if (__kmem_cache_vacate(s, page, flags, scratch) == 0)
 			freed++;
 	}
+	printk(KERN_INFO "Slab %s: Defrag freed %d pages. PartSlab config=",
+			s->name, freed << s->order);
+
+	for (i = 0; i < NR_INUSE; i++)
+		if (nr[i])
+			printk(" %d=%d", i, nr[i]);
+	printk("\n");
 	return freed;
 }
 

-- 


* [patch 06/12] Slab defragmentation: Support dentry defragmentation
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (4 preceding siblings ...)
  2007-07-08  3:05 ` [patch 05/12] Slab defragmentation: Log information to the syslog to show defrag operations Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 07/12] Slab defragmentation: Support for buffer_head defrag Christoph Lameter
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slub_defrag_dentry --]
[-- Type: text/plain, Size: 5000 bytes --]

get() takes the dcache lock and then uses dget_locked() to obtain a
reference to the dentry. An additional complication is that the dentry
may be in the process of being freed, or it may just have been allocated.
We add an additional flag to d_flags so that we can determine the
status of an object.

kick() is called after get() has been used and after the slab has dropped
all of its own locks. The dentry pruning for unused entries works in a
straightforward way.
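
A rough userspace sketch of this two-phase protocol is shown below (this is
not kernel code: toy_dentry, its valid flag and the mutex merely stand in
for struct dentry, DCACHE_ENTRY_VALID and dcache_lock, and the abort on an
elevated reference count done by the real kick_dentries() is left out):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_dentry {
	int refcount;			/* stands in for d_count */
	int valid;			/* stands in for DCACHE_ENTRY_VALID */
};

static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;

/* Phase 1 (get): under the cache lock, pin every object that is not dying. */
static void get_toys(int nr, void **v)
{
	pthread_mutex_lock(&toy_lock);
	for (int i = 0; i < nr; i++) {
		struct toy_dentry *d = v[i];

		if (!d->valid)
			v[i] = NULL;		/* being set up or torn down: skip */
		else
			d->refcount++;		/* like dget_locked() */
	}
	pthread_mutex_unlock(&toy_lock);
}

/* Phase 2 (kick): allocator locks are dropped; release refs, free unused objects. */
static void kick_toys(int nr, void **v)
{
	pthread_mutex_lock(&toy_lock);
	for (int i = 0; i < nr; i++) {
		struct toy_dentry *d = v[i];

		if (!d)
			continue;
		if (--d->refcount == 0) {	/* we held the last reference */
			d->valid = 0;
			free(d);
		}
	}
	pthread_mutex_unlock(&toy_lock);
}

int main(void)
{
	void *v[2];

	for (int i = 0; i < 2; i++) {
		struct toy_dentry *d = malloc(sizeof(*d));

		d->refcount = 0;
		d->valid = 1;
		v[i] = d;
	}
	get_toys(2, v);
	kick_toys(2, v);
	printf("vacated 2 toy objects\n");
	return 0;
}

In the real implementation the kick phase also gives up as soon as it finds
an object it cannot free, since a single pinned dentry already prevents the
slab page from being vacated.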

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/dcache.c            |  113 +++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/dcache.h |    5 ++
 2 files changed, 110 insertions(+), 8 deletions(-)

Index: linux-2.6.22-rc4-mm2/fs/dcache.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/fs/dcache.c	2007-06-26 12:00:37.000000000 -0700
+++ linux-2.6.22-rc4-mm2/fs/dcache.c	2007-06-26 12:08:49.000000000 -0700
@@ -135,6 +135,7 @@ static struct dentry *d_kill(struct dent
 
 	list_del(&dentry->d_u.d_child);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
+	dentry->d_flags &= ~DCACHE_ENTRY_VALID;
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	parent = dentry->d_parent;
@@ -951,6 +952,7 @@ struct dentry *d_alloc(struct dentry * p
 	if (parent)
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
 	dentry_stat.nr_dentry++;
+	dentry->d_flags |= DCACHE_ENTRY_VALID;
 	spin_unlock(&dcache_lock);
 
 	return dentry;
@@ -2108,18 +2110,113 @@ static void __init dcache_init_early(voi
 		INIT_HLIST_HEAD(&dentry_hashtable[loop]);
 }
 
+/*
+ * The slab is holding off frees. Thus we can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+	struct dentry *dentry;
+	int i;
+
+	spin_lock(&dcache_lock);
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		/*
+		 * if DCACHE_ENTRY_VALID is not set then the dentry
+		 * may be already in the process of being freed.
+		 */
+		if (!(dentry->d_flags & DCACHE_ENTRY_VALID))
+			v[i] = NULL;
+		else
+			dget_locked(dentry);
+	}
+	spin_unlock(&dcache_lock);
+	return 0;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the
+ * refcount we obtained earlier and also rid of the
+ * object.
+ */
+static void kick_dentries(struct kmem_cache *s, int nr, void **v, void *private)
+{
+	struct dentry *dentry;
+	int abort = 0;
+	int i;
+
+	/*
+	 * First invalidate the dentries without holding the dcache lock
+	 */
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+
+		if (dentry)
+			d_invalidate(dentry);
+	}
+
+	/*
+	 * If we are the last one holding a reference then the dentries can
+	 * be freed. We  need the dcache_lock.
+	 */
+	spin_lock(&dcache_lock);
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		if (!dentry)
+			continue;
+
+		if (abort)
+			goto put_dentry;
+
+		spin_lock(&dentry->d_lock);
+		if (atomic_read(&dentry->d_count) > 1) {
+			/*
+			 * Reference count was increased. This means that we
+			 * cannot free one object which makes it impossible
+			 * to reclaim this slab. So it is best to abandon the
+			 * freeing of further objects.
+			 */
+			abort = 1;
+			spin_unlock(&dentry->d_lock);
+put_dentry:
+			spin_unlock(&dcache_lock);
+			dput(dentry);
+			spin_lock(&dcache_lock);
+			continue;
+		}
+
+		/* Remove from LRU */
+		if (!list_empty(&dentry->d_lru)) {
+			dentry_stat.nr_unused--;
+			list_del_init(&dentry->d_lru);
+		}
+		/* Drop the entry */
+		prune_one_dentry(dentry, 1);
+	}
+	spin_unlock(&dcache_lock);
+
+	/*
+	 * dentries are freed using RCU so we need to wait until RCU
+	 * operations are complete
+	 */
+	if (!abort)
+		synchronize_rcu();
+}
+
+static struct kmem_cache_ops dentry_kmem_cache_ops = {
+	.get = get_dentries,
+	.kick = kick_dentries,
+};
+
 static void __init dcache_init(unsigned long mempages)
 {
 	int loop;
 
-	/* 
-	 * A constructor could be added for stable state like the lists,
-	 * but it is probably not worth it because of the cache nature
-	 * of the dcache. 
-	 */
-	dentry_cache = KMEM_CACHE(dentry,
-		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-	
+	dentry_cache = KMEM_CACHE_OPS(dentry,
+		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+		&dentry_kmem_cache_ops);
+
 	register_shrinker(&dcache_shrinker);
 
 	/* Hash may have been set up in dcache_init_early */
Index: linux-2.6.22-rc4-mm2/include/linux/dcache.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/dcache.h	2007-06-26 12:00:37.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/dcache.h	2007-06-26 12:07:36.000000000 -0700
@@ -177,6 +177,11 @@ d_iput:		no		no		no       yes
 
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
 
+#define DCACHE_ENTRY_VALID	0x0040	/*
+					 * Entry is valid and not in the
+					 * process of being created or
+					 * destroyed.
+					 */
 extern spinlock_t dcache_lock;
 
 /**

-- 


* [patch 07/12] Slab defragmentation: Support for buffer_head defrag
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (5 preceding siblings ...)
  2007-07-08  3:05 ` [patch 06/12] Slab defragmentation: Support dentry defragmentation Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 08/12] Slab defragmentation: Support generic defragmentation for inode slab caches Christoph Lameter
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slub_defrag_buffer_heads --]
[-- Type: text/plain, Size: 2434 bytes --]

Limited defragmentation support for buffer heads. Simply try to free the
buffers in a sparsely populated slab page.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/buffer.c |   67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc6-mm1/fs/buffer.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/fs/buffer.c	2007-07-04 11:14:01.000000000 -0700
+++ linux-2.6.22-rc6-mm1/fs/buffer.c	2007-07-04 17:23:02.000000000 -0700
@@ -3078,12 +3078,75 @@ static int buffer_cpu_notify(struct noti
 	return NOTIFY_OK;
 }
 
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffers will later take appropriate locks to remove the
+ * buffers if they are still there.
+ *
+ * TODO: Write out dirty buffers to increase the chance that kick_buffers
+ * will be successful.
+ */
+static void *get_buffers(struct kmem_cache *s, int nr, void **v)
+{
+	struct page *page;
+	struct buffer_head *bh;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		bh = v[i];
+		page = bh->b_page;
+		if (page && PagePrivate(page) && get_page_unless_zero(page))
+			v[i] = page;
+		else
+			v[i] = NULL;
+	}
+	return NULL;
+}
+
+/*
+ * Despite its name: kick_buffers operates on a list of pointers to
+ * page structs that was set up by get_buffers
+ */
+static void kick_buffers(struct kmem_cache *s, int nr, void **v,
+							void *private)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		page = v[i];
+
+		if (!page)
+			continue;
+
+		if (!TestSetPageLocked(page)) {
+			if (PagePrivate(page))
+				try_to_free_buffers(page);
+			unlock_page(page);
+		}
+		put_page(page);
+	}
+}
+
+static struct kmem_cache_ops buffer_head_kmem_cache_ops = {
+	.get = get_buffers,
+	.kick = kick_buffers,
+};
+
+
 void __init buffer_init(void)
 {
 	int nrpages;
 
-	bh_cachep = KMEM_CACHE(buffer_head,
-			SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+	bh_cachep = KMEM_CACHE_OPS(buffer_head,
+			SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+			&buffer_head_kmem_cache_ops);
 
 	/*
 	 * Limit the bh occupancy to 10% of ZONE_NORMAL

-- 


* [patch 08/12] Slab defragmentation: Support generic defragmentation for inode slab caches
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (6 preceding siblings ...)
  2007-07-08  3:05 ` [patch 07/12] Slab defragmentation: Support for buffer_head defrag Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 09/12] Slab defragmentation: Support defragmentation for extX filesystem inodes Christoph Lameter
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slub_defrag_inode_generic --]
[-- Type: text/plain, Size: 4142 bytes --]

This implements the ability to remove inodes in a particular slab
from the inode cache. In order to remove an inode we may have to write out
the pages of the inode and the inode itself, and remove the dentries
referring to the inode.

Provide generic functionality so that filesystems with their own inode
caches can also tie into the defragmentation functions made available
here, as sketched below.
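
For a filesystem that embeds struct inode in its own inode structure the
intended hookup looks roughly like the following sketch ("foofs", its
inode_info type and its init_once constructor are hypothetical; the
ext2/3/4 and reiserfs patches later in this series follow this pattern):

static struct kmem_cache *foofs_inode_cachep;

static void *foofs_get_inodes(struct kmem_cache *s, int nr, void **v)
{
	/* Shift the fs-inode pointers to the embedded struct inode. */
	return fs_get_inodes(s, nr, v,
			offsetof(struct foofs_inode_info, vfs_inode));
}

static struct kmem_cache_ops foofs_kmem_cache_ops = {
	.get = foofs_get_inodes,
	.kick = kick_inodes
};

static int init_inodecache(void)
{
	foofs_inode_cachep = kmem_cache_create("foofs_inode_cache",
					sizeof(struct foofs_inode_info), 0,
					SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
					init_once, &foofs_kmem_cache_ops);
	return foofs_inode_cachep ? 0 : -ENOMEM;
}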

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/inode.c         |  101 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h |    5 ++
 2 files changed, 105 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc6-mm1/fs/inode.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/fs/inode.c	2007-07-03 17:19:26.000000000 -0700
+++ linux-2.6.22-rc6-mm1/fs/inode.c	2007-07-03 17:28:46.000000000 -0700
@@ -1351,6 +1351,105 @@ static int __init set_ihash_entries(char
 }
 __setup("ihash_entries=", set_ihash_entries);
 
+static void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	int i;
+
+	spin_lock(&inode_lock);
+	for (i = 0; i < nr; i++) {
+		struct inode *inode = v[i];
+
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+			v[i] = NULL;
+		else
+			__iget(inode);
+	}
+	spin_unlock(&inode_lock);
+	return NULL;
+}
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * structures. The offset is the offset of the struct inode in the fs inode.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+						unsigned long offset)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		v[i] += offset;
+
+	return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+	struct inode *inode;
+	int i;
+	int abort = 0;
+	LIST_HEAD(freeable);
+	struct super_block *sb;
+
+	for (i = 0; i < nr; i++) {
+		inode = v[i];
+		if (!inode)
+			continue;
+
+		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+			if (remove_inode_buffers(inode))
+				invalidate_mapping_pages(&inode->i_data,
+								0, -1);
+		}
+
+		/* Invalidate children and dentry */
+		if (S_ISDIR(inode->i_mode)) {
+			struct dentry *d = d_find_alias(inode);
+
+			if (d) {
+				d_invalidate(d);
+				dput(d);
+			}
+		}
+
+		if (inode->i_state & I_DIRTY)
+			write_inode_now(inode, 1);
+
+		d_prune_aliases(inode);
+	}
+
+	mutex_lock(&iprune_mutex);
+	for (i = 0; i < nr; i++) {
+		inode = v[i];
+		if (!inode)
+			continue;
+
+		sb = inode->i_sb;
+		iput(inode);
+		if (abort || !(sb->s_flags & MS_ACTIVE))
+			continue;
+
+		spin_lock(&inode_lock);
+		abort =  !can_unuse(inode);
+
+		if (!abort) {
+			list_move(&inode->i_list, &freeable);
+			inode->i_state |= I_FREEING;
+			inodes_stat.nr_unused--;
+		}
+		spin_unlock(&inode_lock);
+	}
+	dispose_list(&freeable);
+	mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
+static struct kmem_cache_ops inode_kmem_cache_ops = {
+	.get = get_inodes,
+	.kick = kick_inodes
+};
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1389,7 +1488,7 @@ void __init inode_init(unsigned long mem
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),
 					 init_once,
-					 NULL);
+					 &inode_kmem_cache_ops);
 	register_shrinker(&icache_shrinker);
 
 	/* Hash may have been set up in inode_init_early */
Index: linux-2.6.22-rc6-mm1/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/fs.h	2007-07-03 17:19:27.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/fs.h	2007-07-03 17:28:46.000000000 -0700
@@ -1769,6 +1769,11 @@ static inline void insert_inode_hash(str
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+						unsigned long offset);
+
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);

-- 


* [patch 09/12] Slab defragmentation: Support defragmentation for extX filesystem inodes
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (7 preceding siblings ...)
  2007-07-08  3:05 ` [patch 08/12] Slab defragmentation: Support generic defragmentation for inode slab caches Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 10/12] Slab defragmentation: Support inode defragmentation for xfs Christoph Lameter
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slub_defrag_fs_ext234 --]
[-- Type: text/plain, Size: 3334 bytes --]

Use the generic inode defragmentation API established earlier to support all
extX filesystems.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/ext2/super.c |   16 ++++++++++++++--
 fs/ext3/super.c |   14 +++++++++++++-
 fs/ext4/super.c |   14 +++++++++++++-
 3 files changed, 40 insertions(+), 4 deletions(-)

Index: linux-2.6.22-rc6-mm1/fs/ext2/super.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/fs/ext2/super.c	2007-07-03 17:19:26.000000000 -0700
+++ linux-2.6.22-rc6-mm1/fs/ext2/super.c	2007-07-03 17:28:49.000000000 -0700
@@ -168,14 +168,26 @@ static void init_once(void * foo, struct
 	mutex_init(&ei->truncate_mutex);
 	inode_init_once(&ei->vfs_inode);
 }
- 
+
+static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext2_inode_info, vfs_inode));
+}
+
+static struct kmem_cache_ops ext2_kmem_cache_ops = {
+	.get = ext2_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
 					     sizeof(struct ext2_inode_info),
 					     0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD),
-					     init_once, NULL);
+					     init_once,
+					     &ext2_kmem_cache_ops);
 	if (ext2_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
Index: linux-2.6.22-rc6-mm1/fs/ext3/super.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/fs/ext3/super.c	2007-07-03 17:19:26.000000000 -0700
+++ linux-2.6.22-rc6-mm1/fs/ext3/super.c	2007-07-03 17:28:49.000000000 -0700
@@ -484,13 +484,25 @@ static void init_once(void * foo, struct
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext3_inode_info, vfs_inode));
+}
+
+static struct kmem_cache_ops ext3_kmem_cache_ops = {
+	.get = ext3_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
 					     sizeof(struct ext3_inode_info),
 					     0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD),
-					     init_once, NULL);
+					     init_once,
+					     &ext3_kmem_cache_ops);
 	if (ext3_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
Index: linux-2.6.22-rc6-mm1/fs/ext4/super.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c	2007-07-03 17:19:26.000000000 -0700
+++ linux-2.6.22-rc6-mm1/fs/ext4/super.c	2007-07-03 17:28:49.000000000 -0700
@@ -544,13 +544,25 @@ static void init_once(void * foo, struct
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext4_inode_info, vfs_inode));
+}
+
+static struct kmem_cache_ops ext4_kmem_cache_ops = {
+	.get = ext4_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
 					     sizeof(struct ext4_inode_info),
 					     0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD),
-					     init_once, NULL);
+					     init_once,
+					     &ext4_kmem_cache_ops);
 	if (ext4_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;

-- 


* [patch 10/12] Slab defragmentation: Support inode defragmentation for xfs
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (8 preceding siblings ...)
  2007-07-08  3:05 ` [patch 09/12] Slab defragmentation: Support defragmentation for extX filesystem inodes Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 11/12] Slab defragmentation: Support reiserfs inode defragmentation Christoph Lameter
  2007-07-08  3:05 ` [patch 12/12] Slab defragmentation: kmem_cache_vacate for antifrag / memory compaction Christoph Lameter
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slub_defrag_fs_xfs --]
[-- Type: text/plain, Size: 3279 bytes --]

Add slab defrag support.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/xfs/linux-2.6/kmem.h      |    5 +++--
 fs/xfs/linux-2.6/xfs_buf.c   |    2 +-
 fs/xfs/linux-2.6/xfs_super.c |   13 ++++++++++++-
 fs/xfs/xfs_vfsops.c          |    6 +++---
 4 files changed, 19 insertions(+), 7 deletions(-)

Index: slub/fs/xfs/linux-2.6/kmem.h
===================================================================
--- slub.orig/fs/xfs/linux-2.6/kmem.h	2007-06-06 13:08:09.000000000 -0700
+++ slub/fs/xfs/linux-2.6/kmem.h	2007-06-06 13:32:58.000000000 -0700
@@ -79,9 +79,10 @@ kmem_zone_init(int size, char *zone_name
 
 static inline kmem_zone_t *
 kmem_zone_init_flags(int size, char *zone_name, unsigned long flags,
-		     void (*construct)(void *, kmem_zone_t *, unsigned long))
+		     void (*construct)(void *, kmem_zone_t *, unsigned long),
+		     const struct kmem_cache_ops *ops)
 {
-	return kmem_cache_create(zone_name, size, 0, flags, construct, NULL);
+	return kmem_cache_create(zone_name, size, 0, flags, construct, ops);
 }
 
 static inline void
Index: slub/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- slub.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-06-06 13:08:09.000000000 -0700
+++ slub/fs/xfs/linux-2.6/xfs_buf.c	2007-06-06 13:32:58.000000000 -0700
@@ -1834,7 +1834,7 @@ xfs_buf_init(void)
 #endif
 
 	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
-						KM_ZONE_HWALIGN, NULL);
+						KM_ZONE_HWALIGN, NULL, NULL);
 	if (!xfs_buf_zone)
 		goto out_free_trace_buf;
 
Index: slub/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- slub.orig/fs/xfs/linux-2.6/xfs_super.c	2007-06-06 13:08:09.000000000 -0700
+++ slub/fs/xfs/linux-2.6/xfs_super.c	2007-06-06 13:32:58.000000000 -0700
@@ -355,13 +355,24 @@ xfs_fs_inode_init_once(
 	inode_init_once(vn_to_inode((bhv_vnode_t *)vnode));
 }
 
+static void *xfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v, offsetof(bhv_vnode_t, v_inode));
+};
+
+static struct kmem_cache_ops xfs_kmem_cache_ops = {
+	.get = xfs_get_inodes,
+	.kick = kick_inodes
+};
+
 STATIC int
 xfs_init_zones(void)
 {
 	xfs_vnode_zone = kmem_zone_init_flags(sizeof(bhv_vnode_t), "xfs_vnode",
 					KM_ZONE_HWALIGN | KM_ZONE_RECLAIM |
 					KM_ZONE_SPREAD,
-					xfs_fs_inode_init_once);
+					xfs_fs_inode_init_once,
+					&xfs_kmem_cache_ops);
 	if (!xfs_vnode_zone)
 		goto out;
 
Index: slub/fs/xfs/xfs_vfsops.c
===================================================================
--- slub.orig/fs/xfs/xfs_vfsops.c	2007-06-06 15:19:52.000000000 -0700
+++ slub/fs/xfs/xfs_vfsops.c	2007-06-06 15:20:36.000000000 -0700
@@ -109,13 +109,13 @@ xfs_init(void)
 	xfs_inode_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
 					KM_ZONE_HWALIGN | KM_ZONE_RECLAIM |
-					KM_ZONE_SPREAD, NULL);
+					KM_ZONE_SPREAD, NULL, NULL);
 	xfs_ili_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_log_item_t), "xfs_ili",
-					KM_ZONE_SPREAD, NULL);
+					KM_ZONE_SPREAD, NULL, NULL);
 	xfs_chashlist_zone =
 		kmem_zone_init_flags(sizeof(xfs_chashlist_t), "xfs_chashlist",
-					KM_ZONE_SPREAD, NULL);
+					KM_ZONE_SPREAD, NULL, NULL);
 
 	/*
 	 * Allocate global trace buffers.

-- 


* [patch 11/12] Slab defragmentation: Support reiserfs inode defragmentation
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (9 preceding siblings ...)
  2007-07-08  3:05 ` [patch 10/12] Slab defragmentation: Support inode defragmentation for xfs Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  2007-07-08  3:05 ` [patch 12/12] Slab defragmentation: kmem_cache_vacate for antifrag / memory compaction Christoph Lameter
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slub_defrag_fs_reiser --]
[-- Type: text/plain, Size: 1168 bytes --]

Add inode defrag support

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/reiserfs/super.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Index: slub/fs/reiserfs/super.c
===================================================================
--- slub.orig/fs/reiserfs/super.c	2007-06-07 14:09:36.000000000 -0700
+++ slub/fs/reiserfs/super.c	2007-06-07 14:30:49.000000000 -0700
@@ -520,6 +520,17 @@ static void init_once(void *foo, struct 
 #endif
 }
 
+static void *reiserfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+			offsetof(struct reiserfs_inode_info, vfs_inode));
+}
+
+struct kmem_cache_ops reiserfs_kmem_cache_ops = {
+	.get = reiserfs_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
@@ -527,7 +538,8 @@ static int init_inodecache(void)
 							 reiserfs_inode_info),
 						  0, (SLAB_RECLAIM_ACCOUNT|
 							SLAB_MEM_SPREAD),
-						  init_once, NULL);
+						  init_once,
+						  &reiserfs_kmem_cache_ops);
 	if (reiserfs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;

-- 


* [patch 12/12] Slab defragmentation: kmem_cache_vacate for antifrag / memory compaction
  2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
                   ` (10 preceding siblings ...)
  2007-07-08  3:05 ` [patch 11/12] Slab defragmentation: Support reiserfs inode defragmentation Christoph Lameter
@ 2007-07-08  3:05 ` Christoph Lameter
  11 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-07-08  3:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, dgc

[-- Attachment #1: slab_defrag_kmem_cache_vacate --]
[-- Type: text/plain, Size: 6810 bytes --]

[Only for review; untested. Waiting for Mel to get around to using this to
improve memory compaction or antifragmentation.]

Special function kmem_cache_vacate() to push out the objects in a
specified slab. In order to make that work we have to handle slab page
allocations in such a way that we can determine whether a slab is valid
whenever we access it, regardless of where it is in its lifetime.

A valid slab that can be freed has PageSlab(page) set and page->inuse > 0.
So we need to make sure in allocate_slab() that page->inuse is zero before
PageSlab is set; otherwise kmem_cache_vacate() may operate on a slab that
has not been properly set up yet.
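
The ordering argument can be illustrated with a small userspace sketch using
C11 atomics; the release store stands in for the smp_wmb() plus the PageSlab
store in allocate_slab(), the acquire load stands in for the barrier implied
by slab_lock(), and all names are illustrative only:

#include <stdatomic.h>
#include <stdio.h>

struct toy_page {
	int inuse;		/* plain field, like page->inuse */
	atomic_int is_slab;	/* stands in for the PageSlab bit */
};

/* Allocation side: make inuse visible as 0 before publishing "this is a slab". */
static void publish_slab(struct toy_page *p)
{
	p->inuse = 0;
	atomic_store_explicit(&p->is_slab, 1, memory_order_release);
}

/* Vacate side: only trust inuse after having observed the slab flag. */
static int vacate_candidate(struct toy_page *p)
{
	if (!atomic_load_explicit(&p->is_slab, memory_order_acquire))
		return 0;	/* not (or no longer) a slab page */
	return p->inuse > 0;	/* inuse == 0: still being set up or torn down */
}

int main(void)
{
	struct toy_page p = { .inuse = 0, .is_slab = 0 };

	publish_slab(&p);
	printf("vacate candidate: %d\n", vacate_candidate(&p));
	return 0;
}

Because inuse is published as zero before the flag, an observer that sees the
flag while inuse is still zero correctly treats the page as not yet vacatable,
which is what the reordered allocate_slab() in the patch below guarantees.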

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h |    1 
 mm/slab.c            |    9 ++++
 mm/slob.c            |    9 ++++
 mm/slub.c            |  105 +++++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 117 insertions(+), 7 deletions(-)

Index: linux-2.6.22-rc6-mm1/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/slab.h	2007-07-05 19:05:02.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/slab.h	2007-07-05 19:05:08.000000000 -0700
@@ -97,6 +97,7 @@ unsigned int kmem_cache_size(struct kmem
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 int kmem_cache_defrag(int node);
+int kmem_cache_vacate(struct page *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
Index: linux-2.6.22-rc6-mm1/mm/slab.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slab.c	2007-07-05 19:00:20.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slab.c	2007-07-05 19:05:08.000000000 -0700
@@ -2523,6 +2523,15 @@ int kmem_cache_defrag(int percent, int n
 	return 0;
 }
 
+/*
+ * SLAB does not support slab defragmentation
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_vacate);
+
 /**
  * kmem_cache_destroy - delete a cache
  * @cachep: the cache to destroy
Index: linux-2.6.22-rc6-mm1/mm/slob.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slob.c	2007-07-05 19:00:20.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slob.c	2007-07-05 19:05:08.000000000 -0700
@@ -596,6 +596,15 @@ int kmem_cache_defrag(int percentage, in
 	return 0;
 }
 
+/*
+ * SLOB does not support slab defragmentation
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_vacate);
+
 int kmem_ptr_validate(struct kmem_cache *a, const void *b)
 {
 	return 0;
Index: linux-2.6.22-rc6-mm1/mm/slub.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slub.c	2007-07-05 19:01:48.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slub.c	2007-07-05 19:05:08.000000000 -0700
@@ -1041,6 +1041,7 @@ static inline int slab_pad_check(struct 
 static inline int check_object(struct kmem_cache *s, struct page *page,
 			void *object, int active) { return 1; }
 static inline void add_full(struct kmem_cache_node *n, struct page *page) {}
+static inline void remove_full(struct kmem_cache *s, struct page *page) {}
 static inline void kmem_cache_open_debug_check(struct kmem_cache *s) {}
 #define slub_debug 0
 #endif
@@ -1106,12 +1107,11 @@ static struct page *new_slab(struct kmem
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
+
+	page->inuse = 0;
+	page->lockless_freelist = NULL;
 	page->offset = s->offset / sizeof(void *);
 	page->slab = s;
-	page->flags |= 1 << PG_slab;
-	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
-			SLAB_STORE_USER | SLAB_TRACE))
-		SetSlabDebug(page);
 
 	start = page_address(page);
 	end = start + s->objects * s->size;
@@ -1129,9 +1129,18 @@ static struct page *new_slab(struct kmem
 	set_freepointer(s, last, NULL);
 
 	page->freelist = start;
-	page->lockless_freelist = NULL;
-	page->inuse = 0;
-out:
+
+	/*
+	 * page->inuse must be 0 when PageSlab(page) becomes
+	 * true so that defrag knows that this slab is not in use.
+	 */
+	smp_wmb();
+	__SetPageSlab(page);
+	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
+			SLAB_STORE_USER | SLAB_TRACE))
+		SetSlabDebug(page);
+
+ out:
 	if (flags & __GFP_WAIT)
 		local_irq_disable();
 	return page;
@@ -2660,6 +2669,88 @@ static unsigned long __kmem_cache_shrink
 }
 
 /*
+ * Get a page off a list and freeze it. Must be holding slab lock.
+ */
+static void freeze_from_list(struct kmem_cache *s, struct page *page)
+{
+	if (page->inuse < s->objects)
+		remove_partial(s, page);
+	else if (s->flags & SLAB_STORE_USER)
+		remove_full(s, page);
+	SetSlabFrozen(page);
+}
+
+/*
+ * Attempt to free objects in a page. Return 1 if successful.
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	unsigned long flags;
+	struct kmem_cache *s;
+	int vacated = 0;
+	void **vector = NULL;
+
+	/*
+	 * Get a reference to the page. Return if it's freed or being freed.
+	 * This is necessary to make sure that the page does not vanish
+	 * from under us before we are able to check the result.
+	 */
+	if (!get_page_unless_zero(page))
+		return 0;
+
+	if (!PageSlab(page))
+		goto out;
+
+	s = page->slab;
+	if (!s)
+		goto out;
+
+	vector = kmalloc(s->objects * sizeof(void *), GFP_KERNEL);
+	if (!vector)
+		goto out2;
+
+	local_irq_save(flags);
+	/*
+	 * The implicit memory barrier in slab_lock guarantees that page->inuse
+	 * is loaded after PageSlab(page) has been established to be true.
+	 * Only relevant for a newly created slab.
+	 */
+	slab_lock(page);
+
+	/*
+	 * We may now have locked a page that may be in various stages of
+	 * being freed. If the PageSlab bit is off then we have already
+	 * reached the page allocator. If page->inuse is zero then we are
+	 * in SLUB but freeing or allocating the page.
+	 * page->inuse is never modified without the slab lock held.
+	 * Also abort if the page happens to be already frozen. If it's
+	 * Also abort if the page happens to be already frozen. If its
+	 * frozen then a concurrent vacate may be in progress.
+	 */
+	if (!PageSlab(page) || SlabFrozen(page) || !page->inuse)
+		goto out_locked;
+
+	/*
+	 * We are holding a lock on a slab page and all operations on the
+	 * slab are blocking.
+	 */
+	if (!s->ops->get || !s->ops->kick)
+		goto out_locked;
+	freeze_from_list(s, page);
+	vacated = __kmem_cache_vacate(s, page, flags, vector);
+out:
+	kfree(vector);
+out2:
+	put_page(page);
+	return vacated == 0;
+out_locked:
+	slab_unlock(page);
+	local_irq_restore(flags);
+	goto out;
+
+}
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up

-- 


Thread overview: 13+ messages
2007-07-08  3:05 [patch 00/12] Slab defragmentation V4 Christoph Lameter
2007-07-08  3:05 ` [patch 01/12] Slab defragmentation: Add support for kmem_cache_ops Christoph Lameter
2007-07-08  3:05 ` [patch 02/12] Slab defragmentation: Core piece Christoph Lameter
2007-07-08  3:05 ` [patch 03/12] Slab defragmentation: Updates to slabinfo.c Christoph Lameter
2007-07-08  3:05 ` [patch 04/12] Slab defragmentation: Logic to trigger defragmentation from memory reclaim Christoph Lameter
2007-07-08  3:05 ` [patch 05/12] Slab defragmentation: Log information to the syslog to show defrag operations Christoph Lameter
2007-07-08  3:05 ` [patch 06/12] Slab defragmentation: Support dentry defragmentation Christoph Lameter
2007-07-08  3:05 ` [patch 07/12] Slab defragmentation: Support for buffer_head defrag Christoph Lameter
2007-07-08  3:05 ` [patch 08/12] Slab defragmentation: Support generic defragmentation for inode slab caches Christoph Lameter
2007-07-08  3:05 ` [patch 09/12] Slab defragmentation: Support defragmentation for extX filesystem inodes Christoph Lameter
2007-07-08  3:05 ` [patch 10/12] Slab defragmentation: Support inode defragmentation for xfs Christoph Lameter
2007-07-08  3:05 ` [patch 11/12] Slab defragmentation: Support reiserfs inode defragmentation Christoph Lameter
2007-07-08  3:05 ` [patch 12/12] Slab defragmentation: kmem_cache_vacate for antifrag / memory compaction Christoph Lameter
