linux-mm.kvack.org archive mirror
* [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes
@ 2007-05-04 22:15 clameter
  2007-05-04 22:15 ` [RFC 1/3] SLUB: slab_ops instead of constructors / destructors clameter
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: clameter @ 2007-05-04 22:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dgc, Eric Dumazet, Mel Gorman

I originally intended this for the 2.6.23 development cycle, but since there
is an aggressive push for SLUB I thought we may want to introduce it earlier.
Note that this introduces new locking approaches that we may need to talk
over before going any further.

This is an RFC for patches that do major changes to the way that slab
allocations are handled in order to introduce some more advanced features
and in order to get rid of some things that are no longer used or awkward.

A. Add Slab Defragmentation

On kmem_cache_shrink SLUB will not only sort the partial slabs by object
number but attempt to free objects out of partial slabs that have a low
number of objects. Doing so increases the object density in the remaining
partial slabs and frees up memory. Ideally kmem_cache_shrink would be
able to completely defrag the partial list so that only one partial
slab is left over. But it is advantageous to have slabs with a few free
objects since that speeds up kfree. Also going to the extreme on this one
would mean that the reclaimable slabs would have to be able to move objects
in a reliable way. So we just free objects from slabs with a low population
ratio and tolerate it if an attempt to move an object fails.

B. Targeted Reclaim

Mainly to support antifragmentation / defragmentation methods. The slab
allocator adds a new function, kmem_cache_vacate(struct page *), which can be
used to request that a page be cleared of all objects. This makes it possible
to reduce the size of the RECLAIMABLE fragmentation area and to move slabs
into the MOVABLE area, enhancing the capabilities of antifragmentation
significantly.

C. Introduces a slab_ops structure that allows a slab user to provide
   operations on slabs.

This replaces the current constructor / destructor scheme and is necessary
in order to support the additional methods needed for targeted reclaim and
slab defragmentation. A slab cache supporting targeted reclaim and slab
defragmentation must provide the following additional methods (a minimal
sketch follows the list):

	1. get_reference(void *)
		Get a reference on a particular slab object.

	2. kick_object(void *)
		Kick an object off a slab. The object is either reclaimed
		(easiest) or a new object is allocated using kmem_cache_alloc()
		and then the object is moved to the new location.
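
A minimal sketch of what such a slab_ops could look like for a hypothetical
refcounted object type (struct my_obj, its refcount field and the
my_obj_release() helper are made-up names, not part of these patches):

	struct my_obj {
		atomic_t refcount;
		/* ... payload ... */
	};

	/* Called with the slab lock held; must not do any slab operations. */
	static int my_obj_get_reference(void *p)
	{
		struct my_obj *o = p;

		/* Fails if a concurrent free already dropped the count to zero */
		return atomic_inc_not_zero(&o->refcount);
	}

	/* Called with no locks held; must get the object off the slab. */
	static void my_obj_kick_object(void *p)
	{
		struct my_obj *o = p;

		/*
		 * Easiest variant: tear the object down and free it,
		 * dropping the reference taken in get_reference().
		 */
		my_obj_release(o);
	}

	static struct slab_ops my_obj_slab_ops = {
		.get_reference	= my_obj_get_reference,
		.kick_object	= my_obj_kick_object,
	};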

D. Slab creation is no longer done using kmem_cache_create

kmem_cache_create is not a clean API: it has only two callbacks, for a
constructor and a destructor, and does not allow a slab_ops structure to be
specified. Its parameters are also confusing.

For example, it is possible to specify alignment information both in the
alignment field and in the flags field (SLAB_HWCACHE_ALIGN). The semantics
of SLAB_HWCACHE_ALIGN are fuzzy because it only aligns objects that are
larger than half a cache line.

All of this is really not necessary since the compiler knows how to align
structures and we should use this information instead of having the user
specify an alignment. I would like to get rid of SLAB_HWCACHE_ALIGN
and kmem_cache_create. Instead one would use the following macros (that
then result in a call to __kmem_cache_create).

	KMEM_CACHE(<struct-name>, flags)

The macro determines the slab name from the struct name and uses that for
/sys/slab, uses the size of the struct for the object size, and uses the
alignment of the structure for the slab alignment. This means one will be
able to set slab object alignment by specifying the usual alignment options
for static allocations when defining the structure.

Since the name is derived from the struct name it will be much easier to
find the source code for slabs listed in /sys/slab.

An additional macro is provided if the slab also supports slab operations.

	KMEM_CACHE_OPS(<struct-name>, flags, slab_ops)

It is likely that this macro will be rarely used.
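
For illustration, the conversions done later in this series (patch 3) end up
looking like this, with the dcache providing a dentry_slab_ops structure:

	filp_cachep = KMEM_CACHE(file, SLAB_PANIC);

	dentry_cache = KMEM_CACHE_OPS(dentry,
		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
		&dentry_slab_ops);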

E. kmem_cache_create() SLAB_HWCACHE_ALIGN legacy interface

In order to avoid having to modify all slab creation calls throughout
the kernel we will provide a kmem_cache_create emulation. That function
is the only call that will still understand SLAB_HWCACHE_ALIGN. If that
parameter is specified then it will set up the proper alignment (the slab
allocators never see that flag).

If constructor or destructor are specified then we will allocate a slab_ops
structure and populate it with the values specified. Note that this will
cause a memory leak if the slab is disposed of later. If you need disposable
slabs then the new API must be used.
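
As a sketch of what the emulation does, a legacy call like the first one
below (the "foo" cache and struct names are made up) ends up equivalent to
the second, assuming the object is larger than half a cache line so that
SLAB_HWCACHE_ALIGN actually took effect:

	/* Legacy form, still accepted by the emulation: */
	foo_cachep = kmem_cache_create("foo", sizeof(struct foo), 0,
			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL, NULL);

	/* What it resolves to (flag stripped, cacheline alignment set): */
	foo_cachep = __kmem_cache_create("foo", sizeof(struct foo),
			L1_CACHE_BYTES, SLAB_PANIC, NULL);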

F. Remove destructor support from all slab allocators?

I am only aware of two call sites left after all the changes that are
scheduled to go into 2.6.22-rc1 have been merged. These are in the FRV and sh
arch code. The one in FRV will go away if FRV switches to quicklists like
i386. That leaves the single user in sh, and a single user is no
justification for keeping destructors around.



-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC 1/3] SLUB: slab_ops instead of constructors / destructors
  2007-05-04 22:15 [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes clameter
@ 2007-05-04 22:15 ` clameter
  2007-05-05 10:14   ` Pekka Enberg
  2007-05-06 19:19   ` Bert Wesarg
  2007-05-04 22:15 ` [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation clameter
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 23+ messages in thread
From: clameter @ 2007-05-04 22:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dgc, Eric Dumazet, Mel Gorman

[-- Attachment #1: slabapic23 --]
[-- Type: text/plain, Size: 12188 bytes --]

This patch gets rid of constructors and destructors and replaces them
with a slab operations structure that is passed to SLUB.

For backward compatibility we provide a kmem_cache_create() emulation
that can construct a slab operations structure on the fly.

The new APIs to create slabs are:

Without any callbacks:

slabhandle = KMEM_CACHE(<struct>, <flags>)

Creates a slab cache based on the structure definition, using the structure's
alignment, size and name. This is cleaner because the name showing up
in /sys/slab/xxx will be the structure name, so one can search the source
for it. The usual alignment attributes on the struct control the slab
alignment.

Note: SLAB_HWCACHE_ALIGN is *not* supported as a flag. The flags do
*not* specify alignment. Alignment is specified on the structure itself and
nowhere else.
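
A hedged example of how alignment is requested under this scheme (struct
my_stats and my_stats_cachep are made-up names): the alignment attribute
goes on the type and KMEM_CACHE() picks it up via __alignof__:

	/* Made-up example type; alignment is attached to the type itself. */
	struct my_stats {
		unsigned long counter;
		spinlock_t lock;
	} ____cacheline_aligned_in_smp;

	static struct kmem_cache *my_stats_cachep;

	/* in an init function; KMEM_CACHE() uses __alignof__(struct my_stats) */
	my_stats_cachep = KMEM_CACHE(my_stats, 0);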

Create a slabcache with slab_ops (please use only for special slabs):

KMEM_CACHE_OPS(<struct>, <flags>, <slab_ops>)

Old kmem_cache_create() support:

kmem_cache_create alone still accepts the specification of SLAB_HWCACHE_ALIGN
*if* there is no other alignment specified. In that case kmem_cache_create
will generate a proper alignment depending on the size of the structure.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h     |   60 ++++++++++++++++++++++++++++++------
 include/linux/slub_def.h |    3 -
 mm/slub.c                |   77 +++++++++++++++++------------------------------
 3 files changed, 80 insertions(+), 60 deletions(-)

Index: slub/include/linux/slab.h
===================================================================
--- slub.orig/include/linux/slab.h	2007-05-03 20:53:00.000000000 -0700
+++ slub/include/linux/slab.h	2007-05-04 02:38:38.000000000 -0700
@@ -23,7 +23,6 @@ typedef struct kmem_cache kmem_cache_t _
 #define SLAB_DEBUG_FREE		0x00000100UL	/* DEBUG: Perform (expensive) checks on free */
 #define SLAB_RED_ZONE		0x00000400UL	/* DEBUG: Red zone objs in a cache */
 #define SLAB_POISON		0x00000800UL	/* DEBUG: Poison objects */
-#define SLAB_HWCACHE_ALIGN	0x00002000UL	/* Align objs on cache lines */
 #define SLAB_CACHE_DMA		0x00004000UL	/* Use GFP_DMA memory */
 #define SLAB_STORE_USER		0x00010000UL	/* DEBUG: Store the last owner for bug hunting */
 #define SLAB_RECLAIM_ACCOUNT	0x00020000UL	/* Objects are reclaimable */
@@ -32,19 +31,21 @@ typedef struct kmem_cache kmem_cache_t _
 #define SLAB_MEM_SPREAD		0x00100000UL	/* Spread some memory over cpuset */
 #define SLAB_TRACE		0x00200000UL	/* Trace allocations and frees */
 
-/* Flags passed to a constructor functions */
-#define SLAB_CTOR_CONSTRUCTOR	0x001UL		/* If not set, then deconstructor */
-
 /*
  * struct kmem_cache related prototypes
  */
 void __init kmem_cache_init(void);
 int slab_is_available(void);
 
-struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
-			unsigned long,
-			void (*)(void *, struct kmem_cache *, unsigned long),
-			void (*)(void *, struct kmem_cache *, unsigned long));
+struct slab_ops {
+	/* FIXME: ctor should only take the object as an argument. */
+	void (*ctor)(void *, struct kmem_cache *, unsigned long);
+	/* FIXME: Remove all destructors ? */
+	void (*dtor)(void *, struct kmem_cache *, unsigned long);
+};
+
+struct kmem_cache *__kmem_cache_create(const char *, size_t, size_t,
+	unsigned long, struct slab_ops *s);
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
@@ -62,9 +63,14 @@ int kmem_ptr_validate(struct kmem_cache 
  * f.e. add ____cacheline_aligned_in_smp to the struct declaration
  * then the objects will be properly aligned in SMP configurations.
  */
-#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
+#define KMEM_CACHE(__struct, __flags) __kmem_cache_create(#__struct,\
 		sizeof(struct __struct), __alignof__(struct __struct),\
-		(__flags), NULL, NULL)
+		(__flags), NULL)
+
+#define KMEM_CACHE_OPS(__struct, __flags, __ops) \
+	__kmem_cache_create(#__struct, sizeof(struct __struct), \
+	__alignof__(struct __struct), (__flags), (__ops))
+
 
 #ifdef CONFIG_NUMA
 extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
@@ -236,6 +242,40 @@ extern void *__kmalloc_node_track_caller
 extern const struct seq_operations slabinfo_op;
 ssize_t slabinfo_write(struct file *, const char __user *, size_t, loff_t *);
 
+/*
+ * Legacy functions
+ *
+ * The sole reason that these definitions are here is because of their
+ * frequent use. Remove when all call sites have been updated.
+ */
+#define SLAB_HWCACHE_ALIGN	0x8000000000UL
+#define SLAB_CTOR_CONSTRUCTOR	0x001UL
+
+static inline struct kmem_cache *kmem_cache_create(const char *s,
+		size_t size, size_t align, unsigned long flags,
+		void (*ctor)(void *, struct kmem_cache *, unsigned long),
+		void (*dtor)(void *, struct kmem_cache *, unsigned long))
+{
+	struct slab_ops *so = NULL;
+
+	if ((flags & SLAB_HWCACHE_ALIGN) && size > L1_CACHE_BYTES / 2) {
+		/* Clear the align flag. It is no longer supported */
+		flags &= ~SLAB_HWCACHE_ALIGN;
+
+		/* Do not allow conflicting alignment specifications */
+		BUG_ON(align);
+
+		/* And set the cacheline alignment */
+		align = L1_CACHE_BYTES;
+	}
+	if (ctor || dtor) {
+		so = kzalloc(sizeof(struct slab_ops), GFP_KERNEL);
+		so->ctor = ctor;
+		so->dtor = dtor;
+	}
+	return  __kmem_cache_create(s, size, align, flags, so);
+}
+
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SLAB_H */
 
Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c	2007-05-04 02:23:34.000000000 -0700
+++ slub/mm/slub.c	2007-05-04 02:40:13.000000000 -0700
@@ -209,6 +209,11 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
+struct slab_ops default_slab_ops = {
+	NULL,
+	NULL
+};
+
 /*
  * Object debugging
  */
@@ -809,8 +814,8 @@ static void setup_object(struct kmem_cac
 		init_tracking(s, object);
 	}
 
-	if (unlikely(s->ctor))
-		s->ctor(object, s, SLAB_CTOR_CONSTRUCTOR);
+	if (unlikely(s->slab_ops->ctor))
+		s->slab_ops->ctor(object, s, SLAB_CTOR_CONSTRUCTOR);
 }
 
 static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
@@ -867,16 +872,18 @@ out:
 static void __free_slab(struct kmem_cache *s, struct page *page)
 {
 	int pages = 1 << s->order;
+	void (*dtor)(void *, struct kmem_cache *, unsigned long) =
+		s->slab_ops->dtor;
 
-	if (unlikely(PageError(page) || s->dtor)) {
+	if (unlikely(PageError(page) || dtor)) {
 		void *start = page_address(page);
 		void *end = start + (pages << PAGE_SHIFT);
 		void *p;
 
 		slab_pad_check(s, page);
 		for (p = start; p <= end - s->size; p += s->size) {
-			if (s->dtor)
-				s->dtor(p, s, 0);
+			if (dtor)
+				dtor(p, s, 0);
 			check_object(s, page, p, 0);
 		}
 	}
@@ -1618,7 +1625,7 @@ static int calculate_sizes(struct kmem_c
 	 * then we should never poison the object itself.
 	 */
 	if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
-			!s->ctor && !s->dtor)
+			s->slab_ops->ctor && !s->slab_ops->dtor)
 		s->flags |= __OBJECT_POISON;
 	else
 		s->flags &= ~__OBJECT_POISON;
@@ -1647,7 +1654,7 @@ static int calculate_sizes(struct kmem_c
 	s->inuse = size;
 
 	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
-		s->ctor || s->dtor)) {
+		s->slab_ops->ctor || s->slab_ops->dtor)) {
 		/*
 		 * Relocate free pointer after the object if it is not
 		 * permitted to overwrite the first word of the object on
@@ -1731,13 +1738,11 @@ static int __init finish_bootstrap(void)
 static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
 		const char *name, size_t size,
 		size_t align, unsigned long flags,
-		void (*ctor)(void *, struct kmem_cache *, unsigned long),
-		void (*dtor)(void *, struct kmem_cache *, unsigned long))
+		struct slab_ops *slab_ops)
 {
 	memset(s, 0, kmem_size);
 	s->name = name;
-	s->ctor = ctor;
-	s->dtor = dtor;
+	s->slab_ops = slab_ops;
 	s->objsize = size;
 	s->flags = flags;
 	s->align = align;
@@ -1757,7 +1762,7 @@ static int kmem_cache_open(struct kmem_c
 	if (s->size >= 65535 * sizeof(void *)) {
 		BUG_ON(flags & (SLAB_RED_ZONE | SLAB_POISON |
 				SLAB_STORE_USER | SLAB_DESTROY_BY_RCU));
-		BUG_ON(ctor || dtor);
+		BUG_ON(slab_ops->ctor || slab_ops->dtor);
 	}
 	else
 		/*
@@ -1992,7 +1997,7 @@ static struct kmem_cache *create_kmalloc
 
 	down_write(&slub_lock);
 	if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN,
-			flags, NULL, NULL))
+			flags, &default_slab_ops))
 		goto panic;
 
 	list_add(&s->list, &slab_caches);
@@ -2313,23 +2318,21 @@ static int slab_unmergeable(struct kmem_
 	if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE))
 		return 1;
 
-	if (s->ctor || s->dtor)
+	if (s->slab_ops != &default_slab_ops)
 		return 1;
 
 	return 0;
 }
 
 static struct kmem_cache *find_mergeable(size_t size,
-		size_t align, unsigned long flags,
-		void (*ctor)(void *, struct kmem_cache *, unsigned long),
-		void (*dtor)(void *, struct kmem_cache *, unsigned long))
+		size_t align, unsigned long flags, struct slab_ops *slab_ops)
 {
 	struct list_head *h;
 
 	if (slub_nomerge || (flags & SLUB_NEVER_MERGE))
 		return NULL;
 
-	if (ctor || dtor)
+	if (slab_ops != &default_slab_ops)
 		return NULL;
 
 	size = ALIGN(size, sizeof(void *));
@@ -2364,15 +2367,17 @@ static struct kmem_cache *find_mergeable
 	return NULL;
 }
 
-struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+struct kmem_cache *__kmem_cache_create(const char *name, size_t size,
 		size_t align, unsigned long flags,
-		void (*ctor)(void *, struct kmem_cache *, unsigned long),
-		void (*dtor)(void *, struct kmem_cache *, unsigned long))
+		struct slab_ops *slab_ops)
 {
 	struct kmem_cache *s;
 
+	if (!slab_ops)
+		slab_ops = &default_slab_ops;
+
 	down_write(&slub_lock);
-	s = find_mergeable(size, align, flags, dtor, ctor);
+	s = find_mergeable(size, align, flags, slab_ops);
 	if (s) {
 		s->refcount++;
 		/*
@@ -2386,7 +2391,7 @@ struct kmem_cache *kmem_cache_create(con
 	} else {
 		s = kmalloc(kmem_size, GFP_KERNEL);
 		if (s && kmem_cache_open(s, GFP_KERNEL, name,
-				size, align, flags, ctor, dtor)) {
+				size, align, flags, slab_ops)) {
 			if (sysfs_slab_add(s)) {
 				kfree(s);
 				goto err;
@@ -2406,7 +2411,7 @@ err:
 		s = NULL;
 	return s;
 }
-EXPORT_SYMBOL(kmem_cache_create);
+EXPORT_SYMBOL(__kmem_cache_create);
 
 void *kmem_cache_zalloc(struct kmem_cache *s, gfp_t flags)
 {
@@ -2961,28 +2966,6 @@ static ssize_t order_show(struct kmem_ca
 }
 SLAB_ATTR_RO(order);
 
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
-{
-	if (s->ctor) {
-		int n = sprint_symbol(buf, (unsigned long)s->ctor);
-
-		return n + sprintf(buf + n, "\n");
-	}
-	return 0;
-}
-SLAB_ATTR_RO(ctor);
-
-static ssize_t dtor_show(struct kmem_cache *s, char *buf)
-{
-	if (s->dtor) {
-		int n = sprint_symbol(buf, (unsigned long)s->dtor);
-
-		return n + sprintf(buf + n, "\n");
-	}
-	return 0;
-}
-SLAB_ATTR_RO(dtor);
-
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
 	return sprintf(buf, "%d\n", s->refcount - 1);
@@ -3213,8 +3196,6 @@ static struct attribute * slab_attrs[] =
 	&slabs_attr.attr,
 	&partial_attr.attr,
 	&cpu_slabs_attr.attr,
-	&ctor_attr.attr,
-	&dtor_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
 	&sanity_checks_attr.attr,
Index: slub/include/linux/slub_def.h
===================================================================
--- slub.orig/include/linux/slub_def.h	2007-05-04 02:23:51.000000000 -0700
+++ slub/include/linux/slub_def.h	2007-05-04 02:24:27.000000000 -0700
@@ -39,8 +39,7 @@ struct kmem_cache {
 	/* Allocation and freeing of slabs */
 	int objects;		/* Number of objects in slab */
 	int refcount;		/* Refcount for slab cache destroy */
-	void (*ctor)(void *, struct kmem_cache *, unsigned long);
-	void (*dtor)(void *, struct kmem_cache *, unsigned long);
+	struct slab_ops *slab_ops;
 	int inuse;		/* Offset to metadata */
 	int align;		/* Alignment */
 	const char *name;	/* Name (only for display!) */

-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-04 22:15 [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes clameter
  2007-05-04 22:15 ` [RFC 1/3] SLUB: slab_ops instead of constructors / destructors clameter
@ 2007-05-04 22:15 ` clameter
  2007-05-04 23:03   ` Christoph Lameter
                     ` (3 more replies)
  2007-05-04 22:15 ` [RFC 3/3] Support targeted reclaim and slab defrag for dentry cache clameter
  2007-05-05  5:07 ` [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes Eric Dumazet
  3 siblings, 4 replies; 23+ messages in thread
From: clameter @ 2007-05-04 22:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dgc, Eric Dumazet, Mel Gorman

[-- Attachment #1: reclaim_callback --]
[-- Type: text/plain, Size: 9728 bytes --]

Targeted reclaim allows a single slab page to be targeted for reclaim. This
is done by calling

kmem_cache_vacate(page);

It will return 1 on success and 0 if the operation failed.
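
A hedged sketch of the intended caller (the surrounding context, e.g. the
antifragmentation code scanning a pageblock for slab pages, is an
assumption):

	/*
	 * Try to empty one candidate slab page so it can be taken
	 * out of the RECLAIMABLE area. kmem_cache_vacate() itself
	 * verifies that the page is still a slab page.
	 */
	if (kmem_cache_vacate(page)) {
		/* all objects gone; page is free or about to be freed */
	}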

The vacate functionality is also used for slab shrinking. During the shrink
operation SLUB will generate a list of partial slabs sorted by the number of
objects in use.

We take pages off the tail of that list that are filled to less than a
quarter. These slabs are then emptied using the vacate mechanism.

In order for a slab cache to support this functionality two functions must
be defined via slab_ops:

1. get_reference(void *)

Must obtain a reference to the object if it has not been freed yet. It is up
to the slab cache to resolve the race. SLUB guarantees that the object is
still allocated. However, another thread may be blocked in slab_free
attempting to free the same object. It may succeed as soon as
get_reference() returns to the slab allocator.

get_reference() must recognize this situation (i.e. check the refcount for
zero) and fail in that case (no problem, since the object will be freed as
soon as we drop the slab lock, before the kick calls are made).

No slab operations may be performed in get_reference(): the slab lock for
the page containing the object is held, so any slab operation may lead to
a deadlock.

2. kick_object(void *)

After SLUB has established references to the remaining objects in a slab it
will drop all locks and then use kick_object on each of the objects for which
we obtained a reference. The existence of the objects is guaranteed by
virtue of the earlier obtained reference. The callback may perform any
slab operation since no locks are held at the time of call.

The callback should remove the object from the slab in some way. This may
be accomplished by reclaiming the object and then calling kmem_cache_free(),
or by allocating a new object, moving the contents over and then freeing the
old object with kmem_cache_free(). Reallocation is advantageous at this
point because the new object will come from one of the partial slabs with
the most objects, since we have just finished sorting them for slab
shrinking.
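
A sketch of the reallocate-and-move variant for a hypothetical object type
(my_obj_cachep, my_obj_relink() and my_obj_put() are made-up names; a real
implementation must repoint every user of the old object under its own
locking and reinitialize the new object's refcount):

	static void my_obj_kick_object(void *private)
	{
		struct my_obj *old = private, *new;

		new = kmem_cache_alloc(my_obj_cachep, GFP_KERNEL);
		if (new) {
			memcpy(new, old, sizeof(*new));
			my_obj_relink(old, new);	/* repoint all users */
		}
		/*
		 * Drop the reference taken in get_reference(); the last
		 * put frees the old object via kmem_cache_free().
		 */
		my_obj_put(old);
	}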

NOTE: This patch is for conceptual review. I'd appreciate any feedback,
especially on the locking approach taken here. It will be critical to
resolve the locking issues for this approach to become feasible.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h |    3 
 mm/slub.c            |  159 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 154 insertions(+), 8 deletions(-)

Index: slub/include/linux/slab.h
===================================================================
--- slub.orig/include/linux/slab.h	2007-05-04 13:32:34.000000000 -0700
+++ slub/include/linux/slab.h	2007-05-04 13:32:50.000000000 -0700
@@ -42,6 +42,8 @@ struct slab_ops {
 	void (*ctor)(void *, struct kmem_cache *, unsigned long);
 	/* FIXME: Remove all destructors ? */
 	void (*dtor)(void *, struct kmem_cache *, unsigned long);
+	int (*get_reference)(void *);
+	void (*kick_object)(void *);
 };
 
 struct kmem_cache *__kmem_cache_create(const char *, size_t, size_t,
@@ -54,6 +56,7 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+int kmem_cache_vacate(struct page *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c	2007-05-04 13:32:34.000000000 -0700
+++ slub/mm/slub.c	2007-05-04 13:56:25.000000000 -0700
@@ -173,7 +173,7 @@ static struct notifier_block slab_notifi
 static enum {
 	DOWN,		/* No slab functionality available */
 	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
-	UP,		/* Everything works */
+	UP,		/* Everything works but does not show up in sysfs */
 	SYSFS		/* Sysfs up */
 } slab_state = DOWN;
 
@@ -211,6 +211,8 @@ static inline struct kmem_cache_node *ge
 
 struct slab_ops default_slab_ops = {
 	NULL,
+	NULL,
+	NULL,
 	NULL
 };
 
@@ -839,13 +841,10 @@ static struct page *new_slab(struct kmem
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
+
 	page->offset = s->offset / sizeof(void *);
 	page->slab = s;
-	page->flags |= 1 << PG_slab;
-	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
-			SLAB_STORE_USER | SLAB_TRACE))
-		page->flags |= 1 << PG_error;
-
+	page->inuse = 0;
 	start = page_address(page);
 	end = start + s->objects * s->size;
 
@@ -862,7 +861,17 @@ static struct page *new_slab(struct kmem
 	set_freepointer(s, last, NULL);
 
 	page->freelist = start;
-	page->inuse = 0;
+
+	/*
+	 * page->inuse must be visible when PageSlab(page) becomes
+	 * true for targeted reclaim
+	 */
+	smp_wmb();
+	page->flags |= 1 << PG_slab;
+	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
+			SLAB_STORE_USER | SLAB_TRACE))
+		page->flags |= 1 << PG_error;
+
 out:
 	if (flags & __GFP_WAIT)
 		local_irq_disable();
@@ -2124,6 +2133,111 @@ void kfree(const void *x)
 EXPORT_SYMBOL(kfree);
 
 /*
+ * Vacate all objects in the given slab. Slab must be locked.
+ *
+ * Will drop, reacquire and finally drop the slab lock.
+ * The slab must be marked PageActive() to prevent a concurrent slab_free
+ * from removing the slab from the lists. At the end the slab will either
+ * be freed or have been returned to the partial lists.
+ *
+ * Return error code or number of remaining objects
+ */
+static int __kmem_cache_vacate(struct kmem_cache *s, struct page *page)
+{
+	void *p;
+	void *addr = page_address(page);
+	unsigned long map[BITS_TO_LONGS(s->objects)];
+	int leftover;
+
+	if (!page->inuse)
+		return 0;
+
+	/* Determine free objects */
+	bitmap_zero(map, s->objects);
+	for(p = page->freelist; p; p = get_freepointer(s, p))
+		set_bit((p - addr) / s->size, map);
+
+	/*
+	 * Get a refcount for all used objects. If that fails then
+	 * no KICK callback can be performed.
+	 */
+	for(p = addr; p < addr + s->objects * s->size; p += s->size)
+		if (!test_bit((p - addr) / s->size, map))
+			if (!s->slab_ops->get_reference(p))
+				set_bit((p - addr) / s->size, map);
+
+	/* Got all the references we need. Now we can drop the slab lock */
+	slab_unlock(page);
+
+	/* Perform the KICK callbacks to remove the objects */
+	for(p = addr; p < addr + s->objects * s->size; p += s->size)
+		if (!test_bit((p - addr) / s->size, map))
+			s->slab_ops->kick_object(p);
+
+	slab_lock(page);
+	leftover = page->inuse;
+	ClearPageActive(page);
+	putback_slab(s, page);
+	return leftover;
+}
+
+/*
+ * Remove a page from the lists. Must be holding slab lock.
+ */
+static void remove_from_lists(struct kmem_cache *s, struct page *page)
+{
+	if (page->inuse < s->objects)
+		remove_partial(s, page);
+	else if (s->flags & SLAB_STORE_USER)
+		remove_full(s, page);
+}
+
+/*
+ * Attempt to free objects in a page. Return 1 when successful.
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	struct kmem_cache *s;
+	int rc = 0;
+
+	/* Get a reference to the page. Return if its freed or being freed */
+	if (!get_page_unless_zero(page))
+		return 0;
+
+	/* Check that this is truly a slab page */
+	if (!PageSlab(page))
+		goto out;
+
+	slab_lock(page);
+
+	/*
+	 * We may now have locked a page that is in various stages of being
+	 * freed. If the PageSlab bit is off then we have already reached
+	 * the page allocator. If page->inuse is zero then we are
+	 * in SLUB but freeing or allocating the page.
+	 * page->inuse is never modified without the slab lock held.
+	 *
+	 * Also abort if the page happens to be a per cpu slab
+	 */
+	if (!PageSlab(page) || PageActive(page) || !page->inuse) {
+		slab_unlock(page);
+		goto out;
+	}
+
+	/*
+	 * We are holding a lock on a slab page that is not in the
+	 * process of being allocated or freed.
+	 */
+	s = page->slab;
+	remove_from_lists(s, page);
+	SetPageActive(page);
+	rc = __kmem_cache_vacate(s, page) == 0;
+out:
+	put_page(page);
+	return rc;
+}
+
+/*
  *  kmem_cache_shrink removes empty slabs from the partial lists
  *  and then sorts the partially allocated slabs by the number
  *  of items in use. The slabs with the most items in use
@@ -2137,11 +2251,12 @@ int kmem_cache_shrink(struct kmem_cache 
 	int node;
 	int i;
 	struct kmem_cache_node *n;
-	struct page *page;
+	struct page *page, *page2;
 	struct page *t;
 	struct list_head *slabs_by_inuse =
 		kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
 	unsigned long flags;
+	LIST_HEAD(zaplist);
 
 	if (!slabs_by_inuse)
 		return -ENOMEM;
@@ -2194,8 +2309,36 @@ int kmem_cache_shrink(struct kmem_cache 
 		for (i = s->objects - 1; i >= 0; i--)
 			list_splice(slabs_by_inuse + i, n->partial.prev);
 
+		if (!s->slab_ops->get_reference || !s->slab_ops->kick_object)
+			goto out;
+
+		/* Take objects with just a few objects off the tail */
+		while (n->nr_partial > MAX_PARTIAL) {
+			page = container_of(n->partial.prev, struct page, lru);
+
+			/*
+			 * We are holding the list_lock so we can only
+			 * trylock the slab
+			 */
+			if (!slab_trylock(page))
+				break;
+
+			if (page->inuse > s->objects / 4)
+				break;
+
+			list_move(&page->lru, &zaplist);
+			n->nr_partial--;
+			SetPageActive(page);
+			slab_unlock(page);
+		}
 	out:
 		spin_unlock_irqrestore(&n->list_lock, flags);
+
+		/* Now we can free objects in the slabs on the zaplist */
+		list_for_each_entry_safe(page, page2, &zaplist, lru) {
+			slab_lock(page);
+			__kmem_cache_vacate(s, page);
+		}
 	}
 
 	kfree(slabs_by_inuse);

-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC 3/3] Support targeted reclaim and slab defrag for dentry cache
  2007-05-04 22:15 [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes clameter
  2007-05-04 22:15 ` [RFC 1/3] SLUB: slab_ops instead of constructors / destructors clameter
  2007-05-04 22:15 ` [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation clameter
@ 2007-05-04 22:15 ` clameter
  2007-05-05  5:07 ` [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes Eric Dumazet
  3 siblings, 0 replies; 23+ messages in thread
From: clameter @ 2007-05-04 22:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dgc, Eric Dumazet, Mel Gorman

[-- Attachment #1: dcache_targetd_reclaim --]
[-- Type: text/plain, Size: 3670 bytes --]

This is an experimental patch for locking review only. I am not that
familiar with dentry cache locking.

We set up the dentry cache a bit differently using the new APIs and
define get_reference() and kick_object() functions for it.

get_dentry_reference() simply works by incrementing the dentry refcount
if it is not already zero. If it is zero then the slab called us while
another processor is in the process of freeing the object. The other
processor will finish that free as soon as we return from this call, so
we have to fail.

kick_dentry_object() is called after get_dentry_reference() has
been used and after the slab has dropped all of its own locks.
We try to use dentry pruning here; hopefully that is correct.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/dcache.c        |   48 +++++++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |    2 +-
 2 files changed, 40 insertions(+), 10 deletions(-)

Index: slub/fs/dcache.c
===================================================================
--- slub.orig/fs/dcache.c	2007-05-04 13:32:15.000000000 -0700
+++ slub/fs/dcache.c	2007-05-04 13:55:39.000000000 -0700
@@ -2133,17 +2133,48 @@ static void __init dcache_init_early(voi
 		INIT_HLIST_HEAD(&dentry_hashtable[loop]);
 }
 
+/*
+ * The slab is holding locks on the current slab. We can just
+ * get a reference
+ */
+int get_dentry_reference(void *private)
+{
+	struct dentry *dentry = private;
+
+	return atomic_inc_not_zero(&dentry->d_count);
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the
+ * refcount we obtained earlier and also rid of the
+ * object.
+ */
+void kick_dentry_object(void *private)
+{
+	struct dentry *dentry = private;
+
+	spin_lock(&dentry->d_lock);
+	if (atomic_read(&dentry->d_count) > 1) {
+		spin_unlock(&dentry->d_lock);
+		dput(dentry);
+	}
+	spin_lock(&dcache_lock);
+	prune_one_dentry(dentry, 1);
+	spin_unlock(&dcache_lock);
+}
+
+struct slab_ops dentry_slab_ops = {
+	.get_reference = get_dentry_reference,
+	.kick_object = kick_dentry_object
+};
+
 static void __init dcache_init(unsigned long mempages)
 {
 	int loop;
 
-	/* 
-	 * A constructor could be added for stable state like the lists,
-	 * but it is probably not worth it because of the cache nature
-	 * of the dcache. 
-	 */
-	dentry_cache = KMEM_CACHE(dentry,
-		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+	dentry_cache = KMEM_CACHE_OPS(dentry,
+		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+		&dentry_slab_ops);
 
 	register_shrinker(&dcache_shrinker);
 
@@ -2192,8 +2223,7 @@ void __init vfs_caches_init(unsigned lon
 	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 
-	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
+	filp_cachep = KMEM_CACHE(file, SLAB_PANIC);
 
 	dcache_init(mempages);
 	inode_init(mempages);
Index: slub/include/linux/fs.h
===================================================================
--- slub.orig/include/linux/fs.h	2007-05-04 13:32:15.000000000 -0700
+++ slub/include/linux/fs.h	2007-05-04 13:55:39.000000000 -0700
@@ -785,7 +785,7 @@ struct file {
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
-};
+} ____cacheline_aligned;
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
 #define file_list_unlock() spin_unlock(&files_lock);

-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-04 22:15 ` [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation clameter
@ 2007-05-04 23:03   ` Christoph Lameter
  2007-05-05  1:04     ` Randy Dunlap
  2007-05-05  5:32   ` William Lee Irwin III
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2007-05-04 23:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dgc, Eric Dumazet, Mel Gorman

Fixes suggested by Andrew

---
 include/linux/slab.h |   12 ++++++++++++
 mm/slub.c            |   32 +++++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c	2007-05-04 15:52:54.000000000 -0700
+++ slub/mm/slub.c	2007-05-04 15:53:11.000000000 -0700
@@ -2142,42 +2142,46 @@ EXPORT_SYMBOL(kfree);
  *
  * Return error code or number of remaining objects
  */
-static int __kmem_cache_vacate(struct kmem_cache *s, struct page *page)
+static int __kmem_cache_vacate(struct kmem_cache *s,
+		struct page *page, unsigned long flags)
 {
 	void *p;
 	void *addr = page_address(page);
-	unsigned long map[BITS_TO_LONGS(s->objects)];
+	DECLARE_BITMAP(map, s->objects);
 	int leftover;
 
 	if (!page->inuse)
 		return 0;
 
 	/* Determine free objects */
-	bitmap_zero(map, s->objects);
-	for(p = page->freelist; p; p = get_freepointer(s, p))
-		set_bit((p - addr) / s->size, map);
+	bitmap_fill(map, s->objects);
+	for (p = page->freelist; p; p = get_freepointer(s, p))
+		__clear_bit((p - addr) / s->size, map);
 
 	/*
 	 * Get a refcount for all used objects. If that fails then
 	 * no KICK callback can be performed.
 	 */
-	for(p = addr; p < addr + s->objects * s->size; p += s->size)
-		if (!test_bit((p - addr) / s->size, map))
+	for (p = addr; p < addr + s->objects * s->size; p += s->size)
+		if (test_bit((p - addr) / s->size, map))
 			if (!s->slab_ops->get_reference(p))
-				set_bit((p - addr) / s->size, map);
+				__clear_bit((p - addr) / s->size, map);
 
 	/* Got all the references we need. Now we can drop the slab lock */
 	slab_unlock(page);
+	local_irq_restore(flags);
 
 	/* Perform the KICK callbacks to remove the objects */
 	for(p = addr; p < addr + s->objects * s->size; p += s->size)
-		if (!test_bit((p - addr) / s->size, map))
+		if (test_bit((p - addr) / s->size, map))
 			s->slab_ops->kick_object(p);
 
+	local_irq_save(flags);
 	slab_lock(page);
 	leftover = page->inuse;
 	ClearPageActive(page);
 	putback_slab(s, page);
+	local_irq_restore(flags);
 	return leftover;
 }
 
@@ -2197,6 +2201,7 @@ static void remove_from_lists(struct kme
  */
 int kmem_cache_vacate(struct page *page)
 {
+	unsigned long flags;
 	struct kmem_cache *s;
 	int rc = 0;
 
@@ -2208,6 +2213,7 @@ int kmem_cache_vacate(struct page *page)
 	if (!PageSlab(page))
 		goto out;
 
+	local_irq_save(flags);
 	slab_lock(page);
 
 	/*
@@ -2221,6 +2227,7 @@ int kmem_cache_vacate(struct page *page)
 	 */
 	if (!PageSlab(page) || PageActive(page) || !page->inuse) {
 		slab_unlock(page);
+		local_irq_restore(flags);
 		goto out;
 	}
 
@@ -2231,7 +2238,7 @@ int kmem_cache_vacate(struct page *page)
 	s = page->slab;
 	remove_from_lists(s, page);
 	SetPageActive(page);
-	rc = __kmem_cache_vacate(s, page) == 0;
+	rc = __kmem_cache_vacate(s, page, flags) == 0;
 out:
 	put_page(page);
 	return rc;
@@ -2336,8 +2343,11 @@ int kmem_cache_shrink(struct kmem_cache 
 
 		/* Now we can free objects in the slabs on the zaplist */
 		list_for_each_entry_safe(page, page2, &zaplist, lru) {
+			unsigned long flags;
+
+			local_irq_save(flags);
 			slab_lock(page);
-			__kmem_cache_vacate(s, page);
+			__kmem_cache_vacate(s, page, flags);
 		}
 	}
 
Index: slub/include/linux/slab.h
===================================================================
--- slub.orig/include/linux/slab.h	2007-05-04 15:53:06.000000000 -0700
+++ slub/include/linux/slab.h	2007-05-04 15:53:17.000000000 -0700
@@ -42,7 +42,19 @@ struct slab_ops {
 	void (*ctor)(void *, struct kmem_cache *, unsigned long);
 	/* FIXME: Remove all destructors ? */
 	void (*dtor)(void *, struct kmem_cache *, unsigned long);
+	/*
+	 * Called with slab lock held and interrupts disabled.
+	 * No slab operations may be performed in get_reference
+	 *
+	 * Must return 1 if a reference was obtained.
+	 * 0 if we failed to obtain the reference (f.e.
+	 * the object is concurrently freed).
+	 */
 	int (*get_reference)(void *);
+	/*
+	 * Called with no locks held and interrupts enabled.
+	 * Any operation may be performed in kick_object.
+	 */
 	void (*kick_object)(void *);
 };
 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-04 23:03   ` Christoph Lameter
@ 2007-05-05  1:04     ` Randy Dunlap
  2007-05-05  1:07       ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Randy Dunlap @ 2007-05-05  1:04 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On Fri, 4 May 2007 16:03:43 -0700 (PDT) Christoph Lameter wrote:

> Fixes suggested by Andrew
> 
> ---
>  include/linux/slab.h |   12 ++++++++++++
>  mm/slub.c            |   32 +++++++++++++++++++++-----------
>  2 files changed, 33 insertions(+), 11 deletions(-)
> 
>  	/* Perform the KICK callbacks to remove the objects */
>  	for(p = addr; p < addr + s->objects * s->size; p += s->size)

missed a space after "for".

> -		if (!test_bit((p - addr) / s->size, map))
> +		if (test_bit((p - addr) / s->size, map))
>  			s->slab_ops->kick_object(p);

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-05  1:04     ` Randy Dunlap
@ 2007-05-05  1:07       ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-05-05  1:07 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On Fri, 4 May 2007, Randy Dunlap wrote:

> >  	/* Perform the KICK callbacks to remove the objects */
> >  	for(p = addr; p < addr + s->objects * s->size; p += s->size)
> 
> missed a space after "for".

Thanks but I was more hoping for a higher level of review. Locking????


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes
  2007-05-04 22:15 [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes clameter
                   ` (2 preceding siblings ...)
  2007-05-04 22:15 ` [RFC 3/3] Support targeted reclaim and slab defrag for dentry cache clameter
@ 2007-05-05  5:07 ` Eric Dumazet
  2007-05-05  5:14   ` Christoph Lameter
  3 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2007-05-05  5:07 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, dgc, Mel Gorman

clameter@sgi.com wrote:
> I originally intended this for the 2.6.23 development cycle but since there
> is an aggressive push for SLUB I thought that we may want to introduce this earlier.
> Note that this covers new locking approaches that we may need to talk
> over before going any further.
> 
> This is an RFC for patches that do major changes to the way that slab
> allocations are handled in order to introduce some more advanced features
> and in order to get rid of some things that are no longer used or awkward.
> 
> A. Add Slab fragmentation
> 
> On kmem_cache_shrink SLUB will not only sort the partial slabs by object
> number but attempt to free objects out of partial slabs that have a low
> number of objects. Doing so increases the object density in the remaining
> partial slabs and frees up memory. Ideally kmem_cache_shrink would be
> able to completely defrag the partial list so that only one partial
> slab is left over. But it is advantageous to have slabs with a few free
> objects since that speeds up kfree. Also going to the extreme on this one
> would mean that the reclaimable slabs would have to be able to move objects
> in a reliable way. So we just free objects in slabs with a low population ratio
> and tolerate if a attempt to move an object fails.

nice idea

> 
> B. Targeted Reclaim
> 
> Mainly to support antifragmentation / defragmentation methods. The slab adds
> a new function kmem_cache_vacate(struct page *) which can be used to request
> that a page be cleared of all objects. This makes it possible to reduce the
> size of the RECLAIMABLE fragmentation area and move slabs into the MOVABLE
> area enhancing the capabilities of antifragmentation significantly.
> 
> C. Introduces a slab_ops structure that allows a slab user to provide
>    operations on slabs.

Could you please make it const ?

> 
> This replaces the current constructor / destructor scheme. It is necessary
> in order to support additional methods needed to support targeted reclaim
> and slab defragmentation. A slab supporting targeted reclaim and
> slab defragmentation must support the following additional methods:
> 
> 	1. get_reference(void *)
> 		Get a reference on a particular slab object.
> 
> 	2. kick_object(void *)
> 		Kick an object off a slab. The object is either reclaimed
> 		(easiest) or a new object is alloced using kmem_cache_alloc()
> 		and then the object is moved to the new location.
> 
> D. Slab creation is no longer done using kmem_cache_create
> 
> kmem_cache_create is not a clean API since it has only 2 call backs for
> constructor and destructor, does not allow the specification of a slab ops
> structure. Parameters are confusing.
> 
> F.e. It is possible to specify alignment information in the alignment
> field and in addition in the flags field (SLAB_HWCACHE_ALIGN). The semantics
> of SLAB_HWCACHE_ALIGN are fuzzy because it only aligns object if
> larger than 1/2 cache line.
> 
> All of this is really not necessary since the compiler knows how to align
> structures and we should use this information instead of having the user
> specify an alignment. I would like to get rid of SLAB_HWCACHE_ALIGN
> and kmem_cache_create. Instead one would use the following macros (that
> then result in a call to __kmem_cache_create).

Hmm, the problem is that the compiler sometimes does not know the target
processor's alignment.

Adding ____cacheline_aligned to 'struct ...' definitions might be overkill if
you compile a generic kernel and happen to boot it on a Pentium III.


> 
> 	KMEM_CACHE(<struct-name>, flags)
> 
> The macro will determine the slab name from the struct name and use that for
> /sys/slab, will use the size of the struct for slab size and the alignment
> of the structure for alignment. This means one will be able to set slab
> object alignment by specifying the usual alignment options for static
> allocations when defining the structure.
> 
> Since the name is derived from the struct name it will much easier to
> find the source code for slabs listed in /sys/slab.
> 
> An additional macro is provided if the slab also supports slab operations.
> 
> 	KMEM_CACHE_OPS(<struct-name>, flags, slab_ops)
> 
> It is likely that this macro will be rarely used.
> 
> E. kmem_cache_create() SLAB_HWCACHE_ALIGN legacy interface
> 
> In order to avoid having to modify all slab creation calls throughout
> the kernel we will provide a kmem_cache_create emulation. That function
> is the only call that will still understand SLAB_HWCACHE_ALIGN. If that
> parameter is specified then it will set up the proper alignment (the slab
> allocators never see that flag).
> 
> If constructor or destructor are specified then we will allocate a slab_ops
> structure and populate it with the values specified. Note that this will
> cause a memory leak if the slab is disposed of later. If you need disposable
> slabs then the new API must be used.
> 
> F. Remove destructor support from all slab allocators?
> 
> I am only aware of two call sites left after all the changes that are
> scheduled to go into 2.6.22-rc1 have been merged. These are in FRV and sh
> arch code. The one in FRV will go away if they switch to quicklists like
> i386. Sh contains another use but a single user is no justification for keeping
> destructors around.
> 
> 
> 

G. Being able to track the number of pages in a kmem_cache


If you look at fs/buffer.c, you'll notice the bh_accounting and
recalc_bh_state() machinery, which might be overkill for large SMP
configurations when the real concern is simply to keep bh's from exceeding
10% of LOWMEM.

Adding a callback in slab_ops to track total number of pages in use by a given 
kmem_cache would be good.

The same goes for the nr_files logic in fs/file_table.c: the
percpu_counter_dec()/percpu_counter_inc() on every file open/close could be
simplified if we could just count the pages in use by the filp_cachep
kmem_cache. The get_nr_files() thing is not worth the pain.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes
  2007-05-05  5:07 ` [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes Eric Dumazet
@ 2007-05-05  5:14   ` Christoph Lameter
  2007-05-05  5:41     ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2007-05-05  5:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, linux-mm, dgc, Mel Gorman

On Sat, 5 May 2007, Eric Dumazet wrote:

> > C. Introduces a slab_ops structure that allows a slab user to provide
> >    operations on slabs.
> 
> Could you please make it const ?

Sure. Done.

> > All of this is really not necessary since the compiler knows how to align
> > structures and we should use this information instead of having the user
> > specify an alignment. I would like to get rid of SLAB_HWCACHE_ALIGN
> > and kmem_cache_create. Instead one would use the following macros (that
> > then result in a call to __kmem_cache_create).
> 
> Hum, the problem is the compiler sometimes doesnt know the target processor
> alignment.
> 
> Adding ____cacheline_aligned to 'struct ...' definitions might be overkill if
> you compile a generic kernel and happens to boot a Pentium III with it.

Then add ____cacheline_aligned_in_smp or specify the alignment in one of the
various other ways that exist. In practice most slabs specify
SLAB_HWCACHE_ALIGN, so most slabs are cache aligned today.

> G. Being able to track the number of pages in a kmem_cache
> 
> 
> If you look at fs/buffer.c, you'll notice the bh_accounting, recalc_bh_state()
> that might be overkill for large SMP configurations, when the real concern is
> to be able to limit the bh's not to exceed 10% of LOWMEM.
> 
> Adding a callback in slab_ops to track total number of pages in use by a given
> kmem_cache would be good.

Such functionality exists internal to SLUB and in the reporting tool. 
I can export that function if you need it.

> Same thing for fs/file_table.c : nr_file logic
> (percpu_counter_dec()/percpu_counter_inc() for each file open/close) could be
> simplified if we could just count the pages in use by filp_cachep kmem_cache.
> The get_nr_files() thing is not worth the pain.

Sure. What exactly do you want? The absolute number of pages of memory 
that the slab is using?

	kmem_cache_pages_in_use(struct kmem_cache *) ?

The call will not be too lightweight since we will have to loop over all
nodes and add up the per-node counters of allocated slabs.
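
Only a sketch against the SLUB structures touched by these patches (the
helper does not exist yet; the name and the exact accounting are
assumptions):

	unsigned long kmem_cache_pages_in_use(struct kmem_cache *s)
	{
		unsigned long pages = 0;
		int node;

		for_each_online_node(node) {
			struct kmem_cache_node *n = get_node(s, node);

			if (n)
				pages += atomic_long_read(&n->nr_slabs)
							<< s->order;
		}
		return pages;
	}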


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-04 22:15 ` [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation clameter
  2007-05-04 23:03   ` Christoph Lameter
@ 2007-05-05  5:32   ` William Lee Irwin III
  2007-05-05 15:35     ` Christoph Lameter
  2007-05-05 10:38   ` Andi Kleen
  2007-05-09 15:05   ` Mel Gorman
  3 siblings, 1 reply; 23+ messages in thread
From: William Lee Irwin III @ 2007-05-05  5:32 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On Fri, May 04, 2007 at 03:15:57PM -0700, clameter@sgi.com wrote:
> 2. kick_object(void *)
> After SLUB has established references to the remaining objects in a slab it
> will drop all locks and then use kick_object on each of the objects for which
> we obtained a reference. The existence of the objects is guaranteed by
> virtue of the earlier obtained reference. The callback may perform any
> slab operation since no locks are held at the time of call.
> The callback should remove the object from the slab in some way. This may
> be accomplished by reclaiming the object and then running kmem_cache_free()
> or reallocating it and then running kmem_cache_free(). Reallocation
> is advantageous at this point because it will then allocate from the partial
> slabs with the most objects because we have just finished slab shrinking.
> NOTE: This patch is for conceptual review. I'd appreciate any feedback
> especially on the locking approach taken here. It will be critical to
> resolve the locking issue for this approach to become feasable.
> Signed-off-by: Christoph Lameter <clameter@sgi.com>

kick_object() doesn't return an indicator of success, which might be
helpful for determining whether an object was successfully removed. The
later-added kick_dentry_object(), for instance, can't remove dentries
where reference counts are still held.

I suppose one could check to see if the ->inuse counter decreased, too.

In either event, it would probably be helpful to abort the operation if
there was a reclamation failure for an object within the slab.

This is a relatively minor optimization concern. I think this patch
series is great and a significant foray into the problem of slab
reclaim vs. fragmentation.


-- wli


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes
  2007-05-05  5:14   ` Christoph Lameter
@ 2007-05-05  5:41     ` Eric Dumazet
  2007-05-05  7:37       ` Eric Dumazet
  2007-05-05 15:39       ` Christoph Lameter
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2007-05-05  5:41 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, dgc, Mel Gorman

Christoph Lameter wrote:
> On Sat, 5 May 2007, Eric Dumazet wrote:
> 
>>> C. Introduces a slab_ops structure that allows a slab user to provide
>>>    operations on slabs.
>> Could you please make it const ?
> 
> Sure. Done.

thanks :)

> 
>>> All of this is really not necessary since the compiler knows how to align
>>> structures and we should use this information instead of having the user
>>> specify an alignment. I would like to get rid of SLAB_HWCACHE_ALIGN
>>> and kmem_cache_create. Instead one would use the following macros (that
>>> then result in a call to __kmem_cache_create).
>> Hum, the problem is the compiler sometimes doesnt know the target processor
>> alignment.
>>
>> Adding ____cacheline_aligned to 'struct ...' definitions might be overkill if
>> you compile a generic kernel and happens to boot a Pentium III with it.
> 
> Then add ___cacheline_aligned_in_smp or specify the alignment in the 
> various other ways that exist. Practice is that most slabs specify 
> SLAB_HWCACHE_ALIGN. So most slabs are cache aligned today.

Yes, but this alignment is dynamic, not known at compile time.

include/asm-i386/processor.h:739:#define cache_line_size() (boot_cpu_data.x86_cache_alignment)

So adding ____cacheline_aligned to 'struct file', for example, would be a
regression for people with a PII or PIII.

> 
>> G. Being able to track the number of pages in a kmem_cache
>>
>>
>> If you look at fs/buffer.c, you'll notice the bh_accounting, recalc_bh_state()
>> that might be overkill for large SMP configurations, when the real concern is
>> to be able to limit the bh's not to exceed 10% of LOWMEM.
>>
>> Adding a callback in slab_ops to track total number of pages in use by a given
>> kmem_cache would be good.
> 
> Such functionality exists internal to SLUB and in the reporting tool. 
> I can export that function if you need it.
> 
>> Same thing for fs/file_table.c : nr_file logic
>> (percpu_counter_dec()/percpu_counter_inc() for each file open/close) could be
>> simplified if we could just count the pages in use by filp_cachep kmem_cache.
>> The get_nr_files() thing is not worth the pain.
> 
> Sure. What exactly do you want? The absolute number of pages of memory 
> that the slab is using?
> 
> 	kmem_cache_pages_in_use(struct kmem_cache *) ?
> 
> The call will not be too lightweight since we will have to loop over all 
> nodes and add the counters in each per node struct for allocates slabs.
> 
> 

On a typical system, the number of pages used by the 'filp' kmem_cache tends
to be stable:

-bash-2.05b# grep filp /proc/slabinfo
filp              234727 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    135
-bash-2.05b# grep filp /proc/slabinfo
filp              234776 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    168
-bash-2.05b# grep filp /proc/slabinfo
filp              234728 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    180
-bash-2.05b# grep filp /proc/slabinfo
filp              234724 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    174

So reverting the nr_files logic to a single integer would be enough, even for NUMA:

int nr_pages_used_by_filp;
int nr_pages_filp_limit;
int filp_in_danger __read_mostly;

static void callback_pages_in_use_by_filp(int inc)
{
     int in_danger;

     nr_pages_used_by_filp += inc;

     in_danger = nr_pages_used_by_filp >= nr_pages_filp_limit;
     if (in_danger != filp_in_danger)
         filp_in_danger = in_danger;
}

struct file *get_empty_filp(void)
{
...
if (filp_in_danger && !capable(CAP_SYS_ADMIN))
	goto over;

...
}


void __init files_init(unsigned long mempages)
{
...
nr_pages_filp_limit = (mempages * 10) / 100; /* 10% for filp use */
...
}


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes
  2007-05-05  5:41     ` Eric Dumazet
@ 2007-05-05  7:37       ` Eric Dumazet
  2007-05-05 15:39       ` Christoph Lameter
  1 sibling, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2007-05-05  7:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Lameter, linux-kernel, linux-mm, dgc, Mel Gorman

Eric Dumazet wrote:
> Christoph Lameter wrote:
> 
>>
>>> G. Being able to track the number of pages in a kmem_cache
>>>
>>>
>>> If you look at fs/buffer.c, you'll notice the bh_accounting, 
>>> recalc_bh_state()
>>> that might be overkill for large SMP configurations, when the real 
>>> concern is
>>> to be able to limit the bh's not to exceed 10% of LOWMEM.
>>>
>>> Adding a callback in slab_ops to track total number of pages in use 
>>> by a given
>>> kmem_cache would be good.
>>
>> Such functionality exists internal to SLUB and in the reporting tool. 
>> I can export that function if you need it.
>>
>>> Same thing for fs/file_table.c : nr_file logic
>>> (percpu_counter_dec()/percpu_counter_inc() for each file open/close) 
>>> could be
>>> simplified if we could just count the pages in use by filp_cachep 
>>> kmem_cache.
>>> The get_nr_files() thing is not worth the pain.
>>
>> Sure. What exactly do you want? The absolute number of pages of memory 
>> that the slab is using?
>>
>>     kmem_cache_pages_in_use(struct kmem_cache *) ?
>>
>> The call will not be too lightweight since we will have to loop over 
>> all nodes and add the counters in each per node struct for allocates 
>> slabs.
>>
>>
> 
> On a typical system, the number of pages for the 'filp' kmem_cache tends to
> be stable:
> 
> -bash-2.05b# grep filp /proc/slabinfo
> filp              234727 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    135
> -bash-2.05b# grep filp /proc/slabinfo
> filp              234776 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    168
> -bash-2.05b# grep filp /proc/slabinfo
> filp              234728 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    180
> -bash-2.05b# grep filp /proc/slabinfo
> filp              234724 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    174
> 
> So reverting the nr_files logic to a single integer would be enough, even for NUMA:
> 
> int nr_pages_used_by_filp;
> int nr_pages_filp_limit;
> int filp_in_danger __read_mostly;
> 
> static void callback_pages_in_use_by_filp(int inc)
> {
>     int in_danger;
> 
>     nr_pages_used_by_filp += inc;
> 
>     in_danger = nr_pages_used_by_filp >= nr_pages_filp_limit;
>     if (in_danger != filp_in_danger)
>         filp_in_danger = in_danger;
> }
> 
> struct file *get_empty_filp(void)
> {
> ...
> if (filp_in_danger && !capable(CAP_SYS_ADMIN))
>     goto over;
> 
> ...
> }
> 
> 
> void __init files_init(unsigned long mempages)
> {
> ...
> nr_pages_filp_limit = (mempages * 10) / 100; /* 10% for filp use */
> ...
> }


This won't work, of course: once the page limit is hit, file allocations would
be forbidden until the cache is shrunk.

Maybe the callback should return a status, so that SLAB / SLUB can return ENOMEM:

static int callback_pages_in_use_by_filp(int inc)
{
     int in_danger;

     nr_pages_used_by_filp += inc;

     in_danger = nr_pages_used_by_filp >= nr_pages_filp_limit;
     if (unlikely(in_danger != filp_in_danger))
         filp_in_danger = in_danger;
     if (unlikely(in_danger && inc > 0 && !capable(CAP_SYS_ADMIN))) {
         nr_pages_used_by_filp -= inc;
         return -1;
     }
     return 0;
}

No more tests in get_empty_filp(): just call kmem_cache_alloc():

struct file *get_empty_filp(void)
{
...
f = kmem_cache_alloc(filp_cachep, GFP_KERNEL);
...
}
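
The slab side is not shown above. A rough sketch of where SLUB could invoke
such a hook (the account_pages name is made up, nothing like it is in the
posted patches); it would be called once per slab page, so the cost is per
page rather than per object:

static struct page *allocate_slab_page(struct kmem_cache *s, gfp_t flags)
{
	struct page *page = alloc_pages(flags, s->order);

	if (!page)
		return NULL;

	if (s->slab_ops->account_pages &&
	    s->slab_ops->account_pages(1 << s->order) < 0) {
		/* Cache is over its page budget: undo and fail, so that
		 * kmem_cache_alloc() ends up returning NULL. */
		__free_pages(page, s->order);
		return NULL;
	}
	return page;
}

Freeing a slab page would call the same hook with a negative page count.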


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 1/3] SLUB: slab_ops instead of constructors / destructors
  2007-05-04 22:15 ` [RFC 1/3] SLUB: slab_ops instead of constructors / destructors clameter
@ 2007-05-05 10:14   ` Pekka Enberg
  2007-05-05 15:43     ` Christoph Lameter
  2007-05-06 19:19   ` Bert Wesarg
  1 sibling, 1 reply; 23+ messages in thread
From: Pekka Enberg @ 2007-05-05 10:14 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On 5/5/07, clameter@sgi.com <clameter@sgi.com> wrote:
> This patch gets rid of constructors and destructors and replaces them
> with a slab operations structure that is passed into SLUB.

Looks good to me.

On 5/5/07, clameter@sgi.com <clameter@sgi.com> wrote:
> +struct slab_ops {
> +       /* FIXME: ctor should only take the object as an argument. */
> +       void (*ctor)(void *, struct kmem_cache *, unsigned long);
> +       /* FIXME: Remove all destructors ? */
> +       void (*dtor)(void *, struct kmem_cache *, unsigned long);
> +};

For consistency with other operations structures, can we make this
struct kmem_cache_operations or kmem_cache_ops, please?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-04 22:15 ` [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation clameter
  2007-05-04 23:03   ` Christoph Lameter
  2007-05-05  5:32   ` William Lee Irwin III
@ 2007-05-05 10:38   ` Andi Kleen
  2007-05-05 15:42     ` Christoph Lameter
  2007-05-09 15:05   ` Mel Gorman
  3 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2007-05-05 10:38 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

clameter@sgi.com writes:
> 
> NOTE: This patch is for conceptual review. I'd appreciate any feedback
> especially on the locking approach taken here. It will be critical to
> resolve the locking issue for this approach to become feasible.

Do you have any numbers on how this improves dcache reclaim under memory pressure?

-Andi


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-05  5:32   ` William Lee Irwin III
@ 2007-05-05 15:35     ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-05-05 15:35 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On Fri, 4 May 2007, William Lee Irwin III wrote:

> kick_object() doesn't return an indicator of success, which might be
> helpful for determining whether an object was successfully removed. The
> later-added kick_dentry_object(), for instance, can't remove dentries
> where reference counts are still held.
> 
> I suppose one could check to see if the ->inuse counter decreased, too.

Yes that is exactly what is done. The issue is that concurrent frees may 
occur. So we just kick them all and see if all objects are gone at the 
end.
 
> In either event, it would probably be helpful to abort the operation if
> there was a reclamation failure for an object within the slab.

Hmmm... The failure may be because another process is attempting 
a kmem_cache_free on an object. But we are holding the lock. The free
will succeed when we drop it.

> This is a relatively minor optimization concern. I think this patch
> series is great and a significant foray into the problem of slab
> reclaim vs. fragmentation.

Thanks.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes
  2007-05-05  5:41     ` Eric Dumazet
  2007-05-05  7:37       ` Eric Dumazet
@ 2007-05-05 15:39       ` Christoph Lameter
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-05-05 15:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, linux-mm, dgc, Mel Gorman

On Sat, 5 May 2007, Eric Dumazet wrote:

> > Then add ___cacheline_aligned_in_smp or specify the alignment in the various
> > other ways that exist. Practice is that most slabs specify
> > SLAB_HWCACHE_ALIGN. So most slabs are cache aligned today.
> 
> Yes, but this alignment is dynamic, not known at compile time.
> 
> include/asm-i386/processor.h:739:#define cache_line_size() (boot_cpu_data.x86_cache_alignment)

Ahh.. I did not see that before.

> So adding ____cacheline_aligned to 'struct file', for example, would be a
> regression for people with a PII or PIII.

Yuck.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-05 10:38   ` Andi Kleen
@ 2007-05-05 15:42     ` Christoph Lameter
  2007-05-05 17:11       ` Andi Kleen
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2007-05-05 15:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On Sat, 5 May 2007, Andi Kleen wrote:

> clameter@sgi.com writes:
> > 
> > NOTE: This patch is for conceptual review. I'd appreciate any feedback
> > especially on the locking approach taken here. It will be critical to
> > resolve the locking issue for this approach to become feasible.
> 
> Do you have any numbers on how this improves dcache reclaim under memory pressure?

How does one measure something like that?

I wanted to first make sure that the thing is sane. If there is a gaping 
race here then I may have to add more code to cover that.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 1/3] SLUB: slab_ops instead of constructors / destructors
  2007-05-05 10:14   ` Pekka Enberg
@ 2007-05-05 15:43     ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-05-05 15:43 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On Sat, 5 May 2007, Pekka Enberg wrote:

> For consistency with other operations structures, can we make this
> struct kmem_cache_operations or kmem_cache_ops, please?

Ok.
 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-05 15:42     ` Christoph Lameter
@ 2007-05-05 17:11       ` Andi Kleen
  0 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2007-05-05 17:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On Sat, May 05, 2007 at 08:42:50AM -0700, Christoph Lameter wrote:
> On Sat, 5 May 2007, Andi Kleen wrote:
> 
> > clameter@sgi.com writes:
> > > 
> > > NOTE: This patch is for conceptual review. I'd appreciate any feedback
> > > especially on the locking approach taken here. It will be critical to
> > > resolve the locking issue for this approach to become feasible.
> > 
> > Do you have any numbers on how this improves dcache reclaim under memory pressure?
> 
> How does one measure something like that?

You could measure how many dentries are flushed before an order > 1 allocation
is satisfied? Make sure to fill the dcache first.

There are no counters there for this, but that should be reasonably easy to add.
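
A minimal sketch of such a counter (the event name is made up and would also
need a matching string in vmstat_text[]):

/* include/linux/vmstat.h: add an entry to enum vm_event_item */
	PRUNE_DENTRY,

/* fs/dcache.c: bump it where a dentry is actually reclaimed, e.g. in
 * prune_one_dentry(); the value then shows up in /proc/vmstat. */
	count_vm_event(PRUNE_DENTRY);

Fill the dcache, note the counter, force an order > 1 allocation, and read
it again once the allocation succeeds.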

-Andi


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 1/3] SLUB: slab_ops instead of constructors / destructors
  2007-05-04 22:15 ` [RFC 1/3] SLUB: slab_ops instead of constructors / destructors clameter
  2007-05-05 10:14   ` Pekka Enberg
@ 2007-05-06 19:19   ` Bert Wesarg
  2007-05-06 19:46     ` Satyam Sharma
  1 sibling, 1 reply; 23+ messages in thread
From: Bert Wesarg @ 2007-05-06 19:19 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

clameter@sgi.com wrote:
> +	if (ctor || dtor) {
> +		so = kzalloc(sizeof(struct slab_ops), GFP_KERNEL);
> +		so->ctor = ctor;
> +		so->dtor = dtor;
> +	}
> +	return  __kmem_cache_create(s, size, align, flags, so);
Is this a memory leak?

Regards
Bert Wesarg


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 1/3] SLUB: slab_ops instead of constructors / destructors
  2007-05-06 19:19   ` Bert Wesarg
@ 2007-05-06 19:46     ` Satyam Sharma
  0 siblings, 0 replies; 23+ messages in thread
From: Satyam Sharma @ 2007-05-06 19:46 UTC (permalink / raw)
  To: Bert Wesarg
  Cc: clameter, linux-kernel, linux-mm, dgc, Eric Dumazet, Mel Gorman

On 5/7/07, Bert Wesarg <wesarg@informatik.uni-halle.de> wrote:
> clameter@sgi.com wrote:
> > +     if (ctor || dtor) {
> > +             so = kzalloc(sizeof(struct slab_ops), GFP_KERNEL);
> > +             so->ctor = ctor;
> > +             so->dtor = dtor;
> > +     }
> > +     return  __kmem_cache_create(s, size, align, flags, so);
> Is this a memory leak?

Yes, but see:

On 5/5/07, clameter@sgi.com <clameter@sgi.com> wrote:
> If constructor or destructor are specified then we will allocate a slab_ops
> structure and populate it with the values specified. Note that this will
> cause a memory leak if the slab is disposed of later. If you need disposable
> slabs then the new API must be used.

BTW:

> > +     if (ctor || dtor) {
> > +             so = kzalloc(sizeof(struct slab_ops), GFP_KERNEL);
> > +             so->ctor = ctor;

It's also a potential oops, actually. kzalloc's return must be checked.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-04 22:15 ` [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation clameter
                     ` (2 preceding siblings ...)
  2007-05-05 10:38   ` Andi Kleen
@ 2007-05-09 15:05   ` Mel Gorman
  2007-05-09 16:34     ` Christoph Lameter
  3 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2007-05-09 15:05 UTC (permalink / raw)
  To: clameter
  Cc: Linux Kernel Mailing List, Linux Memory Management List, dgc,
	Eric Dumazet

On Fri, 4 May 2007, clameter@sgi.com wrote:

> Targeted reclaim allows targeting a single slab for reclaim. This is done by
> calling
>
> kmem_cache_vacate(slab, page);
>
> It will return 1 on success, 0 if the operation failed.
>
> The vacate functionality is also used for slab shrinking. During the shrink
> operation SLUB will generate a list sorted by the number of objects in use.
>
> We extract pages off that list that are less than a quarter full. These
> pages are then processed using kmem_cache_vacate.
>
> In order for a slabcache to support this functionality two functions must
> be defined via slab_operations.
>
> get_reference(void *)
>
> Must obtain a reference to the object if it has not been freed yet. It is up
> to the slabcache to resolve the race.

Is there also a race here between the object being searched for and the 
reference being taken? Would it be appropriate to call this 
find_get_slub_object() to mirror what find_get_page() does, except that instead
of an offset it receives some key meaningful to the cache? It might help set
a developer's expectations of what the function is required to do when
creating such a cache, and simplify locking.

(written much later) Looking through, it's not super-clear if 
get_reference() is only for use by SLUB to get an exclusive reference to 
an object or something that is generally called to reference count it. It 
looks like only one reference is ever taken.

> SLUB guarantees that the object is
> still allocated. However, another thread may be blocked in slab_free
> attempting to free the same object. It may succeed as soon as
> get_reference() returns to the slab allocator.
>

Fun. If it was find_get_slub_object() instead of get_reference(), the 
caller could block if the object is currently being freed until slab_free 
completed. It would take a hit because the object has to be recreated but 
maybe it would be an easier race to deal with?

> get_reference() processing must recognize this situation (i.e. check refcount
> for zero) and fail in such a situation (no problem since the object will
> be freed as soon as we drop the slab lock before doing kick calls).
>

Block or fail?

> No slab operations may be performed in get_reference(). The slab lock
> for the page with the object is taken. Any slab operations may lead to
> a deadlock.
>
> 2. kick_object(void *)
>
> After SLUB has established references to the remaining objects in a slab it
> will drop all locks and then use kick_object on each of the objects for which
> we obtained a reference. The existence of the objects is guaranteed by
> virtue of the earlier obtained reference. The callback may perform any
> slab operation since no locks are held at the time of call.
>
> The callback should remove the object from the slab in some way. This may
> be accomplished by reclaiming the object and then running kmem_cache_free()
> or reallocating it and then running kmem_cache_free(). Reallocation
> is advantageous at this point because it will then allocate from the partial
> slabs with the most objects because we have just finished slab shrinking.
>

From the comments alone, it sounds like the page cache API should be
copied in name at least like slub_object_get(), slub_object_release() and 
maybe even something like slub_object_get_unless_vacating() if the caller 
must not sleep. It seems the page cache shares similar problems to what 
SLUB is doing here.


> NOTE: This patch is for conceptual review. I'd appreciate any feedback
> especially on the locking approach taken here. It will be critical to
> resolve the locking issue for this approach to become feasible.
>

Take everything I say here with a grain of salt. My understanding of SLUB 
is non-existent at the moment and it makes reviewing this a bit tricky.

> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> include/linux/slab.h |    3
> mm/slub.c            |  159 ++++++++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 154 insertions(+), 8 deletions(-)
>
> Index: slub/include/linux/slab.h
> ===================================================================
> --- slub.orig/include/linux/slab.h	2007-05-04 13:32:34.000000000 -0700
> +++ slub/include/linux/slab.h	2007-05-04 13:32:50.000000000 -0700
> @@ -42,6 +42,8 @@ struct slab_ops {
> 	void (*ctor)(void *, struct kmem_cache *, unsigned long);
> 	/* FIXME: Remove all destructors ? */
> 	void (*dtor)(void *, struct kmem_cache *, unsigned long);
> +	int (*get_reference)(void *);
> +	void (*kick_object)(void *);
> };
>
> struct kmem_cache *__kmem_cache_create(const char *, size_t, size_t,
> @@ -54,6 +56,7 @@ void kmem_cache_free(struct kmem_cache *
> unsigned int kmem_cache_size(struct kmem_cache *);
> const char *kmem_cache_name(struct kmem_cache *);
> int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
> +int kmem_cache_vacate(struct page *);
>
> /*
>  * Please use this macro to create slab caches. Simply specify the
> Index: slub/mm/slub.c
> ===================================================================
> --- slub.orig/mm/slub.c	2007-05-04 13:32:34.000000000 -0700
> +++ slub/mm/slub.c	2007-05-04 13:56:25.000000000 -0700
> @@ -173,7 +173,7 @@ static struct notifier_block slab_notifi
> static enum {
> 	DOWN,		/* No slab functionality available */
> 	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
> -	UP,		/* Everything works */
> +	UP,		/* Everything works but does not show up in sysfs */

Should this be here?

> 	SYSFS		/* Sysfs up */
> } slab_state = DOWN;
>
> @@ -211,6 +211,8 @@ static inline struct kmem_cache_node *ge
>
> struct slab_ops default_slab_ops = {
> 	NULL,
> +	NULL,
> +	NULL,
> 	NULL
> };
>
> @@ -839,13 +841,10 @@ static struct page *new_slab(struct kmem
> 	n = get_node(s, page_to_nid(page));
> 	if (n)
> 		atomic_long_inc(&n->nr_slabs);
> +
> 	page->offset = s->offset / sizeof(void *);
> 	page->slab = s;
> -	page->flags |= 1 << PG_slab;
> -	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
> -			SLAB_STORE_USER | SLAB_TRACE))
> -		page->flags |= 1 << PG_error;
> -
> +	page->inuse = 0;

You probably have explained this a million times already and in the main 
SLUB patches, but why is _count not used?

> 	start = page_address(page);
> 	end = start + s->objects * s->size;
>
> @@ -862,7 +861,17 @@ static struct page *new_slab(struct kmem
> 	set_freepointer(s, last, NULL);
>
> 	page->freelist = start;
> -	page->inuse = 0;
> +
> +	/*
> +	 * pages->inuse must be visible when PageSlab(page) becomes
> +	 * true for targeted reclaim
> +	 */
> +	smp_wmb();
> +	page->flags |= 1 << PG_slab;
> +	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
> +			SLAB_STORE_USER | SLAB_TRACE))
> +		page->flags |= 1 << PG_error;
> +
> out:
> 	if (flags & __GFP_WAIT)
> 		local_irq_disable();
> @@ -2124,6 +2133,111 @@ void kfree(const void *x)
> EXPORT_SYMBOL(kfree);
>
> /*
> + * Vacate all objects in the given slab. Slab must be locked.
> + *
> + * Will drop and regain and drop the slab lock.
> + * Slab must be marked PageActive() to avoid concurrent slab_free from

I just noticed this PageActive() overloading of the page->flags. Is it 
worth defining SlabPerCPU() as an alias or something?

> + * remove the slab from the lists. At the end the slab will either
> + * be freed or have been returned to the partial lists.
> + *
> + * Return error code or number of remaining objects
> + */
> +static int __kmem_cache_vacate(struct kmem_cache *s, struct page *page)
> +{
> +	void *p;
> +	void *addr = page_address(page);
> +	unsigned long map[BITS_TO_LONGS(s->objects)];
> +	int leftover;
> +
> +	if (!page->inuse)
> +		return 0;
> +
> +	/* Determine free objects */
> +	bitmap_zero(map, s->objects);
> +	for(p = page->freelist; p; p = get_freepointer(s, p))
> +		set_bit((p - addr) / s->size, map);
> +
> +	/*
> +	 * Get a refcount for all used objects. If that fails then
> +	 * no KICK callback can be performed.
> +	 */
> +	for(p = addr; p < addr + s->objects * s->size; p += s->size)
> +		if (!test_bit((p - addr) / s->size, map))
> +			if (!s->slab_ops->get_reference(p))
> +				set_bit((p - addr) / s->size, map);
> +

The comment and code imply that only one caller can hold a reference at
a time and the count is either 1 or 0.

(later)

It looks like get_reference() is only intended for use by SLUB. I don't 
currently see how an implementer of a cache could determine if an object 
is being actively referenced or not when deciding whether to return 1 for 
get_reference() unless the cache interally implemented a referencing count 
API. That might lead to many, similar but subtly different implementations 
of reference counting.

> +	/* Got all the references we need. Now we can drop the slab lock */
> +	slab_unlock(page);
> +

Where did we check we got all the references? It looks more like we got 
all the references we could.

> +	/* Perform the KICK callbacks to remove the objects */
> +	for(p = addr; p < addr + s->objects * s->size; p += s->size)
> +		if (!test_bit((p - addr) / s->size, map))
> +			s->slab_ops->kick_object(p);
> +
> +	slab_lock(page);
> +	leftover = page->inuse;
> +	ClearPageActive(page);
> +	putback_slab(s, page);
> +	return leftover;
> +}

Ok, how this works is clear from the comment in kmem_cache_vacate(). A 
note saying to look at the comment there may be useful.

> +
> +/*
> + * Remove a page from the lists. Must be holding slab lock.
> + */
> +static void remove_from_lists(struct kmem_cache *s, struct page *page)
> +{
> +	if (page->inuse < s->objects)
> +		remove_partial(s, page);
> +	else if (s->flags & SLAB_STORE_USER)
> +		remove_full(s, page);
> +}
> +
> +/*
> + * Attempt to free objects in a page. Return 1 when succesful.
> + */

when or if? Looks more like if

> +int kmem_cache_vacate(struct page *page)
> +{
> +	struct kmem_cache *s;
> +	int rc = 0;
> +
> +	/* Get a reference to the page. Return if its freed or being freed */
> +	if (!get_page_unless_zero(page))
> +		return 0;
> +
> +	/* Check that this is truly a slab page */
> +	if (!PageSlab(page))
> +		goto out;
> +
> +	slab_lock(page);
> +
> +	/*
> +	 * We may now have locked a page that is in various stages of being
> +	 * freed. If the PageSlab bit is off then we have already reached
> +	 * the page allocator. If page->inuse is zero then we are
> +	 * in SLUB but freeing or allocating the page.
> +	 * page->inuse is never modified without the slab lock held.
> +	 *
> +	 * Also abort if the page happens to be a per cpu slab
> +	 */
> +	if (!PageSlab(page) || PageActive(page) || !page->inuse) {
> +		slab_unlock(page);
> +		goto out;
> +	}
> +

The PageActive() part is going to confuse someone eventually because 
they'll interpret it to mean that slab pages are on the LRU now.

> +	/*
> +	 * We are holding a lock on a slab page that is not in the
> +	 * process of being allocated or freed.
> +	 */
> +	s = page->slab;
> +	remove_from_lists(s, page);
> +	SetPageActive(page);
> +	rc = __kmem_cache_vacate(s, page) == 0;

The name rc is misleading here. I am reading it as reference count, but 
it's not a reference count at all.

> +out:
> +	put_page(page);
> +	return rc;
> +}

Where do partially freed slab pages get put back on the lists? Do they 
just remain off the lists until they are totally freed as a type of lazy 
free? If so, a wait_on_slab_page_free() call may be needed later.

> +
> +/*
>  *  kmem_cache_shrink removes empty slabs from the partial lists
>  *  and then sorts the partially allocated slabs by the number
>  *  of items in use. The slabs with the most items in use
> @@ -2137,11 +2251,12 @@ int kmem_cache_shrink(struct kmem_cache
> 	int node;
> 	int i;
> 	struct kmem_cache_node *n;
> -	struct page *page;
> +	struct page *page, *page2;

page2 is a pretty bad name.

> 	struct page *t;
> 	struct list_head *slabs_by_inuse =
> 		kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
> 	unsigned long flags;
> +	LIST_HEAD(zaplist);
>
> 	if (!slabs_by_inuse)
> 		return -ENOMEM;
> @@ -2194,8 +2309,36 @@ int kmem_cache_shrink(struct kmem_cache
> 		for (i = s->objects - 1; i >= 0; i--)
> 			list_splice(slabs_by_inuse + i, n->partial.prev);
>
> +		if (!s->slab_ops->get_reference || !s->slab_ops->kick_object)
> +			goto out;
> +
> +		/* Take objects with just a few objects off the tail */
> +		while (n->nr_partial > MAX_PARTIAL) {
> +			page = container_of(n->partial.prev, struct page, lru);
> +
> +			/*
> +			 * We are holding the list_lock so we can only
> +			 * trylock the slab
> +			 */
> +			if (!slab_trylock(page))
> +				break;
> +
> +			if (page->inuse > s->objects / 4)
> +				break;
> +

I get the idea of the check but it may have unexpected consequences if all 
of the slabs in use happen to be half full.

> +			list_move(&page->lru, &zaplist);
> +			n->nr_partial--;
> +			SetPageActive(page);
> +			slab_unlock(page);
> +		}

Ok, not sure if this is a problem or not. I don't get why SetPageActive() 
is being called and it's because I don't understand SLUB yet.

> 	out:
> 		spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +		/* Now we can free objects in the slabs on the zaplist */
> +		list_for_each_entry_safe(page, page2, &zaplist, lru) {
> +			slab_lock(page);
> +			__kmem_cache_vacate(s, page);
> +		}

The return value is not checked here, which is why I think that slab pages,
once vacated, are expected to be off the lists and totally freed. If that
is the case, it might have consequences if the cache is badly behaved and
never frees objects.

> 	}
>
> 	kfree(slabs_by_inuse);
>
> -- 
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation
  2007-05-09 15:05   ` Mel Gorman
@ 2007-05-09 16:34     ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-05-09 16:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel Mailing List, Linux Memory Management List, dgc,
	Eric Dumazet

On Wed, 9 May 2007, Mel Gorman wrote:

> > Must obtain a reference to the object if it has not been freed yet. It is up
> > to the slabcache to resolve the race.
> 
> Is there also a race here between the object being searched for and the
> reference being taken? Would it be appropriate to call this
> find_get_slub_object() to mirror what find_get_page() does except instead of
> offset, it receives some key meaningful to the cache? It might help set a
> developers expectation of what the function is required to do when creating
> such a cache and simplify locking.
> 
> (written much later) Looking through, it's not super-clear if get_reference()
> is only for use by SLUB to get an exclusive reference to an object or
> something that is generally called to reference count it. It looks like only
> one reference is ever taken.

Right. It's only for SLUB to get a reference to be able to free the
object. I am not sure what you mean by the object being searched for? In 
the partial lists of SLUB? That has its own locking scheme.

> > SLUB guarantees that the object is
> > still allocated. However, another thread may be blocked in slab_free
> > attempting to free the same object. It may succeed as soon as
> > get_reference() returns to the slab allocator.
> Fun. If it was find_get_slub_object() instead of get_reference(), the caller
> could block if the object is currently being freed until slab_free completed.
> It would take a hit because the object has to be recreated but maybe it would
> be an easier race to deal with?

It is much easier to let the competing thread succeed. If we find that
the refcount == 0 then a free is in progress and the caller is blocked. SLUB
will drop all locks after references have been established. At that point
the concurrent free will succeed. Thus there is no need to wait.
 
> > get_reference() processing must recognize this situation (i.e. check refcount
> > for zero) and fail in such a situation (no problem since the object will
> > be freed as soon as we drop the slab lock before doing kick calls).
> > 
> 
> Block or fail?

Fail. The competing free must succeed.
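
As a sketch, for a cache whose objects carry an atomic refcount (struct foo
and its fields are made up here; the real dentry hook is in patch 3/3 and may
differ):

struct foo {
	atomic_t refcount;
	/* ... cache specific payload ... */
};

static int foo_get_reference(void *object)
{
	struct foo *f = object;

	/*
	 * Refuse the reference if the count has already hit zero: a
	 * concurrent kmem_cache_free() is in flight and will complete
	 * as soon as the slab lock is dropped, so this object is
	 * skipped rather than kicked.
	 */
	return atomic_inc_not_zero(&f->refcount);
}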

> > The callback should remove the object from the slab in some way. This may
> > be accomplished by reclaiming the object and then running kmem_cache_free()
> > or reallocating it and then running kmem_cache_free(). Reallocation
> > is advantageous at this point because it will then allocate from the partial
> > slabs with the most objects because we have just finished slab shrinking.
> > 
> 
> From the comments alone, it sounds like the page cache API should be
> copied in name at least like slub_object_get(), slub_object_release() and
> maybe even something like slub_object_get_unless_vacating() if the caller must
> not sleep. It seems the page cache shares similar problems to what SLUB is
> doing here.

We already have such an API. See include/linux/mm.h

/*
 * Try to grab a ref unless the page has a refcount of zero, return false 
if
 * that is the case.
 */
static inline int get_page_unless_zero(struct page *page)
{
        VM_BUG_ON(PageCompound(page));
        return atomic_inc_not_zero(&page->_count);
}


> Take everything I say here with a grain of salt. My understanding of SLUB is
> non-existent at the moment and it makes reviewing this a bit tricky.

Thanks for looking at it. This could greatly increase the usefulness of
defragmentation. If all reclaimable slabs supported the hooks then we
might even be able to get rid of the reclaimable section?

> > Index: slub/mm/slub.c
> > ===================================================================
> > --- slub.orig/mm/slub.c	2007-05-04 13:32:34.000000000 -0700
> > +++ slub/mm/slub.c	2007-05-04 13:56:25.000000000 -0700
> > @@ -173,7 +173,7 @@ static struct notifier_block slab_notifi
> > static enum {
> > 	DOWN,		/* No slab functionality available */
> > 	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
> > -	UP,		/* Everything works */
> > +	UP,		/* Everything works but does not show up in sysfs */
> 
> Should this be here?

No that slipped in here somehow.
 

> > -	page->flags |= 1 << PG_slab;
> > -	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
> > -			SLAB_STORE_USER | SLAB_TRACE))
> > -		page->flags |= 1 << PG_error;
> > -
> > +	page->inuse = 0;
> 
> You probably have explained this a million times already and in the main SLUB
> patches, but why is _count not used?

_count is used by kmem_cache_vacate to ensure that the page does not go
away. page->inuse is used to make sure that we do not vacate objects in a
slab that is just being set up as a slab or being torn down and freed.
_count affects the page allocator; page->inuse affects the slab processing.

> > /*
> > + * Vacate all objects in the given slab. Slab must be locked.
> > + *
> > + * Will drop and regain and drop the slab lock.
> > + * Slab must be marked PageActive() to avoid concurrent slab_free from
> 
> I just noticed this PageActive() overloading of the page->flags. Is it worth
> defining SlabPerCPU() as an alias or something?

I have done this already for PageError. Maybe someday.

> > +	/*
> > +	 * Get a refcount for all used objects. If that fails then
> > +	 * no KICK callback can be performed.
> > +	 */
> > +	for(p = addr; p < addr + s->objects * s->size; p += s->size)
> > +		if (!test_bit((p - addr) / s->size, map))
> > +			if (!s->slab_ops->get_reference(p))
> > +				set_bit((p - addr) / s->size, map);
> > +
> 
> The comment and code implies that only one caller can hold a reference at a
> time and the count is either 1 or 0.

Multiple callers can hold a reference. If the reference is not gone by the 
time we do the kick calls then the object will not be freed.
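
Continuing the made-up struct foo example from above, the matching kick would
drop the reference taken in get_reference() and only remove the object if
that turns out to be the last one (a real cache would first try to tear the
object down, e.g. prune the dentry):

static void foo_kick_object(void *object)
{
	struct foo *f = object;

	/*
	 * Drop the reference obtained in foo_get_reference(). If other
	 * users still hold references the object stays in place and the
	 * vacate attempt for this slab simply does not reach zero objects.
	 */
	if (atomic_dec_and_test(&f->refcount))
		kmem_cache_free(foo_cachep, f);	/* foo_cachep: made-up cache */
}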

> 
> (later)
> 
> It looks like get_reference() is only intended for use by SLUB. I don't
> currently see how an implementer of a cache could determine if an object is
> being actively referenced or not when deciding whether to return 1 for
> get_reference() unless the cache interally implemented a referencing count
> API. That might lead to many, similar but subtly different implementations of
> reference counting.

Right the reclaimable caches in general implement such a scheme. The hook 
is there so that the different implementations of reference counting can 
be handled.

> 
> > +	/* Got all the references we need. Now we can drop the slab lock */
> > +	slab_unlock(page);
> > +
> 
> Where did we check we got all the references? It looks more like we got all
> the references we could.

Right. We only reclaim those objects for which we did get references. The others
may fail or succeed (concurrent free) independently.

> > +
> > +/*
> > + * Attempt to free objects in a page. Return 1 when succesful.
> > + */
> 
> when or if? Looks more like if

Right. Sorry in German if = when

> > +	if (!PageSlab(page) || PageActive(page) || !page->inuse) {
> > +		slab_unlock(page);
> > +		goto out;
> > +	}
> > +
> 
> The PageActive() part is going to confuse someone eventually because they'll
> interpret it to mean that slab pages are on the LRU now.

Ok one more argument to have a different name there.

> > +	/*
> > +	 * We are holding a lock on a slab page that is not in the
> > +	 * process of being allocated or freed.
> > +	 */
> > +	s = page->slab;
> > +	remove_from_lists(s, page);
> > +	SetPageActive(page);
> > +	rc = __kmem_cache_vacate(s, page) == 0;
> 
> The name rc is misleading here. I am reading it as reference count, but it's
> not a reference count at all.

It's the result, yes. Need to rename it.

> > +out:
> > +	put_page(page);
> > +	return rc;
> > +}
> 
> Where do partially freed slab pages get put back on the lists? Do they just
> remain off the lists until they are totally freed as a type of lazy free? If
> so, a wait_on_slab_page_free() call may be needed later.

No, putback_slab puts them back. The refcount for the page is separate.
This means that a page may be "freed" by SLUB while kmem_cache_vacate 
is running but it will not really be freed because the refcount is > 0.
Only if kmem_cache_vacate terminates will the page be returned to the free 
page pool. Otherwise the page could vanish under us and be used for some 
other purpose.

> > +/*
> >  *  kmem_cache_shrink removes empty slabs from the partial lists
> >  *  and then sorts the partially allocated slabs by the number
> >  *  of items in use. The slabs with the most items in use
> > @@ -2137,11 +2251,12 @@ int kmem_cache_shrink(struct kmem_cache
> > 	int node;
> > 	int i;
> > 	struct kmem_cache_node *n;
> > -	struct page *page;
> > +	struct page *page, *page2;
> 
> page2, is a pretty bad name.

It's a throwaway variable for list operations.

 
> > +			if (page->inuse > s->objects / 4)
> > +				break;
> > +
> 
> I get the idea of the check but it may have unexpected consequences if all of
> the slabs in use happen to be half full.

I am open to other proposals. I do not want the slabs to be too dense.
Having them half full means that SLUB does not have to do any list
operations on any kfree operation for a while.

> 
> > +			list_move(&page->lru, &zaplist);
> > +			n->nr_partial--;
> > +			SetPageActive(page);
> > +			slab_unlock(page);
> > +		}
> 
> Ok, not sure if this is a problem or not. I don't get why SetPageActive() is
> being called and it's because I don't understand SLUB yet.

SetPageActive disables all list operations on a slab. If all objects are 
freed in a slab then the slab is usually freed. If PageActive is set then 
the slab is left alone.

> > 		spin_unlock_irqrestore(&n->list_lock, flags);
> > +
> > +		/* Now we can free objects in the slabs on the zaplist */
> > +		list_for_each_entry_safe(page, page2, &zaplist, lru) {
> > +			slab_lock(page);
> > +			__kmem_cache_vacate(s, page);
> > +		}
> 
> No checking the return value here which is why I think that slab pages once
> vacated are expected to be off the lists and totally freed. If that is the
> case, it might have consequences if the cache is badly behaving and never
> freeing objects.

__kmem_cache_vacate puts the slab back on the lists or frees it. Thus
there is no need to check return codes.


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2007-05-09 16:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-04 22:15 [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes clameter
2007-05-04 22:15 ` [RFC 1/3] SLUB: slab_ops instead of constructors / destructors clameter
2007-05-05 10:14   ` Pekka Enberg
2007-05-05 15:43     ` Christoph Lameter
2007-05-06 19:19   ` Bert Wesarg
2007-05-06 19:46     ` Satyam Sharma
2007-05-04 22:15 ` [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation clameter
2007-05-04 23:03   ` Christoph Lameter
2007-05-05  1:04     ` Randy Dunlap
2007-05-05  1:07       ` Christoph Lameter
2007-05-05  5:32   ` William Lee Irwin III
2007-05-05 15:35     ` Christoph Lameter
2007-05-05 10:38   ` Andi Kleen
2007-05-05 15:42     ` Christoph Lameter
2007-05-05 17:11       ` Andi Kleen
2007-05-09 15:05   ` Mel Gorman
2007-05-09 16:34     ` Christoph Lameter
2007-05-04 22:15 ` [RFC 3/3] Support targeted reclaim and slab defrag for dentry cache clameter
2007-05-05  5:07 ` [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes Eric Dumazet
2007-05-05  5:14   ` Christoph Lameter
2007-05-05  5:41     ` Eric Dumazet
2007-05-05  7:37       ` Eric Dumazet
2007-05-05 15:39       ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox