* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
[not found] ` <p73y7hrywel.fsf@bingen.suse.de>
@ 2007-07-09 15:50 ` Christoph Lameter
2007-07-09 15:59 ` Martin Bligh
0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-07-09 15:50 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, linux-mm, mbligh
On Sun, 8 Jul 2007, Andi Kleen wrote:
> Christoph Lameter <clameter@sgi.com> writes:
>
> > A cmpxchg is less costly than interrupt enable/disable
>
> That sounds wrong.
Martin Bligh was able to significantly increase his LTTng performance
by using cmpxchg. See his article in the 2007 proceedings of the OLS
Volume 1, page 39.
His numbers were:
interrupts enable disable : 210.6ns
local cmpxchg : 9.0ns
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-09 15:50 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
@ 2007-07-09 15:59 ` Martin Bligh
2007-07-09 18:11 ` Christoph Lameter
0 siblings, 1 reply; 26+ messages in thread
From: Martin Bligh @ 2007-07-09 15:59 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm
Christoph Lameter wrote:
> On Sun, 8 Jul 2007, Andi Kleen wrote:
>
>> Christoph Lameter <clameter@sgi.com> writes:
>>
>>> A cmpxchg is less costly than interrupt enable/disable
>> That sounds wrong.
>
> Martin Bligh was able to significantly increase his LTTng performance
> by using cmpxchg. See his article in the 2007 proceedings of the OLS
> Volume 1, page 39.
>
> His numbers were:
>
> interrupts enable disable : 210.6ns
> local cmpxchg : 9.0ns
Those numbers came from Mathieu Desnoyers (LTTng) if you
want more details.
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
[not found] ` <84144f020707090404l657a62c7x89d7d06b3dd6c34b@mail.gmail.com>
@ 2007-07-09 16:08 ` Christoph Lameter
2007-07-10 8:17 ` Pekka J Enberg
0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-07-09 16:08 UTC (permalink / raw)
To: Pekka Enberg
Cc: Nick Piggin, Andrew Morton, Ingo Molnar, linux-kernel, linux-mm,
suresh.b.siddha, corey.d.gough, Matt Mackall, Denis Vlasenko,
Erik Andersen
On Mon, 9 Jul 2007, Pekka Enberg wrote:
> I assume with "slab external fragmentation" you mean allocating a
> whole page for a slab when there are not enough objects to fill the
> whole thing thus wasting memory? We could try to combat that by
> packing multiple variable-sized slabs within a single page. Also,
> adding some non-power-of-two kmalloc caches might help with internal
> fragmentation.
There are already non-power-of-two kmalloc caches for the 96 and 192 byte
sizes.
>
> In any case, SLUB needs some serious tuning for smaller machines
> before we can get rid of SLOB.
Switch off CONFIG_SLUB_DEBUG to get memory savings.
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-09 15:59 ` Martin Bligh
@ 2007-07-09 18:11 ` Christoph Lameter
2007-07-09 21:00 ` Martin Bligh
0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-07-09 18:11 UTC (permalink / raw)
To: Martin Bligh; +Cc: Andi Kleen, linux-kernel, linux-mm
On Mon, 9 Jul 2007, Martin Bligh wrote:
> Those numbers came from Mathieu Desnoyers (LTTng) if you
> want more details.
Okay the source for these numbers is in his paper for the OLS 2006: Volume
1 page 208-209? I do not see the exact number that you referred to there.
He seems to be comparing spinlock acquire / release vs. cmpxchg. So I
guess you got your material from somewhere else?
Also the cmpxchg used there is the lockless variant. cmpxchg 29 cycles w/o
lock prefix and 112 with lock prefix.
I see you reference another paper by Desnoyers:
http://tree.celinuxforum.org/CelfPubWiki/ELC2006Presentations?action=AttachFile&do=get&target=celf2006-desnoyers.pdf
I do not see anything relevant there. Where did those numbers come from?
The lockless cmpxchg is certainly an interesting idea. Certainly, for some
platforms I could disable preempt and then do a lockless cmpxchg.
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-09 18:11 ` Christoph Lameter
@ 2007-07-09 21:00 ` Martin Bligh
2007-07-09 21:44 ` Mathieu Desnoyers
0 siblings, 1 reply; 26+ messages in thread
From: Martin Bligh @ 2007-07-09 21:00 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm, Mathieu Desnoyers
Christoph Lameter wrote:
> On Mon, 9 Jul 2007, Martin Bligh wrote:
>
>> Those numbers came from Mathieu Desnoyers (LTTng) if you
>> want more details.
>
> Okay the source for these numbers is in his paper for the OLS 2006: Volume
> 1 page 208-209? I do not see the exact number that you referred to there.
Nope, he was a direct co-author on the paper, was
working here, and measured it.
> He seems to be comparing spinlock acquire / release vs. cmpxchg. So I
> guess you got your material from somewhere else?
>
> Also the cmpxchg used there is the lockless variant. cmpxchg 29 cycles w/o
> lock prefix and 112 with lock prefix.
>
> I see you reference another paper by Desnoyers:
> http://tree.celinuxforum.org/CelfPubWiki/ELC2006Presentations?action=AttachFile&do=get&target=celf2006-desnoyers.pdf
>
> I do not see anything relevant there. Where did those numbers come from?
>
> The lockless cmpxchg is certainly an interesting idea. Certainly, for some
> platforms I could disable preempt and then do a lockless cmpxchg.
Mathieu, can you give some more details? Obviously the exact numbers
will vary by architecture, machine size, etc., but it's a good point
for discussion.
M.
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-09 21:00 ` Martin Bligh
@ 2007-07-09 21:44 ` Mathieu Desnoyers
2007-07-09 21:55 ` Christoph Lameter
0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2007-07-09 21:44 UTC (permalink / raw)
To: Martin Bligh; +Cc: Christoph Lameter, Andi Kleen, linux-kernel, linux-mm
Hi,
* Martin Bligh (mbligh@mbligh.org) wrote:
> Christoph Lameter wrote:
> >On Mon, 9 Jul 2007, Martin Bligh wrote:
> >
> >>Those numbers came from Mathieu Desnoyers (LTTng) if you
> >>want more details.
> >
> >Okay the source for these numbers is in his paper for the OLS 2006: Volume
> >1 page 208-209? I do not see the exact number that you referred to there.
>
Hrm, the reference page number is wrong: it is in OLS 2006, Vol. 1 page
216 (section 4.5.2 Scalability). I originally pulled out the page number
from my local paper copy. oops.
> Nope, he was a direct co-author on the paper, was
> working here, and measured it.
>
> >He seems to be comparing spinlock acquire / release vs. cmpxchg. So I
> >guess you got your material from somewhere else?
> >
I ran a test specifically for this paper where I got this result
comparing the local irq enable/disable to local cmpxchg.
> >Also the cmpxchg used there is the lockless variant. cmpxchg 29 cycles w/o
> >lock prefix and 112 with lock prefix.
Yep, I voluntarily used the variant without the lock prefix because the
data is per cpu and I disable preemption.
> >
> >I see you reference another paper by Desnoyers:
> >http://tree.celinuxforum.org/CelfPubWiki/ELC2006Presentations?action=AttachFile&do=get&target=celf2006-desnoyers.pdf
> >
> >I do not see anything relevant there. Where did those numbers come from?
> >
> >The lockless cmpxchg is certainly an interesting idea. Certainly, for some
> >platforms I could disable preempt and then do a lockless cmpxchg.
>
Yes, preempt disabling or, eventually, the new thread migration
disabling I just proposed as an RFC on LKML. (that would make -rt people
happier)
> Mathieu, can you give some more details? Obviously the exact numbers
> will vary by architecture, machine size, etc., but it's a good point
> for discussion.
>
Sure, also note that the UP cmpxchg (see asm-$ARCH/local.h in 2.6.22) is
faster on architectures like powerpc and MIPS where it is possible to
remove some memory barriers.
See 2.6.22 Documentation/local_ops.txt for a thorough discussion. Don't
hesitate to ping me if you have more questions.
Regards,
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-09 21:44 ` Mathieu Desnoyers
@ 2007-07-09 21:55 ` Christoph Lameter
2007-07-09 22:58 ` Mathieu Desnoyers
0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-07-09 21:55 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm
On Mon, 9 Jul 2007, Mathieu Desnoyers wrote:
> > >Okay the source for these numbers is in his paper for the OLS 2006: Volume
> > >1 page 208-209? I do not see the exact number that you referred to there.
> >
>
> Hrm, the reference page number is wrong: it is in OLS 2006, Vol. 1 page
> 216 (section 4.5.2 Scalability). I originally pulled out the page number
> from my local paper copy. oops.
4.5.2 is on page 208 in my copy of the proceedings.
> > >He seems to be comparing spinlock acquire / release vs. cmpxchg. So I
> > >guess you got your material from somewhere else?
> > >
>
> I ran a test specifically for this paper where I got this result
> comparing the local irq enable/disable to local cmpxchg.
The numbers are pretty important and suggest that we can obtain
a significant speed increase by avoiding local irq disable/enable in the slab
allocator fast paths. Do you have some more numbers? Any other publication that
mentions these?
> Yep, I voluntarily used the variant without the lock prefix because the
> data is per cpu and I disable preemption.
local_cmpxchg generates this?
> Yes, preempt disabling or, eventually, the new thread migration
> disabling I just proposed as an RFC on LKML. (that would make -rt people
> happier)
Right.
> Sure, also note that the UP cmpxchg (see asm-$ARCH/local.h in 2.6.22) is
> faster on architectures like powerpc and MIPS where it is possible to
> remove some memory barriers.
UP cmpxchg meaning local_cmpxchg?
> See 2.6.22 Documentation/local_ops.txt for a thorough discussion. Don't
> hesitate to ping me if you have more questions.
That is pretty thin and does not mention atomic_cmpxchg. You may want to
expand on your ideas a bit.
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-09 21:55 ` Christoph Lameter
@ 2007-07-09 22:58 ` Mathieu Desnoyers
2007-07-09 23:08 ` Christoph Lameter
2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
0 siblings, 2 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2007-07-09 22:58 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm
* Christoph Lameter (clameter@sgi.com) wrote:
> On Mon, 9 Jul 2007, Mathieu Desnoyers wrote:
>
> > > >He seems to be comparing spinlock acquire / release vs. cmpxchg. So I
> > > >guess you got your material from somewhere else?
> > > >
> >
> > I ran a test specifically for this paper where I got this result
> > comparing the local irq enable/disable to local cmpxchg.
>
>
> The numbers are pretty important and suggest that we can obtain
> a significant speed increase by avoiding local irq disable/enable in the slab
> allocator fast paths. Do you have some more numbers? Any other publication that
> mentions these?
>
The original publication in which I released the idea was my LTTng paper
at OLS 2006. Outside of this, I have not found any other paper that talks about
this idea.
The test code is basically just disabling interrupts, reading the TSC
at the beginning and end, and doing 20000 loops of local_cmpxchg. I can
send you the code if you want it.
>
> > Yep, I voluntarily used the variant without the lock prefix because the
> > data is per cpu and I disable preemption.
>
> local_cmpxchg generates this?
>
Yes.
> > Yes, preempt disabling or, eventually, the new thread migration
> > disabling I just proposed as an RFC on LKML. (that would make -rt people
> > happier)
>
> Right.
>
> > Sure, also note that the UP cmpxchg (see asm-$ARCH/local.h in 2.6.22) is
> > faster on architectures like powerpc and MIPS where it is possible to
> > remove some memory barriers.
>
> UP cmpxchg meaning local_cmpxchg?
>
Yes.
> > See 2.6.22 Documentation/local_ops.txt for a thorough discussion. Don't
> > hesitate to ping me if you have more questions.
>
> That is pretty thin and does not mention atomic_cmpxchg. You may want to
> expand on your ideas a bit.
Sure, the idea goes as follows: if you have a per cpu variable that needs
to be concurrently modified in a coherent manner by any context (NMI,
irq, bh, process) running on the given CPU, you only need to use an
operation that is atomic with respect to the given CPU. You just have to make sure that
only this CPU will modify the variable (therefore, you must disable
preemption around modification) and you have to make sure that the
read-side, which can come from any CPU, is accessing this variable
atomically. Also, you have to be aware that the read-side might see an
older version of the other cpu's value because there is no SMP write
memory barrier involved. The value, however, will always be up to date
if the variable is read from the "local" CPU.
What applies to local_inc, given as example in the local_ops.txt
document, applies integrally to local_cmpxchg. And I would say that
local_cmpxchg is by far the cheapest locking mechanism I have found, and
use today, for my kernel tracer. The idea emerged from my need to trace
every execution context, including NMIs, while still providing good
performances. local_cmpxchg was the perfect fit; that's why I deployed
it in local.h in each and every architecture.
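To make this concrete, here is a minimal, illustrative sketch of the
pattern (untested, with made-up names, not the tracer code): a per-cpu
counter that any context on the local CPU, including NMIs, can bump. A
plain counter could of course just use local_inc(); cmpxchg is shown
here because it is what the discussion is about.

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <asm/local.h>

static DEFINE_PER_CPU(local_t, sample_count);

static void sample_hit(void)
{
	local_t *count;
	long old;

	preempt_disable();		/* make sure we stay on this CPU */
	count = &__get_cpu_var(sample_count);
	do {
		old = local_read(count);
		/* atomic only wrt the local CPU: no lock prefix, no barriers */
	} while (local_cmpxchg(count, old, old + 1) != old);
	preempt_enable();
}

A reader on another CPU would simply local_read() that CPU's copy and may
see a slightly stale value, exactly as described above.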
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-09 22:58 ` Mathieu Desnoyers
@ 2007-07-09 23:08 ` Christoph Lameter
2007-07-10 5:16 ` [PATCH] x86_64 - Use non locked version for local_cmpxchg() Mathieu Desnoyers
2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
1 sibling, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-07-09 23:08 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm
On Mon, 9 Jul 2007, Mathieu Desnoyers wrote:
> > > Yep, I voluntarily used the variant without the lock prefix because the
> > > data is per cpu and I disable preemption.
> >
> > local_cmpxchg generates this?
> >
>
> Yes.
Does not work here. If I use
static void __always_inline *slab_alloc(struct kmem_cache *s,
gfp_t gfpflags, int node, void *addr)
{
void **object;
struct kmem_cache_cpu *c;
preempt_disable();
c = get_cpu_slab(s, smp_processor_id());
redo:
object = c->freelist;
if (unlikely(!object || !node_match(c, node)))
return __slab_alloc(s, gfpflags, node, addr, c);
if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object)
goto redo;
preempt_enable();
if (unlikely((gfpflags & __GFP_ZERO)))
memset(object, 0, c->objsize);
return object;
}
Then the code will include a lock prefix:
3270: 48 8b 1a mov (%rdx),%rbx
3273: 48 85 db test %rbx,%rbx
3276: 74 23 je 329b <kmem_cache_alloc+0x4b>
3278: 8b 42 14 mov 0x14(%rdx),%eax
327b: 4c 8b 0c c3 mov (%rbx,%rax,8),%r9
327f: 48 89 d8 mov %rbx,%rax
3282: f0 4c 0f b1 0a lock cmpxchg %r9,(%rdx)
3287: 48 39 c3 cmp %rax,%rbx
328a: 75 e4 jne 3270 <kmem_cache_alloc+0x20>
328c: 66 85 f6 test %si,%si
328f: 78 19 js 32aa <kmem_cache_alloc+0x5a>
3291: 48 89 d8 mov %rbx,%rax
3294: 48 83 c4 08 add $0x8,%rsp
3298: 5b pop %rbx
3299: c9 leaveq
329a: c3 retq
> What applies to local_inc, given as example in the local_ops.txt
> document, applies integrally to local_cmpxchg. And I would say that
> local_cmpxchg is by far the cheapest locking mechanism I have found, and
> use today, for my kernel tracer. The idea emerged from my need to trace
> every execution context, including NMIs, while still providing good
> performances. local_cmpxchg was the perfect fit; that's why I deployed
> it in local.h in each and every architecture.
Great idea. The SLUB allocator may be able to use your idea to improve
both the alloc and free path.
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-09 22:58 ` Mathieu Desnoyers
2007-07-09 23:08 ` Christoph Lameter
@ 2007-07-10 0:55 ` Christoph Lameter
2007-07-10 8:27 ` Mathieu Desnoyers
2007-08-13 22:18 ` Mathieu Desnoyers
1 sibling, 2 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-07-10 0:55 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller
Ok here is a replacement patch for the cmpxchg patch. Problems
1. cmpxchg_local is not available on all arches. If we wanted to do
this then it needs to be universally available.
2. cmpxchg_local does generate the "lock" prefix. It should not do that.
Without fixes to cmpxchg_local we cannot expect maximum performance.
3. The approach is x86 centric. It relies on a cmpxchg that does not
synchronize with memory used by other cpus and therefore is more
lightweight. As far as I know the IA64 cmpxchg cannot do that.
Neither can several other processors. I am not sure how cmpxchg-less
platforms would use that. We need a detailed comparison of
interrupt enable /disable vs. cmpxchg cycle counts for cachelines in
the cpu cache to evaluate the impact that such a change would have.
The cmpxchg (or its emulation) does not need any barriers since the
accesses can only come from a single processor.
Mathieu measured a significant performance benefit coming from not using
interrupt enable / disable.
Some rough processor cycle counts (anyone have better numbers?)
STI CLI CMPXCHG
IA32 36 26 1 (assume XCHG == CMPXCHG, sti/cli also need stack pushes/pulls)
IA64 12 12 1 (but ar.ccv needs 11 cycles to set comparator,
need register moves to preserve processors flags)
Looks like STI/CLI is pretty expensive and it seems that we may be able to
optimize the alloc / free hotpath quite a bit if we could drop the
interrupt enable / disable. But we need some measurements.
Draft of a new patch:
SLUB: Single atomic instruction alloc/free using cmpxchg_local
A cmpxchg allows us to avoid disabling and enabling interrupts. The cmpxchg
is optimal to allow operations on per cpu freelist. We can stay on one
processor by disabling preemption() and allowing concurrent interrupts
thus avoiding the overhead of disabling and enabling interrupts.
Pro:
- No need to disable interrupts.
- Preempt disable/enable vanishes on non-preempt kernels
Con:
- Slightly more complex handling.
- Updates to atomic instructions needed
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slub.c | 72 ++++++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 49 insertions(+), 23 deletions(-)
Index: linux-2.6.22-rc6-mm1/mm/slub.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slub.c 2007-07-09 15:04:46.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slub.c 2007-07-09 17:09:00.000000000 -0700
@@ -1467,12 +1467,14 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
struct page *new;
+ unsigned long flags;
+ local_irq_save(flags);
if (!c->page)
goto new_slab;
slab_lock(c->page);
- if (unlikely(!node_match(c, node)))
+ if (unlikely(!node_match(c, node) || c->freelist))
goto another_slab;
load_freelist:
object = c->page->freelist;
@@ -1486,7 +1488,14 @@ load_freelist:
c->page->inuse = s->objects;
c->page->freelist = NULL;
c->node = page_to_nid(c->page);
+out:
slab_unlock(c->page);
+ local_irq_restore(flags);
+ preempt_enable();
+
+ if (unlikely((gfpflags & __GFP_ZERO)))
+ memset(object, 0, c->objsize);
+
return object;
another_slab:
@@ -1527,6 +1536,8 @@ new_slab:
c->page = new;
goto load_freelist;
}
+ local_irq_restore(flags);
+ preempt_enable();
return NULL;
debug:
c->freelist = NULL;
@@ -1536,8 +1547,7 @@ debug:
c->page->inuse++;
c->page->freelist = object[c->offset];
- slab_unlock(c->page);
- return object;
+ goto out;
}
/*
@@ -1554,23 +1564,20 @@ static void __always_inline *slab_alloc(
gfp_t gfpflags, int node, void *addr)
{
void **object;
- unsigned long flags;
struct kmem_cache_cpu *c;
- local_irq_save(flags);
+ preempt_disable();
c = get_cpu_slab(s, smp_processor_id());
- if (unlikely(!c->page || !c->freelist ||
- !node_match(c, node)))
+redo:
+ object = c->freelist;
+ if (unlikely(!object || !node_match(c, node)))
+ return __slab_alloc(s, gfpflags, node, addr, c);
- object = __slab_alloc(s, gfpflags, node, addr, c);
+ if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object)
+ goto redo;
- else {
- object = c->freelist;
- c->freelist = object[c->offset];
- }
- local_irq_restore(flags);
-
- if (unlikely((gfpflags & __GFP_ZERO) && object))
+ preempt_enable();
+ if (unlikely((gfpflags & __GFP_ZERO)))
memset(object, 0, c->objsize);
return object;
@@ -1603,7 +1610,9 @@ static void __slab_free(struct kmem_cach
{
void *prior;
void **object = (void *)x;
+ unsigned long flags;
+ local_irq_save(flags);
slab_lock(page);
if (unlikely(SlabDebug(page)))
@@ -1629,6 +1638,8 @@ checks_ok:
out_unlock:
slab_unlock(page);
+ local_irq_restore(flags);
+ preempt_enable();
return;
slab_empty:
@@ -1639,6 +1650,8 @@ slab_empty:
remove_partial(s, page);
slab_unlock(page);
+ local_irq_restore(flags);
+ preempt_enable();
discard_slab(s, page);
return;
@@ -1663,18 +1676,31 @@ static void __always_inline slab_free(st
struct page *page, void *x, void *addr)
{
void **object = (void *)x;
- unsigned long flags;
struct kmem_cache_cpu *c;
+ void **freelist;
- local_irq_save(flags);
+ preempt_disable();
c = get_cpu_slab(s, smp_processor_id());
- if (likely(page == c->page && c->freelist)) {
- object[c->offset] = c->freelist;
- c->freelist = object;
- } else
- __slab_free(s, page, x, addr, c->offset);
+redo:
+ freelist = c->freelist;
+ /*
+ * Must read freelist before c->page. If an interrupt occurs and
+ * changes c->page after we have read it here then it
+ * will also have changed c->freelist and the cmpxchg will fail.
+ *
+ * If we would have checked c->page first then the freelist could
+ * have been changed under us before we read c->freelist and we
+ * would not be able to detect that situation.
+ */
+ smp_rmb();
+ if (unlikely(page != c->page || !freelist))
+ return __slab_free(s, page, x, addr, c->offset);
+
+ object[c->offset] = freelist;
+ if (cmpxchg_local(&c->freelist, freelist, object) != freelist)
+ goto redo;
- local_irq_restore(flags);
+ preempt_enable();
}
void kmem_cache_free(struct kmem_cache *s, void *x)
* [PATCH] x86_64 - Use non locked version for local_cmpxchg()
2007-07-09 23:08 ` Christoph Lameter
@ 2007-07-10 5:16 ` Mathieu Desnoyers
2007-07-10 20:46 ` Christoph Lameter
0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2007-07-10 5:16 UTC (permalink / raw)
To: akpm; +Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm
You are completely right: on x86_64, a bit got lost in the move to
cmpxchg.h, here is the fix. It applies on 2.6.22-rc6-mm1.
x86_64 - Use non locked version for local_cmpxchg()
local_cmpxchg() should not use any LOCK prefix. This change probably got lost in
the move to cmpxchg.h.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
include/asm-x86_64/cmpxchg.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6-lttng/include/asm-x86_64/cmpxchg.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86_64/cmpxchg.h 2007-07-10 01:10:10.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86_64/cmpxchg.h 2007-07-10 01:11:03.000000000 -0400
@@ -128,7 +128,7 @@
((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
(unsigned long)(n),sizeof(*(ptr))))
#define cmpxchg_local(ptr,o,n)\
- ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
+ ((__typeof__(*(ptr)))__cmpxchg_local((ptr),(unsigned long)(o),\
(unsigned long)(n),sizeof(*(ptr))))
#endif
* Christoph Lameter (clameter@sgi.com) wrote:
> On Mon, 9 Jul 2007, Mathieu Desnoyers wrote:
>
> > > > Yep, I voluntarily used the variant without the lock prefix because the
> > > > data is per cpu and I disable preemption.
> > >
> > > local_cmpxchg generates this?
> > >
> >
> > Yes.
>
> Does not work here. If I use
>
> static void __always_inline *slab_alloc(struct kmem_cache *s,
> gfp_t gfpflags, int node, void *addr)
> {
> void **object;
> struct kmem_cache_cpu *c;
>
> preempt_disable();
> c = get_cpu_slab(s, smp_processor_id());
> redo:
> object = c->freelist;
> if (unlikely(!object || !node_match(c, node)))
> return __slab_alloc(s, gfpflags, node, addr, c);
>
> if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object)
> goto redo;
>
> preempt_enable();
> if (unlikely((gfpflags & __GFP_ZERO)))
> memset(object, 0, c->objsize);
>
> return object;
> }
>
> Then the code will include a lock prefix:
>
> 3270: 48 8b 1a mov (%rdx),%rbx
> 3273: 48 85 db test %rbx,%rbx
> 3276: 74 23 je 329b <kmem_cache_alloc+0x4b>
> 3278: 8b 42 14 mov 0x14(%rdx),%eax
> 327b: 4c 8b 0c c3 mov (%rbx,%rax,8),%r9
> 327f: 48 89 d8 mov %rbx,%rax
> 3282: f0 4c 0f b1 0a lock cmpxchg %r9,(%rdx)
> 3287: 48 39 c3 cmp %rax,%rbx
> 328a: 75 e4 jne 3270 <kmem_cache_alloc+0x20>
> 328c: 66 85 f6 test %si,%si
> 328f: 78 19 js 32aa <kmem_cache_alloc+0x5a>
> 3291: 48 89 d8 mov %rbx,%rax
> 3294: 48 83 c4 08 add $0x8,%rsp
> 3298: 5b pop %rbx
> 3299: c9 leaveq
> 329a: c3 retq
>
>
> > What applies to local_inc, given as example in the local_ops.txt
> > document, applies integrally to local_cmpxchg. And I would say that
> > local_cmpxchg is by far the cheapest locking mechanism I have found, and
> > use today, for my kernel tracer. The idea emerged from my need to trace
> > every execution context, including NMIs, while still providing good
> > performances. local_cmpxchg was the perfect fit; that's why I deployed
> > it in local.h in each and every architecture.
>
> Great idea. The SLUB allocator may be able to use your idea to improve
> both the alloc and free path.
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-09 16:08 ` [patch 09/10] Remove the SLOB allocator for 2.6.23 Christoph Lameter
@ 2007-07-10 8:17 ` Pekka J Enberg
2007-07-10 8:27 ` Nick Piggin
0 siblings, 1 reply; 26+ messages in thread
From: Pekka J Enberg @ 2007-07-10 8:17 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Andrew Morton, Ingo Molnar, linux-kernel, linux-mm,
suresh.b.siddha, corey.d.gough, Matt Mackall, Denis Vlasenko,
Erik Andersen
Hi Christoph,
On Mon, 9 Jul 2007, Pekka Enberg wrote:
> > I assume with "slab external fragmentation" you mean allocating a
> > whole page for a slab when there are not enough objects to fill the
> > whole thing thus wasting memory? We could try to combat that by
> > packing multiple variable-sized slabs within a single page. Also,
> > adding some non-power-of-two kmalloc caches might help with internal
> > fragmentation.
On Mon, 9 Jul 2007, Christoph Lameter wrote:
> There are already non-power-of-two kmalloc caches for the 96 and 192 byte
> sizes.
I know that, but for my setup at least, there seems to be a need for a
non-power-of-two cache between 512 and 1024. What I am seeing is the average
allocation size for kmalloc-512 being around 270-280, which wastes a total
of 10 KB of memory due to internal fragmentation. It might be a buggy caller
that can be fixed with its own cache too.
On Mon, 9 Jul 2007, Pekka Enberg wrote:
> > In any case, SLUB needs some serious tuning for smaller machines
> > before we can get rid of SLOB.
On Mon, 9 Jul 2007, Christoph Lameter wrote:
> Switch off CONFIG_SLUB_DEBUG to get memory savings.
Curious, /proc/meminfo immediately after boot shows:
SLUB (debugging enabled):
(none):~# cat /proc/meminfo
MemTotal: 30260 kB
MemFree: 22096 kB
SLUB (debugging disabled):
(none):~# cat /proc/meminfo
MemTotal: 30276 kB
MemFree: 22244 kB
SLOB:
(none):~# cat /proc/meminfo
MemTotal: 30280 kB
MemFree: 22004 kB
That's 92 KB advantage for SLUB with debugging enabled and 240 KB when
debugging is disabled.
Nick, Matt, care to retest SLUB and SLOB for your setups?
Pekka
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-10 8:17 ` Pekka J Enberg
@ 2007-07-10 8:27 ` Nick Piggin
2007-07-10 9:31 ` Pekka Enberg
0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2007-07-10 8:27 UTC (permalink / raw)
To: Pekka J Enberg
Cc: Christoph Lameter, Andrew Morton, Ingo Molnar, linux-kernel,
linux-mm, suresh.b.siddha, corey.d.gough, Matt Mackall,
Denis Vlasenko, Erik Andersen
Pekka J Enberg wrote:
> Curious, /proc/meminfo immediately after boot shows:
>
> SLUB (debugging enabled):
>
> (none):~# cat /proc/meminfo
> MemTotal: 30260 kB
> MemFree: 22096 kB
>
> SLUB (debugging disabled):
>
> (none):~# cat /proc/meminfo
> MemTotal: 30276 kB
> MemFree: 22244 kB
>
> SLOB:
>
> (none):~# cat /proc/meminfo
> MemTotal: 30280 kB
> MemFree: 22004 kB
>
> That's 92 KB advantage for SLUB with debugging enabled and 240 KB when
> debugging is disabled.
Interesting. What kernel version are you using?
> Nick, Matt, care to retest SLUB and SLOB for your setups?
I don't think there has been a significant change in the area of
memory efficiency in either since I last tested, and Christoph and
I both produced the same result.
I can't say where SLOB is losing its memory, but there are a few
places that can still be improved, so I might get keen and take
another look at it once all the improvements to both allocators
get upstream.
--
SUSE Labs, Novell Inc.
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
@ 2007-07-10 8:27 ` Mathieu Desnoyers
2007-07-10 18:38 ` Christoph Lameter
2007-07-10 20:59 ` Mathieu Desnoyers
2007-08-13 22:18 ` Mathieu Desnoyers
1 sibling, 2 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2007-07-10 8:27 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller
* Christoph Lameter (clameter@sgi.com) wrote:
> Ok here is a replacement patch for the cmpxchg patch. Problems
>
> 1. cmpxchg_local is not available on all arches. If we wanted to do
> this then it needs to be universally available.
>
cmpxchg_local is not available on all archs, but local_cmpxchg is. It
expects a local_t type, which is nothing other than a long. When the local
atomic operation is not more efficient or not implemented on a given
architecture, asm-generic/local.h falls back on atomic_long_t. If you
want, you could work on the local_t type, which you could cast from a
long to a pointer when you need to, since their sizes are, AFAIK, always
the same (and some VM code even assumes this is always the case).
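Roughly what I have in mind, as a purely illustrative sketch (hypothetical
field and helper names, not the actual SLUB code; the caller is assumed to
have preemption disabled, and it relies on sizeof(long) == sizeof(void *)
as noted above):

struct kmem_cache_cpu_sketch {
	local_t freelist;		/* really holds an object pointer */
	unsigned int offset;		/* free pointer offset within the object */
};

/* caller must have preemption disabled */
static void *freelist_pop(struct kmem_cache_cpu_sketch *c)
{
	void **object;

	do {
		object = (void **)local_read(&c->freelist);
		if (!object)
			return NULL;	/* let the caller take the slow path */
	} while (local_cmpxchg(&c->freelist, (long)object,
			       (long)object[c->offset]) != (long)object);
	return object;
}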
> 2. cmpxchg_local does generate the "lock" prefix. It should not do that.
> Without fixes to cmpxchg_local we cannot expect maximum performance.
>
Yup, see the patch I just posted for this.
> 3. The approach is x86 centric. It relies on a cmpxchg that does not
> synchronize with memory used by other cpus and therefore is more
> lightweight. As far as I know the IA64 cmpxchg cannot do that.
> Neither can several other processors. I am not sure how cmpxchg-less
> platforms would use that. We need a detailed comparison of
> interrupt enable /disable vs. cmpxchg cycle counts for cachelines in
> the cpu cache to evaluate the impact that such a change would have.
>
> The cmpxchg (or its emulation) does not need any barriers since the
> accesses can only come from a single processor.
>
Yes, the expected improvements go as follows:
x86, x86_64 : much faster due to non-LOCKed cmpxchg
alpha: should be faster due to memory barrier removal
mips: memory barriers removed
powerpc 32/64: memory barriers removed
On other architectures, either there is no better implementation than
the standard atomic cmpxchg or it just has not been implemented.
I guess that a test series telling us how much improvement is
seen on the optimized architectures (local cmpxchg vs interrupt
enable/disable), and also what effect the standard cmpxchg has compared
to interrupt disable/enable on the architectures where we can't do
better than the standard cmpxchg, will tell us whether it is an
interesting way to go. I would be happy to do these tests, but I don't have the
hardware handy. I provide a test module to get these characteristics
from various architectures in this email.
> Mathieu measured a significant performance benefit coming from not using
> interrupt enable / disable.
>
> Some rough processor cycle counts (anyone have better numbers?)
>
> STI CLI CMPXCHG
> IA32 36 26 1 (assume XCHG == CMPXCHG, sti/cli also need stack pushes/pulls)
> IA64 12 12 1 (but ar.ccv needs 11 cycles to set comparator,
> need register moves to preserve processors flags)
>
The measurements I get (in cycles):
enable interrupts (STI) disable interrupts (CLI) local CMPXCHG
IA32 (P4) 112 82 26
x86_64 AMD64 125 102 19
> Looks like STI/CLI is pretty expensive and it seems that we may be able to
> optimize the alloc / free hotpath quite a bit if we could drop the
> interrupt enable / disable. But we need some measurements.
>
>
> Draft of a new patch:
>
> SLUB: Single atomic instruction alloc/free using cmpxchg_local
>
> A cmpxchg allows us to avoid disabling and enabling interrupts. The cmpxchg
> is optimal to allow operations on per cpu freelist. We can stay on one
> processor by disabling preemption() and allowing concurrent interrupts
> thus avoiding the overhead of disabling and enabling interrupts.
>
> Pro:
> - No need to disable interrupts.
> - Preempt disable/enable vanishes on non-preempt kernels
> Con:
> - Slightly more complex handling.
> - Updates to atomic instructions needed
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
Test local cmpxchg vs int disable/enable. Please run on a 2.6.22 kernel
(or recent 2.6.21-rcX-mmX) (with my cmpxchg local fix patch for x86_64).
Make sure the TSC reads (get_cycles()) are reliable on your platform.
Mathieu
/* test-cmpxchg-nolock.c
*
* Compare local cmpxchg with irq disable / enable.
*/
#include <linux/jiffies.h>
#include <linux/compiler.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/calc64.h>
#include <asm/timex.h>
#include <asm/system.h>
#define NR_LOOPS 20000
int test_val = 0;
static void do_test_cmpxchg(void)
{
int ret;
long flags;
unsigned int i;
cycles_t time1, time2, time;
long rem;
local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for (i = 0; i < NR_LOOPS; i++) {
ret = cmpxchg_local(&test_val, 0, 0);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;
printk(KERN_ALERT "test results: time for non locked cmpxchg\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> non locked cmpxchg takes %llu cycles\n", time);
printk(KERN_ALERT "test end\n");
}
/*
* This test will have a higher standard deviation due to incoming interrupts.
*/
static void do_test_enable_int(void)
{
long flags;
unsigned int i;
cycles_t time1, time2, time;
long rem;
local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for (i = 0; i < NR_LOOPS; i++) {
local_irq_restore(flags);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;
printk(KERN_ALERT "test results: time for enabling interrupts (STI)\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> enabling interrupts (STI) takes %llu cycles\n",
time);
printk(KERN_ALERT "test end\n");
}
static void do_test_disable_int(void)
{
unsigned long flags, flags2;
unsigned int i;
cycles_t time1, time2, time;
long rem;
local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for ( i = 0; i < NR_LOOPS; i++) {
local_irq_save(flags2);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;
printk(KERN_ALERT "test results: time for disabling interrupts (CLI)\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> disabling interrupts (CLI) takes %llu cycles\n",
time);
printk(KERN_ALERT "test end\n");
}
static int ltt_test_init(void)
{
printk(KERN_ALERT "test init\n");
do_test_cmpxchg();
do_test_enable_int();
do_test_disable_int();
return -EAGAIN; /* Fail will directly unload the module */
}
static void ltt_test_exit(void)
{
printk(KERN_ALERT "test exit\n");
}
module_init(ltt_test_init)
module_exit(ltt_test_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("Cmpxchg local test");
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-10 8:27 ` Nick Piggin
@ 2007-07-10 9:31 ` Pekka Enberg
2007-07-10 10:09 ` Nick Piggin
2007-07-10 12:02 ` Matt Mackall
0 siblings, 2 replies; 26+ messages in thread
From: Pekka Enberg @ 2007-07-10 9:31 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Andrew Morton, Ingo Molnar, linux-kernel,
linux-mm, suresh.b.siddha, corey.d.gough, Matt Mackall,
Denis Vlasenko, Erik Andersen
Hi Nick,
Pekka J Enberg wrote:
> > That's 92 KB advantage for SLUB with debugging enabled and 240 KB when
> > debugging is disabled.
On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Interesting. What kernel version are you using?
Linus' git head from yesterday so the results are likely to be
sensitive to workload and mine doesn't represent real embedded use.
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-10 9:31 ` Pekka Enberg
@ 2007-07-10 10:09 ` Nick Piggin
2007-07-10 12:02 ` Matt Mackall
1 sibling, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2007-07-10 10:09 UTC (permalink / raw)
To: Pekka Enberg
Cc: Christoph Lameter, Andrew Morton, Ingo Molnar, linux-kernel,
linux-mm, suresh.b.siddha, corey.d.gough, Matt Mackall,
Denis Vlasenko, Erik Andersen
Pekka Enberg wrote:
> Hi Nick,
>
> Pekka J Enberg wrote:
>
>> > That's 92 KB advantage for SLUB with debugging enabled and 240 KB when
>> > debugging is disabled.
>
>
> On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> Interesting. What kernel version are you using?
>
>
> Linus' git head from yesterday so the results are likely to be
> sensitive to workload and mine doesn't represent real embedded use.
Hi Pekka,
There is one thing that the SLOB patches in -mm do besides result in
slightly better packing and memory efficiency (which might be unlikely
to explain the difference you are seeing), and that is that they do
away with the delayed freeing of unused SLOB pages back to the page
allocator.
In git head, these pages are freed via a timer so they can take a
while to make their way back to the buddy allocator so they don't
register as free memory as such.
Anyway, I would be very interested to see any situation where the
SLOB in -mm uses more memory than SLUB, even on test configs like
yours.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-10 9:31 ` Pekka Enberg
2007-07-10 10:09 ` Nick Piggin
@ 2007-07-10 12:02 ` Matt Mackall
2007-07-10 12:57 ` Pekka J Enberg
2007-07-10 22:12 ` Christoph Lameter
1 sibling, 2 replies; 26+ messages in thread
From: Matt Mackall @ 2007-07-10 12:02 UTC (permalink / raw)
To: Pekka Enberg
Cc: Nick Piggin, Christoph Lameter, Andrew Morton, Ingo Molnar,
linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
Denis Vlasenko, Erik Andersen
On Tue, Jul 10, 2007 at 12:31:40PM +0300, Pekka Enberg wrote:
> Hi Nick,
>
> Pekka J Enberg wrote:
> >> That's 92 KB advantage for SLUB with debugging enabled and 240 KB when
> >> debugging is disabled.
>
> On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >Interesting. What kernel version are you using?
>
> Linus' git head from yesterday so the results are likely to be
> sensitive to workload and mine doesn't represent real embedded use.
Using 2.6.22-rc6-mm1 with a 64MB lguest and busybox, I'm seeing the
following as the best MemFree numbers after several boots each:
SLAB: 54796
SLOB: 55044
SLUB: 53944
SLUB: 54788 (debug turned off)
These numbers bounce around a lot more from boot to boot than I
remember, so take these numbers with a grain of salt.
Disabling the debug code in the build gives this, by the way:
mm/slub.c: In function 'init_kmem_cache_node':
mm/slub.c:1873: error: 'struct kmem_cache_node' has no member named
'full'
--
Mathematics is the supreme nostalgia of our time.
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-10 12:02 ` Matt Mackall
@ 2007-07-10 12:57 ` Pekka J Enberg
2007-07-10 22:12 ` Christoph Lameter
1 sibling, 0 replies; 26+ messages in thread
From: Pekka J Enberg @ 2007-07-10 12:57 UTC (permalink / raw)
To: Matt Mackall
Cc: Nick Piggin, Christoph Lameter, Andrew Morton, Ingo Molnar,
linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
Denis Vlasenko, Erik Andersen
Hi Matt,
On Tue, 10 Jul 2007, Matt Mackall wrote:
> Using 2.6.22-rc6-mm1 with a 64MB lguest and busybox, I'm seeing the
> following as the best MemFree numbers after several boots each:
>
> SLAB: 54796
> SLOB: 55044
> SLUB: 53944
> SLUB: 54788 (debug turned off)
>
> These numbers bounce around a lot more from boot to boot than I
> remember, so take these numbers with a grain of salt.
To rule out userland, 2.6.22 with 32 MB defconfig UML and busybox [1] on
i386:
SLOB: 26708
SLUB: 27212 (no debug)
Unfortunately UML is broken in 2.6.22-rc6-mm1, so I don't know if SLOB
patches help there.
1. http://uml.nagafix.co.uk/BusyBox-1.5.0/BusyBox-1.5.0-x86-root_fs.bz2
Pekka
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-10 8:27 ` Mathieu Desnoyers
@ 2007-07-10 18:38 ` Christoph Lameter
2007-07-10 20:59 ` Mathieu Desnoyers
1 sibling, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-07-10 18:38 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller
On Tue, 10 Jul 2007, Mathieu Desnoyers wrote:
> cmpxchg_local is not available on all archs, but local_cmpxchg is. It
> expects a local_t type, which is nothing other than a long. When the local
> atomic operation is not more efficient or not implemented on a given
> architecture, asm-generic/local.h falls back on atomic_long_t. If you
> want, you could work on the local_t type, which you could cast from a
> long to a pointer when you need to, since their sizes are, AFAIK, always
> the same (and some VM code even assumes this is always the case).
It would be cleaner to have cmpxchg_local on all arches. The type
conversion is hacky. If this is really working then we should also use the
mechanism for other things like the vm statistics.
> The measurements I get (in cycles):
>
> enable interrupts (STI) disable interrupts (CLI) local CMPXCHG
> IA32 (P4) 112 82 26
> x86_64 AMD64 125 102 19
Looks good and seems to indicate that we can at least double the speed of
slab allocation.
* Re: [PATCH] x86_64 - Use non locked version for local_cmpxchg()
2007-07-10 5:16 ` [PATCH] x86_64 - Use non locked version for local_cmpxchg() Mathieu Desnoyers
@ 2007-07-10 20:46 ` Christoph Lameter
0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-07-10 20:46 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: akpm, Martin Bligh, Andi Kleen, linux-kernel, linux-mm
On Tue, 10 Jul 2007, Mathieu Desnoyers wrote:
> You are completely right: on x86_64, a bit got lost in the move to
> cmpxchg.h, here is the fix. It applies on 2.6.22-rc6-mm1.
A trivial fix. Make sure that it gets merged soon.
Acked-by: Christoph Lameter <clameter@sgi.com>
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-10 8:27 ` Mathieu Desnoyers
2007-07-10 18:38 ` Christoph Lameter
@ 2007-07-10 20:59 ` Mathieu Desnoyers
1 sibling, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2007-07-10 20:59 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller,
Alexandre Guédon
Another architecture tested
Comparison: irq enable/disable vs local CMPXCHG
enable interrupts (STI) disable interrupts (CLI) local CMPXCHG
Tested-by: Mathieu Desnoyers <compudj@krystal.dyndns.org>
IA32 (P4) 112 82 26
x86_64 AMD64 125 102 19
Tested-by: Alexandre Guedon <totalworlddomination@gmail.com>
x86_64 Intel Core2 Quad 21 19 7
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-10 12:02 ` Matt Mackall
2007-07-10 12:57 ` Pekka J Enberg
@ 2007-07-10 22:12 ` Christoph Lameter
2007-07-10 22:40 ` Matt Mackall
1 sibling, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-07-10 22:12 UTC (permalink / raw)
To: Matt Mackall
Cc: Pekka Enberg, Nick Piggin, Andrew Morton, Ingo Molnar,
linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
Denis Vlasenko, Erik Andersen
On Tue, 10 Jul 2007, Matt Mackall wrote:
> following as the best MemFree numbers after several boots each:
>
> SLAB: 54796
> SLOB: 55044
> SLUB: 53944
> SLUB: 54788 (debug turned off)
That was without "slub_debug" as a parameter or with !CONFIG_SLUB_DEBUG?
Data size and code size will decrease if you compile with
!CONFIG_SLUB_DEBUG. slub_debug on the command line governs whether debug
information is used.
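Roughly, and going from memory of Documentation/vm/slub.txt (double-check
the exact syntax there):

	# Build time: compile the debug fields and code out entirely
	CONFIG_SLUB_DEBUG=n

	# Boot time (only meaningful with CONFIG_SLUB_DEBUG=y); debugging
	# stays off unless requested on the kernel command line, e.g.
	slub_debug		# full debugging for all caches
	slub_debug=FZ,dentry	# sanity checks + red zoning for dentry only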
> These numbers bounce around a lot more from boot to boot than I
> remember, so take these numbers with a grain of salt.
>
> Disabling the debug code in the build gives this, by the way:
>
> mm/slub.c: In function 'init_kmem_cache_node':
> mm/slub.c:1873: error: 'struct kmem_cache_node' has no member named
> 'full'
A fix for that is in Andrew's tree.
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-10 22:12 ` Christoph Lameter
@ 2007-07-10 22:40 ` Matt Mackall
2007-07-10 22:50 ` Christoph Lameter
0 siblings, 1 reply; 26+ messages in thread
From: Matt Mackall @ 2007-07-10 22:40 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, Nick Piggin, Andrew Morton, Ingo Molnar,
linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
Denis Vlasenko, Erik Andersen
On Tue, Jul 10, 2007 at 03:12:38PM -0700, Christoph Lameter wrote:
> On Tue, 10 Jul 2007, Matt Mackall wrote:
>
> > following as the best MemFree numbers after several boots each:
> >
> > SLAB: 54796
> > SLOB: 55044
> > SLUB: 53944
> > SLUB: 54788 (debug turned off)
>
> That was without "slub_debug" as a parameter or with !CONFIG_SLUB_DEBUG?
Without the parameter, as the other way doesn't compile in -mm1.
--
Mathematics is the supreme nostalgia of our time.
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
2007-07-10 22:40 ` Matt Mackall
@ 2007-07-10 22:50 ` Christoph Lameter
0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-07-10 22:50 UTC (permalink / raw)
To: Matt Mackall
Cc: Pekka Enberg, Nick Piggin, Andrew Morton, Ingo Molnar,
linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
Denis Vlasenko, Erik Andersen
On Tue, 10 Jul 2007, Matt Mackall wrote:
> Without the parameter, as the other way doesn't compile in -mm1.
here is the patch that went into mm after mm1 was released.
---
mm/slub.c | 4 ++++
1 file changed, 4 insertions(+)
Index: linux-2.6.22-rc6-mm1/mm/slub.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slub.c 2007-07-06 13:28:57.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slub.c 2007-07-06 13:29:01.000000000 -0700
@@ -1868,7 +1868,9 @@ static void init_kmem_cache_node(struct
atomic_long_set(&n->nr_slabs, 0);
spin_lock_init(&n->list_lock);
INIT_LIST_HEAD(&n->partial);
+#ifdef CONFIG_SLUB_DEBUG
INIT_LIST_HEAD(&n->full);
+#endif
}
#ifdef CONFIG_NUMA
@@ -1898,8 +1900,10 @@ static struct kmem_cache_node * __init e
page->freelist = get_freepointer(kmalloc_caches, n);
page->inuse++;
kmalloc_caches->node[node] = n;
+#ifdef CONFIG_SLUB_DEBUG
init_object(kmalloc_caches, n, 1);
init_tracking(kmalloc_caches, n);
+#endif
init_kmem_cache_node(n);
atomic_long_inc(&n->nr_slabs);
add_partial(n, page);
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
2007-07-10 8:27 ` Mathieu Desnoyers
@ 2007-08-13 22:18 ` Mathieu Desnoyers
2007-08-13 22:28 ` Christoph Lameter
1 sibling, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2007-08-13 22:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller
Some review here. I think we could do much better.
* Christoph Lameter (clameter@sgi.com) wrote:
> Index: linux-2.6.22-rc6-mm1/mm/slub.c
> ===================================================================
> --- linux-2.6.22-rc6-mm1.orig/mm/slub.c 2007-07-09 15:04:46.000000000 -0700
> +++ linux-2.6.22-rc6-mm1/mm/slub.c 2007-07-09 17:09:00.000000000 -0700
> @@ -1467,12 +1467,14 @@ static void *__slab_alloc(struct kmem_ca
> {
> void **object;
> struct page *new;
> + unsigned long flags;
>
> + local_irq_save(flags);
> if (!c->page)
> goto new_slab;
>
> slab_lock(c->page);
> - if (unlikely(!node_match(c, node)))
> + if (unlikely(!node_match(c, node) || c->freelist))
> goto another_slab;
> load_freelist:
> object = c->page->freelist;
> @@ -1486,7 +1488,14 @@ load_freelist:
> c->page->inuse = s->objects;
> c->page->freelist = NULL;
> c->node = page_to_nid(c->page);
> +out:
> slab_unlock(c->page);
> + local_irq_restore(flags);
> + preempt_enable();
> +
> + if (unlikely((gfpflags & __GFP_ZERO)))
> + memset(object, 0, c->objsize);
> +
> return object;
>
> another_slab:
> @@ -1527,6 +1536,8 @@ new_slab:
> c->page = new;
> goto load_freelist;
> }
> + local_irq_restore(flags);
> + preempt_enable();
> return NULL;
> debug:
> c->freelist = NULL;
> @@ -1536,8 +1547,7 @@ debug:
>
> c->page->inuse++;
> c->page->freelist = object[c->offset];
> - slab_unlock(c->page);
> - return object;
> + goto out;
> }
>
> /*
> @@ -1554,23 +1564,20 @@ static void __always_inline *slab_alloc(
> gfp_t gfpflags, int node, void *addr)
> {
> void **object;
> - unsigned long flags;
> struct kmem_cache_cpu *c;
>
What if we prefetch c->freelist here? I see that in this diff the other
code just reads it sooner, as a condition for the if().
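Something along these lines is what I have in mind (an untested sketch
only, assuming the prefetch() hint from <linux/prefetch.h>; it shows just
the fast path of slab_alloc() as it looks after your patch):

	preempt_disable();
	c = get_cpu_slab(s, smp_processor_id());
	/* Hint: warm the first object's cacheline before we touch it below. */
	prefetch(c->freelist);
redo:
	object = c->freelist;
	if (unlikely(!object || !node_match(c, node)))
		return __slab_alloc(s, gfpflags, node, addr, c);

	if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object)
		goto redo;

I have not measured whether the prefetch actually helps; the dereference
of object[c->offset] in the cmpxchg path is what it would try to hide.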
> - local_irq_save(flags);
> + preempt_disable();
> c = get_cpu_slab(s, smp_processor_id());
> - if (unlikely(!c->page || !c->freelist ||
> - !node_match(c, node)))
> +redo:
> + object = c->freelist;
> + if (unlikely(!object || !node_match(c, node)))
> + return __slab_alloc(s, gfpflags, node, addr, c);
>
> - object = __slab_alloc(s, gfpflags, node, addr, c);
> + if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object)
> + goto redo;
>
> - else {
> - object = c->freelist;
> - c->freelist = object[c->offset];
> - }
> - local_irq_restore(flags);
> -
> - if (unlikely((gfpflags & __GFP_ZERO) && object))
> + preempt_enable();
> + if (unlikely((gfpflags & __GFP_ZERO)))
> memset(object, 0, c->objsize);
>
> return object;
> @@ -1603,7 +1610,9 @@ static void __slab_free(struct kmem_cach
> {
> void *prior;
> void **object = (void *)x;
> + unsigned long flags;
>
> + local_irq_save(flags);
> slab_lock(page);
>
> if (unlikely(SlabDebug(page)))
> @@ -1629,6 +1638,8 @@ checks_ok:
>
> out_unlock:
> slab_unlock(page);
> + local_irq_restore(flags);
> + preempt_enable();
> return;
>
> slab_empty:
> @@ -1639,6 +1650,8 @@ slab_empty:
> remove_partial(s, page);
>
> slab_unlock(page);
> + local_irq_restore(flags);
> + preempt_enable();
> discard_slab(s, page);
> return;
>
> @@ -1663,18 +1676,31 @@ static void __always_inline slab_free(st
> struct page *page, void *x, void *addr)
> {
> void **object = (void *)x;
> - unsigned long flags;
> struct kmem_cache_cpu *c;
> + void **freelist;
>
Prefetching c->freelist would also make sense here.
> - local_irq_save(flags);
> + preempt_disable();
> c = get_cpu_slab(s, smp_processor_id());
> - if (likely(page == c->page && c->freelist)) {
> - object[c->offset] = c->freelist;
> - c->freelist = object;
> - } else
> - __slab_free(s, page, x, addr, c->offset);
> +redo:
> + freelist = c->freelist;
I suspect this smp_rmb() may be the cause of a major slowdown.
Therefore, I think we should try taking a copy of c->page and simply
checking whether it has changed right after the cmpxchg_local:
page = c->page;
> + /*
> + * Must read freelist before c->page. If a interrupt occurs and
> + * changes c->page after we have read it here then it
> + * will also have changed c->freelist and the cmpxchg will fail.
> + *
> + * If we would have checked c->page first then the freelist could
> + * have been changed under us before we read c->freelist and we
> + * would not be able to detect that situation.
> + */
> + smp_rmb();
> + if (unlikely(page != c->page || !freelist))
> + return __slab_free(s, page, x, addr, c->offset);
> +
> + object[c->offset] = freelist;
-> +	if (cmpxchg_local(&c->freelist, freelist, object) != freelist)
+> +	if (cmpxchg_local(&c->freelist, freelist, object) != freelist
+> +			|| page != c->page)
> + goto redo;
>
Therefore, in the scenario where:
1 - c->page is read
2 - Interrupt comes, changes c->page and c->freelist
3 - c->freelist is read
4 - cmpxchg c->freelist succeeds
5 - Then, page != c->page, so we goto redo.
It also works if 4 and 5 are swapped.
I could test the modification if you point me to the kernel version it
should apply to. However, I don't have the same hardware you use.
By the way, the smp_rmb() barrier does not match its comment. If it is
_really_ protecting against reordering wrt interrupts, then it should be
a rmb(), not an smp_rmb() (because smp_rmb() still allows the reads to be
reordered on UP). But I think the best would be to drop the rmb()
entirely, as proposed above.
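Putting this together with your patch, the slab_free() fast path I am
proposing would look roughly like this (untested sketch, keeping your
names; the only changes are dropping the smp_rmb() and re-checking
c->page after the cmpxchg_local()):

	preempt_disable();
	c = get_cpu_slab(s, smp_processor_id());
redo:
	freelist = c->freelist;
	if (unlikely(page != c->page || !freelist))
		return __slab_free(s, page, x, addr, c->offset);

	object[c->offset] = freelist;
	/*
	 * If an interrupt changes c->page (and therefore c->freelist)
	 * after the reads above, either the cmpxchg_local() fails or the
	 * re-check of c->page catches it and we redo, so no rmb() is
	 * needed.
	 */
	if (cmpxchg_local(&c->freelist, freelist, object) != freelist
			|| page != c->page)
		goto redo;
	preempt_enable();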
Mathieu
> - local_irq_restore(flags);
> + preempt_enable();
> }
>
> void kmem_cache_free(struct kmem_cache *s, void *x)
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
2007-08-13 22:18 ` Mathieu Desnoyers
@ 2007-08-13 22:28 ` Christoph Lameter
0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-08-13 22:28 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller
On Mon, 13 Aug 2007, Mathieu Desnoyers wrote:
> > @@ -1554,23 +1564,20 @@ static void __always_inline *slab_alloc(
> > gfp_t gfpflags, int node, void *addr)
> > {
> > void **object;
> > - unsigned long flags;
> > struct kmem_cache_cpu *c;
> >
>
> What if we prefetch c->freelist here ? I see in this diff that the other
> code just reads it sooner as a condition for the if().
Not sure what this would bring. If you read it earlier then you may
get the wrong value and then have to refetch the cacheline.
We cannot fetch c->freelist without determining c. I can remove the
check for c->page == page so that the fetch of c->freelist comes
immediately after the determination of c, but that does not change performance.
> > - c->freelist = object;
> > - } else
> > - __slab_free(s, page, x, addr, c->offset);
> > +redo:
> > + freelist = c->freelist;
>
> I suspect this smp_rmb() may be the cause of a major slowdown.
> Therefore, I think we should try taking a copy of c->page and simply
> check if it has changed right after the cmpxchg_local:
I thought so too, so I removed that smp_rmb() and tested this modification
on UP again, without any performance gains. I think the cacheline fetches
dominate the execution time here and the cmpxchg does not bring us
anything.
end of thread, other threads:[~2007-08-13 22:28 UTC | newest]
Thread overview: 26+ messages
-- links below jump to the message on this page --
[not found] <20070708034952.022985379@sgi.com>
[not found] ` <p73y7hrywel.fsf@bingen.suse.de>
2007-07-09 15:50 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
2007-07-09 15:59 ` Martin Bligh
2007-07-09 18:11 ` Christoph Lameter
2007-07-09 21:00 ` Martin Bligh
2007-07-09 21:44 ` Mathieu Desnoyers
2007-07-09 21:55 ` Christoph Lameter
2007-07-09 22:58 ` Mathieu Desnoyers
2007-07-09 23:08 ` Christoph Lameter
2007-07-10 5:16 ` [PATCH] x86_64 - Use non locked version for local_cmpxchg() Mathieu Desnoyers
2007-07-10 20:46 ` Christoph Lameter
2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
2007-07-10 8:27 ` Mathieu Desnoyers
2007-07-10 18:38 ` Christoph Lameter
2007-07-10 20:59 ` Mathieu Desnoyers
2007-08-13 22:18 ` Mathieu Desnoyers
2007-08-13 22:28 ` Christoph Lameter
[not found] ` <20070708035018.074510057@sgi.com>
[not found] ` <20070708075119.GA16631@elte.hu>
[not found] ` <20070708110224.9cd9df5b.akpm@linux-foundation.org>
[not found] ` <4691A415.6040208@yahoo.com.au>
[not found] ` <84144f020707090404l657a62c7x89d7d06b3dd6c34b@mail.gmail.com>
2007-07-09 16:08 ` [patch 09/10] Remove the SLOB allocator for 2.6.23 Christoph Lameter
2007-07-10 8:17 ` Pekka J Enberg
2007-07-10 8:27 ` Nick Piggin
2007-07-10 9:31 ` Pekka Enberg
2007-07-10 10:09 ` Nick Piggin
2007-07-10 12:02 ` Matt Mackall
2007-07-10 12:57 ` Pekka J Enberg
2007-07-10 22:12 ` Christoph Lameter
2007-07-10 22:40 ` Matt Mackall
2007-07-10 22:50 ` Christoph Lameter