* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance [not found] ` <p73y7hrywel.fsf@bingen.suse.de> @ 2007-07-09 15:50 ` Christoph Lameter 2007-07-09 15:59 ` Martin Bligh 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2007-07-09 15:50 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, linux-mm, mbligh On Sun, 8 Jul 2007, Andi Kleen wrote: > Christoph Lameter <clameter@sgi.com> writes: > > > A cmpxchg is less costly than interrupt enabe/disable > > That sounds wrong. Martin Bligh was able to significantly increase his LTTng performance by using cmpxchg. See his article in the 2007 proceedings of the OLS Volume 1, page 39. His numbers were: interrupts enable disable : 210.6ns local cmpxchg : 9.0ns -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
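The comparison Christoph points to boils down to two ways of updating a variable that only its own CPU ever writes. A minimal sketch for illustration only — counter_add_irq()/counter_add_cmpxchg() and the counter are hypothetical, not code from LTTng or SLUB, and the second variant assumes the caller has preemption disabled and that the architecture provides cmpxchg_local():

/* Style 1: keep interrupt handlers out by disabling interrupts around
 * the read-modify-write. */
static void counter_add_irq(long *counter, long v)
{
	unsigned long flags;

	local_irq_save(flags);
	*counter += v;
	local_irq_restore(flags);
}

/* Style 2: leave interrupts enabled and make the update itself atomic
 * with respect to this CPU.  No LOCK prefix is needed because no other
 * CPU writes this per-cpu variable; an interrupt that modifies the
 * counter between the read and the cmpxchg simply makes the cmpxchg
 * fail, and the update is retried. */
static void counter_add_cmpxchg(long *counter, long v)
{
	long old;

	do {
		old = *counter;
	} while (cmpxchg_local(counter, old, old + v) != old);
}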
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-09 15:50 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter @ 2007-07-09 15:59 ` Martin Bligh 2007-07-09 18:11 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Martin Bligh @ 2007-07-09 15:59 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm Christoph Lameter wrote: > On Sun, 8 Jul 2007, Andi Kleen wrote: > >> Christoph Lameter <clameter@sgi.com> writes: >> >>> A cmpxchg is less costly than interrupt enabe/disable >> That sounds wrong. > > Martin Bligh was able to significantly increase his LTTng performance > by using cmpxchg. See his article in the 2007 proceedings of the OLS > Volume 1, page 39. > > His numbers were: > > interrupts enable disable : 210.6ns > local cmpxchg : 9.0ns Those numbers came from Mathieu Desnoyers (LTTng) if you want more details. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-09 15:59 ` Martin Bligh @ 2007-07-09 18:11 ` Christoph Lameter 2007-07-09 21:00 ` Martin Bligh 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2007-07-09 18:11 UTC (permalink / raw) To: Martin Bligh; +Cc: Andi Kleen, linux-kernel, linux-mm On Mon, 9 Jul 2007, Martin Bligh wrote: > Those numbers came from Mathieu Desnoyers (LTTng) if you > want more details. Okay the source for these numbers is in his paper for the OLS 2006: Volume 1 page 208-209? I do not see the exact number that you referred to there. He seems to be comparing spinlock acquire / release vs. cmpxchg. So I guess you got your material from somewhere else? Also the cmpxchg used there is the lockless variant. cmpxchg 29 cycles w/o lock prefix and 112 with lock prefix. I see you reference another paper by Desnoyers: http://tree.celinuxforum.org/CelfPubWiki/ELC2006Presentations?action=AttachFile&do=get&target=celf2006-desnoyers.pdf I do not see anything relevant there. Where did those numbers come from? The lockless cmpxchg is certainly an interesting idea. Certain for some platforms I could disable preempt and then do a lockless cmpxchg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-09 18:11 ` Christoph Lameter @ 2007-07-09 21:00 ` Martin Bligh 2007-07-09 21:44 ` Mathieu Desnoyers 0 siblings, 1 reply; 26+ messages in thread From: Martin Bligh @ 2007-07-09 21:00 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm, Mathieu Desnoyers Christoph Lameter wrote: > On Mon, 9 Jul 2007, Martin Bligh wrote: > >> Those numbers came from Mathieu Desnoyers (LTTng) if you >> want more details. > > Okay the source for these numbers is in his paper for the OLS 2006: Volume > 1 page 208-209? I do not see the exact number that you referred to there. Nope, he was a direct co-author on the paper, was working here, and measured it. > He seems to be comparing spinlock acquire / release vs. cmpxchg. So I > guess you got your material from somewhere else? > > Also the cmpxchg used there is the lockless variant. cmpxchg 29 cycles w/o > lock prefix and 112 with lock prefix. > > I see you reference another paper by Desnoyers: > http://tree.celinuxforum.org/CelfPubWiki/ELC2006Presentations?action=AttachFile&do=get&target=celf2006-desnoyers.pdf > > I do not see anything relevant there. Where did those numbers come from? > > The lockless cmpxchg is certainly an interesting idea. Certain for some > platforms I could disable preempt and then do a lockless cmpxchg. Matheiu, can you give some more details? Obviously the exact numbers will vary by archicture, machine size, etc, but it's a good point for discussion. M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-09 21:00 ` Martin Bligh @ 2007-07-09 21:44 ` Mathieu Desnoyers 2007-07-09 21:55 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Mathieu Desnoyers @ 2007-07-09 21:44 UTC (permalink / raw) To: Martin Bligh; +Cc: Christoph Lameter, Andi Kleen, linux-kernel, linux-mm Hi, * Martin Bligh (mbligh@mbligh.org) wrote: > Christoph Lameter wrote: > >On Mon, 9 Jul 2007, Martin Bligh wrote: > > > >>Those numbers came from Mathieu Desnoyers (LTTng) if you > >>want more details. > > > >Okay the source for these numbers is in his paper for the OLS 2006: Volume > >1 page 208-209? I do not see the exact number that you referred to there. > Hrm, the reference page number is wrong: it is in OLS 2006, Vol. 1 page 216 (section 4.5.2 Scalability). I originally pulled out the page number from my local paper copy. oops. > Nope, he was a direct co-author on the paper, was > working here, and measured it. > > >He seems to be comparing spinlock acquire / release vs. cmpxchg. So I > >guess you got your material from somewhere else? > > I ran a test specifically for this paper where I got this result comparing the local irq enable/disable to local cmpxchg. > >Also the cmpxchg used there is the lockless variant. cmpxchg 29 cycles w/o > >lock prefix and 112 with lock prefix. Yep, I volountarily used the variant without lock prefix because the data is per cpu and I disable preemption. > > > >I see you reference another paper by Desnoyers: > >http://tree.celinuxforum.org/CelfPubWiki/ELC2006Presentations?action=AttachFile&do=get&target=celf2006-desnoyers.pdf > > > >I do not see anything relevant there. Where did those numbers come from? > > > >The lockless cmpxchg is certainly an interesting idea. Certain for some > >platforms I could disable preempt and then do a lockless cmpxchg. > Yes, preempt disabling or, eventually, the new thread migration disabling I just proposed as an RFC on LKML. (that would make -rt people happier) > Mathieu, can you give some more details? Obviously the exact numbers > will vary by archicture, machine size, etc, but it's a good point > for discussion. > Sure, also note that the UP cmpxchg (see asm-$ARCH/local.h in 2.6.22) is faster on architectures like powerpc and MIPS where it is possible to remove some memory barriers. See 2.6.22 Documentation/local_ops.txt for a thorough discussion. Don't hesitate ping me if you have more questions. Regards, Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
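The use case behind those numbers can be sketched with the local_t API Mathieu refers to: reserving space in a per-cpu trace buffer from any context. This is an illustrative sketch only — buf_offset, BUF_SIZE and buf_reserve() are invented for the example, not actual LTTng code — and it assumes the caller has preemption disabled so the task cannot migrate between the read and the cmpxchg:

#include <linux/percpu.h>
#include <linux/types.h>
#include <asm/local.h>

#define BUF_SIZE 4096				/* example per-cpu buffer size */

static DEFINE_PER_CPU(local_t, buf_offset);	/* bytes already reserved */

/*
 * Reserve 'len' bytes in this CPU's buffer.  Safe against interrupts and
 * NMIs on the same CPU: a nested caller either completes its cmpxchg
 * before we run ours, or makes ours fail so we retry with a fresh offset.
 */
static long buf_reserve(size_t len)
{
	local_t *offset = &__get_cpu_var(buf_offset);
	long old, new;

	do {
		old = local_read(offset);
		if (old + len > BUF_SIZE)
			return -1;		/* buffer full, drop the event */
		new = old + len;
	} while (local_cmpxchg(offset, old, new) != old);

	return old;				/* start of the reserved slot */
}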
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-09 21:44 ` Mathieu Desnoyers @ 2007-07-09 21:55 ` Christoph Lameter 2007-07-09 22:58 ` Mathieu Desnoyers 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2007-07-09 21:55 UTC (permalink / raw) To: Mathieu Desnoyers; +Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm On Mon, 9 Jul 2007, Mathieu Desnoyers wrote: > > >Okay the source for these numbers is in his paper for the OLS 2006: Volume > > >1 page 208-209? I do not see the exact number that you referred to there. > > > > Hrm, the reference page number is wrong: it is in OLS 2006, Vol. 1 page > 216 (section 4.5.2 Scalability). I originally pulled out the page number > from my local paper copy. oops. 4.5.2 is on page 208 in my copy of the proceedings. > > >He seems to be comparing spinlock acquire / release vs. cmpxchg. So I > > >guess you got your material from somewhere else? > > > > > I ran a test specifically for this paper where I got this result > comparing the local irq enable/disable to local cmpxchg. The numbers are pretty important and suggest that we can obtain a significant speed increase by avoid local irq disable enable in the slab allocator fast paths. Do you some more numbers? Any other publication that mentions these? > Yep, I volountarily used the variant without lock prefix because the > data is per cpu and I disable preemption. local_cmpxchg generates this? > Yes, preempt disabling or, eventually, the new thread migration > disabling I just proposed as an RFC on LKML. (that would make -rt people > happier) Right. > Sure, also note that the UP cmpxchg (see asm-$ARCH/local.h in 2.6.22) is > faster on architectures like powerpc and MIPS where it is possible to > remove some memory barriers. UP cmpxchg meaning local_cmpxchg? > See 2.6.22 Documentation/local_ops.txt for a thorough discussion. Don't > hesitate ping me if you have more questions. That is pretty thin and does not mention atomic_cmpxchg. You way want to expand on your ideas a bit. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-09 21:55 ` Christoph Lameter @ 2007-07-09 22:58 ` Mathieu Desnoyers 2007-07-09 23:08 ` Christoph Lameter 2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter 0 siblings, 2 replies; 26+ messages in thread From: Mathieu Desnoyers @ 2007-07-09 22:58 UTC (permalink / raw) To: Christoph Lameter; +Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm * Christoph Lameter (clameter@sgi.com) wrote: > On Mon, 9 Jul 2007, Mathieu Desnoyers wrote: > > > > >He seems to be comparing spinlock acquire / release vs. cmpxchg. So I > > > >guess you got your material from somewhere else? > > > > > > > > I ran a test specifically for this paper where I got this result > > comparing the local irq enable/disable to local cmpxchg. > > > The numbers are pretty important and suggest that we can obtain > a significant speed increase by avoid local irq disable enable in the slab > allocator fast paths. Do you some more numbers? Any other publication that > mentions these? > The original publication in which I released the idea was my LTTng paper at OLS 2006. Outside this, I have not found other paper that talks about this idea. The test code is basically just disabling interrupts, reading the TSC at the beginning and end and does 20000 loops of local_cmpxchg. I can send you the code if you want it. > > > Yep, I volountarily used the variant without lock prefix because the > > data is per cpu and I disable preemption. > > local_cmpxchg generates this? > Yes. > > Yes, preempt disabling or, eventually, the new thread migration > > disabling I just proposed as an RFC on LKML. (that would make -rt people > > happier) > > Right. > > > Sure, also note that the UP cmpxchg (see asm-$ARCH/local.h in 2.6.22) is > > faster on architectures like powerpc and MIPS where it is possible to > > remove some memory barriers. > > UP cmpxchg meaning local_cmpxchg? > Yes. > > See 2.6.22 Documentation/local_ops.txt for a thorough discussion. Don't > > hesitate ping me if you have more questions. > > That is pretty thin and does not mention atomic_cmpxchg. You way want to > expand on your ideas a bit. Sure, the idea goes as follow: if you have a per cpu variable that needs to be concurrently modified in a coherent manner by any context (NMI, irq, bh, process) running on the given CPU, you only need to use an operation atomic wrt to the given CPU. You just have to make sure that only this CPU will modify the variable (therefore, you must disable preemption around modification) and you have to make sure that the read-side, which can come from any CPU, is accessing this variable atomically. Also, you have to be aware that the read-side might see an older version of the other cpu's value because there is no SMP write memory barrier involved. The value, however, will always be up to date if the variable is read from the "local" CPU. What applies to local_inc, given as example in the local_ops.txt document, applies integrally to local_cmpxchg. And I would say that local_cmpxchg is by far the cheapest locking mechanism I have found, and use today, for my kernel tracer. The idea emerged from my need to trace every execution context, including NMIs, while still providing good performances. local_cmpxchg was the perfect fit; that's why I deployed it in local.h in each and every architecture. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. 
Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
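The read-side caveat mentioned above (a reader on another CPU may see a slightly stale value because no SMP write barrier is issued) is usually acceptable for statistics. A sketch of such a cross-CPU reader, using a hypothetical per-cpu counter:

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <asm/local.h>

static DEFINE_PER_CPU(local_t, events);		/* hypothetical counter */

/*
 * Sum the per-cpu counters from any CPU.  Each local_read() is an atomic
 * read of a long, but the total may lag slightly behind the most recent
 * increments done on remote CPUs; only the owning CPU always sees its
 * own latest value.
 */
static long events_total(void)
{
	long sum = 0;
	int cpu;

	for_each_online_cpu(cpu)
		sum += local_read(&per_cpu(events, cpu));

	return sum;
}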
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-09 22:58 ` Mathieu Desnoyers @ 2007-07-09 23:08 ` Christoph Lameter 2007-07-10 5:16 ` [PATCH] x86_64 - Use non locked version for local_cmpxchg() Mathieu Desnoyers 2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter 1 sibling, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2007-07-09 23:08 UTC (permalink / raw) To: Mathieu Desnoyers; +Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm On Mon, 9 Jul 2007, Mathieu Desnoyers wrote: > > > Yep, I volountarily used the variant without lock prefix because the > > > data is per cpu and I disable preemption. > > > > local_cmpxchg generates this? > > > > Yes. Does not work here. If I use static void __always_inline *slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *addr) { void **object; struct kmem_cache_cpu *c; preempt_disable(); c = get_cpu_slab(s, smp_processor_id()); redo: object = c->freelist; if (unlikely(!object || !node_match(c, node))) return __slab_alloc(s, gfpflags, node, addr, c); if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object) goto redo; preempt_enable(); if (unlikely((gfpflags & __GFP_ZERO))) memset(object, 0, c->objsize); return object; } Then the code will include a lock prefix: 3270: 48 8b 1a mov (%rdx),%rbx 3273: 48 85 db test %rbx,%rbx 3276: 74 23 je 329b <kmem_cache_alloc+0x4b> 3278: 8b 42 14 mov 0x14(%rdx),%eax 327b: 4c 8b 0c c3 mov (%rbx,%rax,8),%r9 327f: 48 89 d8 mov %rbx,%rax 3282: f0 4c 0f b1 0a lock cmpxchg %r9,(%rdx) 3287: 48 39 c3 cmp %rax,%rbx 328a: 75 e4 jne 3270 <kmem_cache_alloc+0x20> 328c: 66 85 f6 test %si,%si 328f: 78 19 js 32aa <kmem_cache_alloc+0x5a> 3291: 48 89 d8 mov %rbx,%rax 3294: 48 83 c4 08 add $0x8,%rsp 3298: 5b pop %rbx 3299: c9 leaveq 329a: c3 retq > What applies to local_inc, given as example in the local_ops.txt > document, applies integrally to local_cmpxchg. And I would say that > local_cmpxchg is by far the cheapest locking mechanism I have found, and > use today, for my kernel tracer. The idea emerged from my need to trace > every execution context, including NMIs, while still providing good > performances. local_cmpxchg was the perfect fit; that's why I deployed > it in local.h in each and every architecture. Great idea. The SLUB allocator may be able to use your idea to improve both the alloc and free path. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH] x86_64 - Use non locked version for local_cmpxchg() 2007-07-09 23:08 ` Christoph Lameter @ 2007-07-10 5:16 ` Mathieu Desnoyers 2007-07-10 20:46 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Mathieu Desnoyers @ 2007-07-10 5:16 UTC (permalink / raw) To: akpm; +Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm You are completely right: on x86_64, a bit got lost in the move to cmpxchg.h, here is the fix. It applies on 2.6.22-rc6-mm1. x86_64 - Use non locked version for local_cmpxchg() local_cmpxchg() should not use any LOCK prefix. This change probably got lost in the move to cmpxchg.h. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> --- include/asm-x86_64/cmpxchg.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6-lttng/include/asm-x86_64/cmpxchg.h =================================================================== --- linux-2.6-lttng.orig/include/asm-x86_64/cmpxchg.h 2007-07-10 01:10:10.000000000 -0400 +++ linux-2.6-lttng/include/asm-x86_64/cmpxchg.h 2007-07-10 01:11:03.000000000 -0400 @@ -128,7 +128,7 @@ ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ (unsigned long)(n),sizeof(*(ptr)))) #define cmpxchg_local(ptr,o,n)\ - ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ + ((__typeof__(*(ptr)))__cmpxchg_local((ptr),(unsigned long)(o),\ (unsigned long)(n),sizeof(*(ptr)))) #endif * Christoph Lameter (clameter@sgi.com) wrote: > On Mon, 9 Jul 2007, Mathieu Desnoyers wrote: > > > > > Yep, I volountarily used the variant without lock prefix because the > > > > data is per cpu and I disable preemption. > > > > > > local_cmpxchg generates this? > > > > > > > Yes. > > Does not work here. If I use > > static void __always_inline *slab_alloc(struct kmem_cache *s, > gfp_t gfpflags, int node, void *addr) > { > void **object; > struct kmem_cache_cpu *c; > > preempt_disable(); > c = get_cpu_slab(s, smp_processor_id()); > redo: > object = c->freelist; > if (unlikely(!object || !node_match(c, node))) > return __slab_alloc(s, gfpflags, node, addr, c); > > if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object) > goto redo; > > preempt_enable(); > if (unlikely((gfpflags & __GFP_ZERO))) > memset(object, 0, c->objsize); > > return object; > } > > Then the code will include a lock prefix: > > 3270: 48 8b 1a mov (%rdx),%rbx > 3273: 48 85 db test %rbx,%rbx > 3276: 74 23 je 329b <kmem_cache_alloc+0x4b> > 3278: 8b 42 14 mov 0x14(%rdx),%eax > 327b: 4c 8b 0c c3 mov (%rbx,%rax,8),%r9 > 327f: 48 89 d8 mov %rbx,%rax > 3282: f0 4c 0f b1 0a lock cmpxchg %r9,(%rdx) > 3287: 48 39 c3 cmp %rax,%rbx > 328a: 75 e4 jne 3270 <kmem_cache_alloc+0x20> > 328c: 66 85 f6 test %si,%si > 328f: 78 19 js 32aa <kmem_cache_alloc+0x5a> > 3291: 48 89 d8 mov %rbx,%rax > 3294: 48 83 c4 08 add $0x8,%rsp > 3298: 5b pop %rbx > 3299: c9 leaveq > 329a: c3 retq > > > > What applies to local_inc, given as example in the local_ops.txt > > document, applies integrally to local_cmpxchg. And I would say that > > local_cmpxchg is by far the cheapest locking mechanism I have found, and > > use today, for my kernel tracer. The idea emerged from my need to trace > > every execution context, including NMIs, while still providing good > > performances. local_cmpxchg was the perfect fit; that's why I deployed > > it in local.h in each and every architecture. > > Great idea. The SLUB allocator may be able to use your idea to improve > both the alloc and free path. > -- Mathieu Desnoyers Computer Engineering Ph.D. 
Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
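For context, the distinction the fix restores, shown as a simplified sketch from memory rather than the exact asm-x86_64/cmpxchg.h text: the two helpers emit the same cmpxchg instruction, and only the locked variant carries the LOCK prefix that makes it expensive.

/* Simplified 64-bit-only sketch; LOCK_PREFIX comes from the kernel headers. */
static inline unsigned long __cmpxchg_sketch(volatile void *ptr,
		unsigned long old, unsigned long new)
{
	unsigned long prev;

	asm volatile(LOCK_PREFIX "cmpxchgq %1,%2"
		     : "=a" (prev)
		     : "r" (new), "m" (*(volatile long *)ptr), "0" (old)
		     : "memory");
	return prev;				/* old value; caller compares */
}

static inline unsigned long __cmpxchg_local_sketch(volatile void *ptr,
		unsigned long old, unsigned long new)
{
	unsigned long prev;

	asm volatile("cmpxchgq %1,%2"		/* same instruction, no LOCK */
		     : "=a" (prev)
		     : "r" (new), "m" (*(volatile long *)ptr), "0" (old)
		     : "memory");
	return prev;
}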
* Re: [PATCH] x86_64 - Use non locked version for local_cmpxchg() 2007-07-10 5:16 ` [PATCH] x86_64 - Use non locked version for local_cmpxchg() Mathieu Desnoyers @ 2007-07-10 20:46 ` Christoph Lameter 0 siblings, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2007-07-10 20:46 UTC (permalink / raw) To: Mathieu Desnoyers; +Cc: akpm, Martin Bligh, Andi Kleen, linux-kernel, linux-mm On Tue, 10 Jul 2007, Mathieu Desnoyers wrote: > You are completely right: on x86_64, a bit got lost in the move to > cmpxchg.h, here is the fix. It applies on 2.6.22-rc6-mm1. A trivial fix. Make sure that it gets merged soon. Acked-by: Christoph Lameter <clameter@sgi.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-09 22:58 ` Mathieu Desnoyers 2007-07-09 23:08 ` Christoph Lameter @ 2007-07-10 0:55 ` Christoph Lameter 2007-07-10 8:27 ` Mathieu Desnoyers 2007-08-13 22:18 ` Mathieu Desnoyers 1 sibling, 2 replies; 26+ messages in thread From: Christoph Lameter @ 2007-07-10 0:55 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller Ok here is a replacement patch for the cmpxchg patch. Problems 1. cmpxchg_local is not available on all arches. If we wanted to do this then it needs to be universally available. 2. cmpxchg_local does generate the "lock" prefix. It should not do that. Without fixes to cmpxchg_local we cannot expect maximum performance. 3. The approach is x86 centric. It relies on a cmpxchg that does not synchronize with memory used by other cpus and therefore is more lightweight. As far as I know the IA64 cmpxchg cannot do that. Neither several other processors. I am not sure how cmpxchgless platforms would use that. We need a detailed comparison of interrupt enable /disable vs. cmpxchg cycle counts for cachelines in the cpu cache to evaluate the impact that such a change would have. The cmpxchg (or its emulation) does not need any barriers since the accesses can only come from a single processor. Mathieu measured a significant performance benefit coming from not using interrupt enable / disable. Some rough processor cycle counts (anyone have better numbers?) STI CLI CMPXCHG IA32 36 26 1 (assume XCHG == CMPXCHG, sti/cli also need stack pushes/pulls) IA64 12 12 1 (but ar.ccv needs 11 cycles to set comparator, need register moves to preserve processors flags) Looks like STI/CLI is pretty expensive and it seems that we may be able to optimize the alloc / free hotpath quite a bit if we could drop the interrupt enable / disable. But we need some measurements. Draft of a new patch: SLUB: Single atomic instruction alloc/free using cmpxchg_local A cmpxchg allows us to avoid disabling and enabling interrupts. The cmpxchg is optimal to allow operations on per cpu freelist. We can stay on one processor by disabling preemption() and allowing concurrent interrupts thus avoiding the overhead of disabling and enabling interrupts. Pro: - No need to disable interrupts. - Preempt disable /enable vanishes on non preempt kernels Con: - Slightly complexer handling. 
- Updates to atomic instructions needed Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/slub.c | 72 ++++++++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 49 insertions(+), 23 deletions(-) Index: linux-2.6.22-rc6-mm1/mm/slub.c =================================================================== --- linux-2.6.22-rc6-mm1.orig/mm/slub.c 2007-07-09 15:04:46.000000000 -0700 +++ linux-2.6.22-rc6-mm1/mm/slub.c 2007-07-09 17:09:00.000000000 -0700 @@ -1467,12 +1467,14 @@ static void *__slab_alloc(struct kmem_ca { void **object; struct page *new; + unsigned long flags; + local_irq_save(flags); if (!c->page) goto new_slab; slab_lock(c->page); - if (unlikely(!node_match(c, node))) + if (unlikely(!node_match(c, node) || c->freelist)) goto another_slab; load_freelist: object = c->page->freelist; @@ -1486,7 +1488,14 @@ load_freelist: c->page->inuse = s->objects; c->page->freelist = NULL; c->node = page_to_nid(c->page); +out: slab_unlock(c->page); + local_irq_restore(flags); + preempt_enable(); + + if (unlikely((gfpflags & __GFP_ZERO))) + memset(object, 0, c->objsize); + return object; another_slab: @@ -1527,6 +1536,8 @@ new_slab: c->page = new; goto load_freelist; } + local_irq_restore(flags); + preempt_enable(); return NULL; debug: c->freelist = NULL; @@ -1536,8 +1547,7 @@ debug: c->page->inuse++; c->page->freelist = object[c->offset]; - slab_unlock(c->page); - return object; + goto out; } /* @@ -1554,23 +1564,20 @@ static void __always_inline *slab_alloc( gfp_t gfpflags, int node, void *addr) { void **object; - unsigned long flags; struct kmem_cache_cpu *c; - local_irq_save(flags); + preempt_disable(); c = get_cpu_slab(s, smp_processor_id()); - if (unlikely(!c->page || !c->freelist || - !node_match(c, node))) +redo: + object = c->freelist; + if (unlikely(!object || !node_match(c, node))) + return __slab_alloc(s, gfpflags, node, addr, c); - object = __slab_alloc(s, gfpflags, node, addr, c); + if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object) + goto redo; - else { - object = c->freelist; - c->freelist = object[c->offset]; - } - local_irq_restore(flags); - - if (unlikely((gfpflags & __GFP_ZERO) && object)) + preempt_enable(); + if (unlikely((gfpflags & __GFP_ZERO))) memset(object, 0, c->objsize); return object; @@ -1603,7 +1610,9 @@ static void __slab_free(struct kmem_cach { void *prior; void **object = (void *)x; + unsigned long flags; + local_irq_save(flags); slab_lock(page); if (unlikely(SlabDebug(page))) @@ -1629,6 +1638,8 @@ checks_ok: out_unlock: slab_unlock(page); + local_irq_restore(flags); + preempt_enable(); return; slab_empty: @@ -1639,6 +1650,8 @@ slab_empty: remove_partial(s, page); slab_unlock(page); + local_irq_restore(flags); + preempt_enable(); discard_slab(s, page); return; @@ -1663,18 +1676,31 @@ static void __always_inline slab_free(st struct page *page, void *x, void *addr) { void **object = (void *)x; - unsigned long flags; struct kmem_cache_cpu *c; + void **freelist; - local_irq_save(flags); + preempt_disable(); c = get_cpu_slab(s, smp_processor_id()); - if (likely(page == c->page && c->freelist)) { - object[c->offset] = c->freelist; - c->freelist = object; - } else - __slab_free(s, page, x, addr, c->offset); +redo: + freelist = c->freelist; + /* + * Must read freelist before c->page. If a interrupt occurs and + * changes c->page after we have read it here then it + * will also have changed c->freelist and the cmpxchg will fail. 
+ * + * If we would have checked c->page first then the freelist could + * have been changed under us before we read c->freelist and we + * would not be able to detect that situation. + */ + smp_rmb(); + if (unlikely(page != c->page || !freelist)) + return __slab_free(s, page, x, addr, c->offset); + + object[c->offset] = freelist; + if (cmpxchg_local(&c->freelist, freelist, object) != freelist) + goto redo; - local_irq_restore(flags); + preempt_enable(); } void kmem_cache_free(struct kmem_cache *s, void *x) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
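Stripped of the slow paths and of the c->page bookkeeping, the fast paths of the draft reduce to a per-cpu freelist pop/push done with a local cmpxchg instead of an irq-off section. A simplified sketch for illustration — struct cpu_freelist and the two helpers are invented here, each free object is assumed to store the pointer to the next free object in its first word, and the ABA-type races a production version has to consider are ignored:

struct cpu_freelist {
	void **head;			/* per-cpu list of free objects */
};

/* Allocation fast path: pop the first free object. */
static void *freelist_pop(struct cpu_freelist *c)
{
	void **object;

	preempt_disable();
	do {
		object = c->head;
		if (!object) {
			preempt_enable();
			return NULL;	/* would fall back to the slow path */
		}
		/*
		 * object[0] holds the next free object.  A stale read is
		 * harmless: if an interrupt popped the object meanwhile,
		 * c->head changed and the cmpxchg below fails.
		 */
	} while (cmpxchg_local(&c->head, object, (void **)object[0]) != object);
	preempt_enable();

	return object;
}

/* Free fast path: push the object back onto the list. */
static void freelist_push(struct cpu_freelist *c, void *x)
{
	void **object = x;
	void **head;

	preempt_disable();
	do {
		head = c->head;
		object[0] = head;	/* link in front of the current head */
	} while (cmpxchg_local(&c->head, head, object) != head);
	preempt_enable();
}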
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter @ 2007-07-10 8:27 ` Mathieu Desnoyers 2007-07-10 18:38 ` Christoph Lameter 2007-07-10 20:59 ` Mathieu Desnoyers 2007-08-13 22:18 ` Mathieu Desnoyers 1 sibling, 2 replies; 26+ messages in thread From: Mathieu Desnoyers @ 2007-07-10 8:27 UTC (permalink / raw) To: Christoph Lameter Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller * Christoph Lameter (clameter@sgi.com) wrote: > Ok here is a replacement patch for the cmpxchg patch. Problems > > 1. cmpxchg_local is not available on all arches. If we wanted to do > this then it needs to be universally available. > cmpxchg_local is not available on all archs, but local_cmpxchg is. It expects a local_t type which is nothing else than a long. When the local atomic operation is not more efficient or not implemented on a given architecture, asm-generic/local.h falls back on atomic_long_t. If you want, you could work on the local_t type, which you could cast from a long to a pointer when you need so, since their size are, AFAIK, always the same (and some VM code even assume this is always the case). > 2. cmpxchg_local does generate the "lock" prefix. It should not do that. > Without fixes to cmpxchg_local we cannot expect maximum performance. > Yup, see the patch I just posted for this. > 3. The approach is x86 centric. It relies on a cmpxchg that does not > synchronize with memory used by other cpus and therefore is more > lightweight. As far as I know the IA64 cmpxchg cannot do that. > Neither several other processors. I am not sure how cmpxchgless > platforms would use that. We need a detailed comparison of > interrupt enable /disable vs. cmpxchg cycle counts for cachelines in > the cpu cache to evaluate the impact that such a change would have. > > The cmpxchg (or its emulation) does not need any barriers since the > accesses can only come from a single processor. > Yes, expected improvements goes as follow: x86, x86_64 : must faster due to non-LOCKed cmpxchg alpha: should be faster due to memory barrier removal mips: memory barriers removed powerpc 32/64: memory barriers removed On other architectures, either there is no better implementation than the standard atomic cmpxchg or it just has not been implemented. I guess that a test series that would tell us how must improvement is seen on the optimized architectures (local cmpxchg vs interrupt enable/disable) and also what effect the standard cmpxchg has compared to interrupt disable/enable on the architectures where we can't do better than the standard cmpxchg will tell us if it is an interesting way to go. I would be happy to do these tests, but I don't have the hardware handy. I provide a test module to get these characteristics from various architectures in this email. > Mathieu measured a significant performance benefit coming from not using > interrupt enable / disable. > > Some rough processor cycle counts (anyone have better numbers?) 
> > STI CLI CMPXCHG > IA32 36 26 1 (assume XCHG == CMPXCHG, sti/cli also need stack pushes/pulls) > IA64 12 12 1 (but ar.ccv needs 11 cycles to set comparator, > need register moves to preserve processors flags) > The measurements I get (in cycles): enable interrupts (STI) disable interrupts (CLI) local CMPXCHG IA32 (P4) 112 82 26 x86_64 AMD64 125 102 19 > Looks like STI/CLI is pretty expensive and it seems that we may be able to > optimize the alloc / free hotpath quite a bit if we could drop the > interrupt enable / disable. But we need some measurements. > > > Draft of a new patch: > > SLUB: Single atomic instruction alloc/free using cmpxchg_local > > A cmpxchg allows us to avoid disabling and enabling interrupts. The cmpxchg > is optimal to allow operations on per cpu freelist. We can stay on one > processor by disabling preemption() and allowing concurrent interrupts > thus avoiding the overhead of disabling and enabling interrupts. > > Pro: > - No need to disable interrupts. > - Preempt disable /enable vanishes on non preempt kernels > Con: > - Slightly complexer handling. > - Updates to atomic instructions needed > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > Test local cmpxchg vs int disable/enable. Please run on a 2.6.22 kernel (or recent 2.6.21-rcX-mmX) (with my cmpxchg local fix patch for x86_64). Make sure the TSC reads (get_cycles()) are reliable on your platform. Mathieu /* test-cmpxchg-nolock.c * * Compare local cmpxchg with irq disable / enable. */ #include <linux/jiffies.h> #include <linux/compiler.h> #include <linux/init.h> #include <linux/module.h> #include <linux/calc64.h> #include <asm/timex.h> #include <asm/system.h> #define NR_LOOPS 20000 int test_val = 0; static void do_test_cmpxchg(void) { int ret; long flags; unsigned int i; cycles_t time1, time2, time; long rem; local_irq_save(flags); preempt_disable(); time1 = get_cycles(); for (i = 0; i < NR_LOOPS; i++) { ret = cmpxchg_local(&test_val, 0, 0); } time2 = get_cycles(); local_irq_restore(flags); preempt_enable(); time = time2 - time1; printk(KERN_ALERT "test results: time for non locked cmpxchg\n"); printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS); printk(KERN_ALERT "total time: %llu\n", time); time = div_long_long_rem(time, NR_LOOPS, &rem); printk(KERN_ALERT "-> non locked cmpxchg takes %llu cycles\n", time); printk(KERN_ALERT "test end\n"); } /* * This test will have a higher standard deviation due to incoming interrupts. 
*/ static void do_test_enable_int(void) { long flags; unsigned int i; cycles_t time1, time2, time; long rem; local_irq_save(flags); preempt_disable(); time1 = get_cycles(); for (i = 0; i < NR_LOOPS; i++) { local_irq_restore(flags); } time2 = get_cycles(); local_irq_restore(flags); preempt_enable(); time = time2 - time1; printk(KERN_ALERT "test results: time for enabling interrupts (STI)\n"); printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS); printk(KERN_ALERT "total time: %llu\n", time); time = div_long_long_rem(time, NR_LOOPS, &rem); printk(KERN_ALERT "-> enabling interrupts (STI) takes %llu cycles\n", time); printk(KERN_ALERT "test end\n"); } static void do_test_disable_int(void) { unsigned long flags, flags2; unsigned int i; cycles_t time1, time2, time; long rem; local_irq_save(flags); preempt_disable(); time1 = get_cycles(); for ( i = 0; i < NR_LOOPS; i++) { local_irq_save(flags2); } time2 = get_cycles(); local_irq_restore(flags); preempt_enable(); time = time2 - time1; printk(KERN_ALERT "test results: time for disabling interrupts (CLI)\n"); printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS); printk(KERN_ALERT "total time: %llu\n", time); time = div_long_long_rem(time, NR_LOOPS, &rem); printk(KERN_ALERT "-> disabling interrupts (CLI) takes %llu cycles\n", time); printk(KERN_ALERT "test end\n"); } static int ltt_test_init(void) { printk(KERN_ALERT "test init\n"); do_test_cmpxchg(); do_test_enable_int(); do_test_disable_int(); return -EAGAIN; /* Fail will directly unload the module */ } static void ltt_test_exit(void) { printk(KERN_ALERT "test exit\n"); } module_init(ltt_test_init) module_exit(ltt_test_exit) MODULE_LICENSE("GPL"); MODULE_AUTHOR("Mathieu Desnoyers"); MODULE_DESCRIPTION("Cmpxchg local test"); -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
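Since the thread also compares the unlocked and LOCK-prefixed forms (29 vs. 112 cycles were quoted from the OLS paper earlier), a fourth test in the same style could time the locked cmpxchg on the same machine. A sketch that assumes it is added to the module above and called from ltt_test_init(); do_test_cmpxchg_locked() is a name invented here:

static void do_test_cmpxchg_locked(void)
{
	int ret;
	unsigned long flags;
	unsigned int i;
	cycles_t time1, time2, time;
	long rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		ret = cmpxchg(&test_val, 0, 0);	/* LOCK-prefixed on SMP */
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;
	printk(KERN_ALERT "test results: time for locked cmpxchg\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
	printk(KERN_ALERT "total time: %llu\n", time);
	time = div_long_long_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> locked cmpxchg takes %llu cycles\n", time);
	printk(KERN_ALERT "test end\n");
}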
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-10 8:27 ` Mathieu Desnoyers @ 2007-07-10 18:38 ` Christoph Lameter 2007-07-10 20:59 ` Mathieu Desnoyers 1 sibling, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2007-07-10 18:38 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller On Tue, 10 Jul 2007, Mathieu Desnoyers wrote: > cmpxchg_local is not available on all archs, but local_cmpxchg is. It > expects a local_t type which is nothing else than a long. When the local > atomic operation is not more efficient or not implemented on a given > architecture, asm-generic/local.h falls back on atomic_long_t. If you > want, you could work on the local_t type, which you could cast from a > long to a pointer when you need so, since their size are, AFAIK, always > the same (and some VM code even assume this is always the case). It would be cleaner to have cmpxchg_local on all arches. The type conversion is hacky. If this is really working then we should also use the mechanism for other things like the vm statistics. > The measurements I get (in cycles): > > enable interrupts (STI) disable interrupts (CLI) local CMPXCHG > IA32 (P4) 112 82 26 > x86_64 AMD64 125 102 19 Looks good and seems to indicate that we can at least double the speed of slab allocation. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-10 8:27 ` Mathieu Desnoyers 2007-07-10 18:38 ` Christoph Lameter @ 2007-07-10 20:59 ` Mathieu Desnoyers 1 sibling, 0 replies; 26+ messages in thread From: Mathieu Desnoyers @ 2007-07-10 20:59 UTC (permalink / raw) To: Christoph Lameter Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller, Alexandre Guédon

Another architecture tested.

Comparison: irq enable/disable vs local CMPXCHG (in cycles)

                          enable interrupts (STI)  disable interrupts (CLI)  local CMPXCHG

Tested-by: Mathieu Desnoyers <compudj@krystal.dyndns.org>
IA32 (P4)                        112                        82                      26
x86_64 AMD64                     125                       102                      19

Tested-by: Alexandre Guedon <totalworlddomination@gmail.com>
x86_64 Intel Core2 Quad           21                        19                       7

-- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter 2007-07-10 8:27 ` Mathieu Desnoyers @ 2007-08-13 22:18 ` Mathieu Desnoyers 2007-08-13 22:28 ` Christoph Lameter 1 sibling, 1 reply; 26+ messages in thread From: Mathieu Desnoyers @ 2007-08-13 22:18 UTC (permalink / raw) To: Christoph Lameter Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller Some review here. I think we could do much better.. * Christoph Lameter (clameter@sgi.com) wrote: > Index: linux-2.6.22-rc6-mm1/mm/slub.c > =================================================================== > --- linux-2.6.22-rc6-mm1.orig/mm/slub.c 2007-07-09 15:04:46.000000000 -0700 > +++ linux-2.6.22-rc6-mm1/mm/slub.c 2007-07-09 17:09:00.000000000 -0700 > @@ -1467,12 +1467,14 @@ static void *__slab_alloc(struct kmem_ca > { > void **object; > struct page *new; > + unsigned long flags; > > + local_irq_save(flags); > if (!c->page) > goto new_slab; > > slab_lock(c->page); > - if (unlikely(!node_match(c, node))) > + if (unlikely(!node_match(c, node) || c->freelist)) > goto another_slab; > load_freelist: > object = c->page->freelist; > @@ -1486,7 +1488,14 @@ load_freelist: > c->page->inuse = s->objects; > c->page->freelist = NULL; > c->node = page_to_nid(c->page); > +out: > slab_unlock(c->page); > + local_irq_restore(flags); > + preempt_enable(); > + > + if (unlikely((gfpflags & __GFP_ZERO))) > + memset(object, 0, c->objsize); > + > return object; > > another_slab: > @@ -1527,6 +1536,8 @@ new_slab: > c->page = new; > goto load_freelist; > } > + local_irq_restore(flags); > + preempt_enable(); > return NULL; > debug: > c->freelist = NULL; > @@ -1536,8 +1547,7 @@ debug: > > c->page->inuse++; > c->page->freelist = object[c->offset]; > - slab_unlock(c->page); > - return object; > + goto out; > } > > /* > @@ -1554,23 +1564,20 @@ static void __always_inline *slab_alloc( > gfp_t gfpflags, int node, void *addr) > { > void **object; > - unsigned long flags; > struct kmem_cache_cpu *c; > What if we prefetch c->freelist here ? I see in this diff that the other code just reads it sooner as a condition for the if(). 
> - local_irq_save(flags); > + preempt_disable(); > c = get_cpu_slab(s, smp_processor_id()); > - if (unlikely(!c->page || !c->freelist || > - !node_match(c, node))) > +redo: > + object = c->freelist; > + if (unlikely(!object || !node_match(c, node))) > + return __slab_alloc(s, gfpflags, node, addr, c); > > - object = __slab_alloc(s, gfpflags, node, addr, c); > + if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object) > + goto redo; > > - else { > - object = c->freelist; > - c->freelist = object[c->offset]; > - } > - local_irq_restore(flags); > - > - if (unlikely((gfpflags & __GFP_ZERO) && object)) > + preempt_enable(); > + if (unlikely((gfpflags & __GFP_ZERO))) > memset(object, 0, c->objsize); > > return object; > @@ -1603,7 +1610,9 @@ static void __slab_free(struct kmem_cach > { > void *prior; > void **object = (void *)x; > + unsigned long flags; > > + local_irq_save(flags); > slab_lock(page); > > if (unlikely(SlabDebug(page))) > @@ -1629,6 +1638,8 @@ checks_ok: > > out_unlock: > slab_unlock(page); > + local_irq_restore(flags); > + preempt_enable(); > return; > > slab_empty: > @@ -1639,6 +1650,8 @@ slab_empty: > remove_partial(s, page); > > slab_unlock(page); > + local_irq_restore(flags); > + preempt_enable(); > discard_slab(s, page); > return; > > @@ -1663,18 +1676,31 @@ static void __always_inline slab_free(st > struct page *page, void *x, void *addr) > { > void **object = (void *)x; > - unsigned long flags; > struct kmem_cache_cpu *c; > + void **freelist; > Prefetching c->freelist would also make sense here. > - local_irq_save(flags); > + preempt_disable(); > c = get_cpu_slab(s, smp_processor_id()); > - if (likely(page == c->page && c->freelist)) { > - object[c->offset] = c->freelist; > - c->freelist = object; > - } else > - __slab_free(s, page, x, addr, c->offset); > +redo: > + freelist = c->freelist; I suspect this smp_rmb() may be the cause of a major slowdown. Therefore, I think we should try taking a copy of c->page and simply check if it has changed right after the cmpxchg_local: page = c->page; > + /* > + * Must read freelist before c->page. If a interrupt occurs and > + * changes c->page after we have read it here then it > + * will also have changed c->freelist and the cmpxchg will fail. > + * > + * If we would have checked c->page first then the freelist could > + * have been changed under us before we read c->freelist and we > + * would not be able to detect that situation. > + */ > + smp_rmb(); > + if (unlikely(page != c->page || !freelist)) > + return __slab_free(s, page, x, addr, c->offset); > + > + object[c->offset] = freelist; -> + if (cmpxchg_local(&c->freelist, freelist, object) != freelist) +> + if (cmpxchg_local(&c->freelist, freelist, object) != freelist || page != c->page) > + goto redo; > Therefore, in the scenario where: 1 - c->page is read 2 - Interrupt comes, changes c->page and c->freelist 3 - c->freelist is read 4 - cmpxchg c->freelist succeeds 5 - Then, page != c->page, so we goto redo. It also works if 4 and 5 are swapped. I could test the modification if you point to me which kernel version it should apply to. However, I don't have the same hardware you use. By the way, the smp_rmb() barrier does not make sense with the comment. If it is _really_ protecting against reordering wrt interrupts, then it should be a rmb(), not smp_rmb() (because it will be reordered on UP). But I think the best would just be to work without rmb() at all, as proposed here. 
Mathieu > - local_irq_restore(flags); > + preempt_enable(); > } > > void kmem_cache_free(struct kmem_cache *s, void *x) > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance 2007-08-13 22:18 ` Mathieu Desnoyers @ 2007-08-13 22:28 ` Christoph Lameter 0 siblings, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2007-08-13 22:28 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Martin Bligh, Andi Kleen, linux-kernel, linux-mm, David Miller On Mon, 13 Aug 2007, Mathieu Desnoyers wrote: > > @@ -1554,23 +1564,20 @@ static void __always_inline *slab_alloc( > > gfp_t gfpflags, int node, void *addr) > > { > > void **object; > > - unsigned long flags; > > struct kmem_cache_cpu *c; > > > > What if we prefetch c->freelist here ? I see in this diff that the other > code just reads it sooner as a condition for the if(). Not sure as to what this may bring. If you read it earlier then you may get the wrong value and then may have to refetch the cacheline. We cannot fetch c->freelist without determining c. I can remove the check for c->page == page so that the fetch of c->freelist comes immediately after determination of c. But that does not change performance. > > - c->freelist = object; > > - } else > > - __slab_free(s, page, x, addr, c->offset); > > +redo: > > + freelist = c->freelist; > > I suspect this smp_rmb() may be the cause of a major slowdown. > Therefore, I think we should try taking a copy of c->page and simply > check if it has changed right after the cmpxchg_local: Thought so too and I removed that smp_rmb and tested this modification on UP again without any performance gains. I think the cacheline fetches dominate the execution thread here and cmpxchg does not bring us anything. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23 [not found] ` <84144f020707090404l657a62c7x89d7d06b3dd6c34b@mail.gmail.com> @ 2007-07-09 16:08 ` Christoph Lameter 2007-07-10 8:17 ` Pekka J Enberg 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2007-07-09 16:08 UTC (permalink / raw) To: Pekka Enberg Cc: Nick Piggin, Andrew Morton, Ingo Molnar, linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough, Matt Mackall, Denis Vlasenko, Erik Andersen On Mon, 9 Jul 2007, Pekka Enberg wrote: > I assume with "slab external fragmentation" you mean allocating a > whole page for a slab when there are not enough objects to fill the > whole thing thus wasting memory? We could try to combat that by > packing multiple variable-sized slabs within a single page. Also, > adding some non-power-of-two kmalloc caches might help with internal > fragmentation. Ther are already non-power-of-two kmalloc caches for 96 and 192 bytes sizes. > > In any case, SLUB needs some serious tuning for smaller machines > before we can get rid of SLOB. Switch off CONFIG_SLUB_DEBUG to get memory savings. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23 2007-07-09 16:08 ` [patch 09/10] Remove the SLOB allocator for 2.6.23 Christoph Lameter @ 2007-07-10 8:17 ` Pekka J Enberg 2007-07-10 8:27 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Pekka J Enberg @ 2007-07-10 8:17 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrew Morton, Ingo Molnar, linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough, Matt Mackall, Denis Vlasenko, Erik Andersen Hi Christoph, On Mon, 9 Jul 2007, Pekka Enberg wrote: > > I assume with "slab external fragmentation" you mean allocating a > > whole page for a slab when there are not enough objects to fill the > > whole thing thus wasting memory? We could try to combat that by > > packing multiple variable-sized slabs within a single page. Also, > > adding some non-power-of-two kmalloc caches might help with internal > > fragmentation. On Mon, 9 Jul 2007, Christoph Lameter wrote: > Ther are already non-power-of-two kmalloc caches for 96 and 192 bytes > sizes. I know that, but for my setup at least, there seems to be a need for a non-power of two cache between 512 and 1024. What I am seeing is average allocation size for kmalloc-512 being around 270-280 which wastes total of 10 KB of memory due to internal fragmentation. Might be a buggy caller that can be fixed with its own cache too. On Mon, 9 Jul 2007, Pekka Enberg wrote: > > In any case, SLUB needs some serious tuning for smaller machines > > before we can get rid of SLOB. On Mon, 9 Jul 2007, Christoph Lameter wrote: > Switch off CONFIG_SLUB_DEBUG to get memory savings. Curious, /proc/meminfo immediately after boot shows: SLUB (debugging enabled): (none):~# cat /proc/meminfo MemTotal: 30260 kB MemFree: 22096 kB SLUB (debugging disabled): (none):~# cat /proc/meminfo MemTotal: 30276 kB MemFree: 22244 kB SLOB: (none):~# cat /proc/meminfo MemTotal: 30280 kB MemFree: 22004 kB That's 92 KB advantage for SLUB with debugging enabled and 240 KB when debugging is disabled. Nick, Matt, care to retest SLUB and SLOB for your setups? Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
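To put the internal-fragmentation argument in concrete numbers, here is a stand-alone user-space sketch; the size-class tables are examples chosen for the illustration (including the 96/192 byte caches mentioned above plus hypothetical 384/768 byte ones), not the kernel's actual kmalloc_index() logic:

#include <stdio.h>
#include <stddef.h>

/* Round a request up to the first size class that fits. */
static size_t round_to_class(size_t size, const size_t *classes, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (size <= classes[i])
			return classes[i];
	return 0;	/* too big: would go to the page allocator */
}

int main(void)
{
	const size_t power_of_two[] = { 32, 64, 96, 128, 192, 256, 512, 1024, 2048 };
	const size_t with_extras[]  = { 32, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 2048 };
	size_t req = 280;	/* observed average size going into kmalloc-512 */
	size_t a = round_to_class(req, power_of_two,
				  sizeof(power_of_two) / sizeof(power_of_two[0]));
	size_t b = round_to_class(req, with_extras,
				  sizeof(with_extras) / sizeof(with_extras[0]));

	printf("request %zu -> %zu byte class, %zu bytes wasted\n", req, a, a - req);
	printf("request %zu -> %zu byte class, %zu bytes wasted\n", req, b, b - req);
	return 0;
}

With the power-of-two classes a 280-byte request wastes 232 bytes; at roughly 45 such objects live that is about the 10 KB of waste reported above.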
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23 2007-07-10 8:17 ` Pekka J Enberg @ 2007-07-10 8:27 ` Nick Piggin 2007-07-10 9:31 ` Pekka Enberg 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-07-10 8:27 UTC (permalink / raw) To: Pekka J Enberg Cc: Christoph Lameter, Andrew Morton, Ingo Molnar, linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough, Matt Mackall, Denis Vlasenko, Erik Andersen Pekka J Enberg wrote: > Curious, /proc/meminfo immediately after boot shows: > > SLUB (debugging enabled): > > (none):~# cat /proc/meminfo > MemTotal: 30260 kB > MemFree: 22096 kB > > SLUB (debugging disabled): > > (none):~# cat /proc/meminfo > MemTotal: 30276 kB > MemFree: 22244 kB > > SLOB: > > (none):~# cat /proc/meminfo > MemTotal: 30280 kB > MemFree: 22004 kB > > That's 92 KB advantage for SLUB with debugging enabled and 240 KB when > debugging is disabled. Interesting. What kernel version are you using? > Nick, Matt, care to retest SLUB and SLOB for your setups? I don't think there has been a significant change in the area of memory efficiency in either since I last tested, and Christoph and I both produced the same result. I can't say where SLOB is losing its memory, but there are a few places that can still be improved, so I might get keen and take another look at it once all the improvements to both allocators gets upstream. -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23 2007-07-10 8:27 ` Nick Piggin @ 2007-07-10 9:31 ` Pekka Enberg 2007-07-10 10:09 ` Nick Piggin 2007-07-10 12:02 ` Matt Mackall 0 siblings, 2 replies; 26+ messages in thread From: Pekka Enberg @ 2007-07-10 9:31 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Andrew Morton, Ingo Molnar, linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough, Matt Mackall, Denis Vlasenko, Erik Andersen Hi Nick, Pekka J Enberg wrote: > > That's 92 KB advantage for SLUB with debugging enabled and 240 KB when > > debugging is disabled. On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Interesting. What kernel version are you using? Linus' git head from yesterday so the results are likely to be sensitive to workload and mine doesn't represent real embedded use. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23 2007-07-10 9:31 ` Pekka Enberg @ 2007-07-10 10:09 ` Nick Piggin 2007-07-10 12:02 ` Matt Mackall 1 sibling, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-07-10 10:09 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andrew Morton, Ingo Molnar, linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough, Matt Mackall, Denis Vlasenko, Erik Andersen Pekka Enberg wrote: > Hi Nick, > > Pekka J Enberg wrote: > >> > That's 92 KB advantage for SLUB with debugging enabled and 240 KB when >> > debugging is disabled. > > > On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> Interesting. What kernel version are you using? > > > Linus' git head from yesterday so the results are likely to be > sensitive to workload and mine doesn't represent real embedded use. Hi Pekka, There is one thing that the SLOB patches in -mm do besides result in slightly better packing and memory efficiency (which might be unlikely to explain the difference you are seeing), and that is that they do away with the delayed freeing of unused SLOB pages back to the page allocator. In git head, these pages are freed via a timer so they can take a while to make their way back to the buddy allocator so they don't register as free memory as such. Anyway, I would be very interested to see any situation where the SLOB in -mm uses more memory than SLUB, even on test configs like yours. Thanks, Nick -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23 2007-07-10 9:31 ` Pekka Enberg 2007-07-10 10:09 ` Nick Piggin @ 2007-07-10 12:02 ` Matt Mackall 2007-07-10 12:57 ` Pekka J Enberg 2007-07-10 22:12 ` Christoph Lameter 1 sibling, 2 replies; 26+ messages in thread From: Matt Mackall @ 2007-07-10 12:02 UTC (permalink / raw) To: Pekka Enberg Cc: Nick Piggin, Christoph Lameter, Andrew Morton, Ingo Molnar, linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough, Denis Vlasenko, Erik Andersen On Tue, Jul 10, 2007 at 12:31:40PM +0300, Pekka Enberg wrote: > Hi Nick, > > Pekka J Enberg wrote: > >> That's 92 KB advantage for SLUB with debugging enabled and 240 KB when > >> debugging is disabled. > > On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >Interesting. What kernel version are you using? > > Linus' git head from yesterday so the results are likely to be > sensitive to workload and mine doesn't represent real embedded use. Using 2.6.22-rc6-mm1 with a 64MB lguest and busybox, I'm seeing the following as the best MemFree numbers after several boots each: SLAB: 54796 SLOB: 55044 SLUB: 53944 SLUB: 54788 (debug turned off) These numbers bounce around a lot more from boot to boot than I remember, so take these numbers with a grain of salt. Disabling the debug code in the build gives this, by the way: mm/slub.c: In function ‘init_kmem_cache_node’: mm/slub.c:1873: error: ‘struct kmem_cache_node’ has no member named ‘full’ -- Mathematics is the supreme nostalgia of our time. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
  2007-07-10 12:02             ` Matt Mackall
@ 2007-07-10 12:57               ` Pekka J Enberg
  2007-07-10 22:12               ` Christoph Lameter
  1 sibling, 0 replies; 26+ messages in thread
From: Pekka J Enberg @ 2007-07-10 12:57 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Nick Piggin, Christoph Lameter, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
	Denis Vlasenko, Erik Andersen

Hi Matt,

On Tue, 10 Jul 2007, Matt Mackall wrote:
> Using 2.6.22-rc6-mm1 with a 64MB lguest and busybox, I'm seeing the
> following as the best MemFree numbers after several boots each:
>
> SLAB: 54796
> SLOB: 55044
> SLUB: 53944
> SLUB: 54788 (debug turned off)
>
> These numbers bounce around a lot more from boot to boot than I
> remember, so take these numbers with a grain of salt.

To rule out userland, 2.6.22 with 32 MB defconfig UML and busybox [1]
on i386:

SLOB: 26708
SLUB: 27212 (no debug)

Unfortunately UML is broken in 2.6.22-rc6-mm1, so I don't know if SLOB
patches help there.

1. http://uml.nagafix.co.uk/BusyBox-1.5.0/BusyBox-1.5.0-x86-root_fs.bz2

			Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
  2007-07-10 12:02             ` Matt Mackall
  2007-07-10 12:57               ` Pekka J Enberg
@ 2007-07-10 22:12               ` Christoph Lameter
  2007-07-10 22:40                 ` Matt Mackall
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-07-10 22:12 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Pekka Enberg, Nick Piggin, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
	Denis Vlasenko, Erik Andersen

[-- Attachment #1: Type: TEXT/PLAIN, Size: 797 bytes --]

On Tue, 10 Jul 2007, Matt Mackall wrote:

> following as the best MemFree numbers after several boots each:
>
> SLAB: 54796
> SLOB: 55044
> SLUB: 53944
> SLUB: 54788 (debug turned off)

That was without "slub_debug" as a parameter or with !CONFIG_SLUB_DEBUG?

Data size and code size will decrease if you compile with
!CONFIG_SLUB_DEBUG. slub_debug on the command line governs if debug
information is used.

> These numbers bounce around a lot more from boot to boot than I
> remember, so take these numbers with a grain of salt.
>
> Disabling the debug code in the build gives this, by the way:
>
> mm/slub.c: In function 'init_kmem_cache_node':
> mm/slub.c:1873: error: 'struct kmem_cache_node' has no member named
> 'full'

A fix for that is in Andrew's tree.

^ permalink raw reply	[flat|nested] 26+ messages in thread
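A minimal sketch of the distinction Christoph draws, in illustrative C
(this is not the real mm/slub.c; slub_debug_enabled and debug_check_object
are hypothetical names): CONFIG_SLUB_DEBUG decides at build time whether
the debug code and its metadata exist at all, while the slub_debug boot
parameter only switches the compiled-in checks on at runtime.

#include <linux/init.h>

#ifdef CONFIG_SLUB_DEBUG
static int slub_debug_enabled;		/* set from the kernel command line */

static int __init setup_slub_debug(char *str)
{
	/* "slub_debug" with no options enables the default debug checks */
	slub_debug_enabled = 1;
	return 1;
}
__setup("slub_debug", setup_slub_debug);

static inline void debug_check_object(void *object)
{
	if (!slub_debug_enabled)
		return;			/* compiled in, but not enabled */
	/* poisoning / red-zone / tracking checks would run here */
}
#else
/* !CONFIG_SLUB_DEBUG: the checks compile away, shrinking code and data */
static inline void debug_check_object(void *object) { }
#endif

Only the #else branch removes the code and data entirely, which is why
Christoph asks whether the numbers were taken merely without the boot
parameter or with the debug code compiled out altogether.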
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
  2007-07-10 22:12               ` Christoph Lameter
@ 2007-07-10 22:40                 ` Matt Mackall
  2007-07-10 22:50                   ` Christoph Lameter
  0 siblings, 1 reply; 26+ messages in thread
From: Matt Mackall @ 2007-07-10 22:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Nick Piggin, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
	Denis Vlasenko, Erik Andersen

On Tue, Jul 10, 2007 at 03:12:38PM -0700, Christoph Lameter wrote:
> On Tue, 10 Jul 2007, Matt Mackall wrote:
>
> > following as the best MemFree numbers after several boots each:
> >
> > SLAB: 54796
> > SLOB: 55044
> > SLUB: 53944
> > SLUB: 54788 (debug turned off)
>
> That was without "slub_debug" as a parameter or with !CONFIG_SLUB_DEBUG?

Without the parameter, as the other way doesn't compile in -mm1.

--
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [patch 09/10] Remove the SLOB allocator for 2.6.23
  2007-07-10 22:40                 ` Matt Mackall
@ 2007-07-10 22:50                   ` Christoph Lameter
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-07-10 22:50 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Pekka Enberg, Nick Piggin, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-mm, suresh.b.siddha, corey.d.gough,
	Denis Vlasenko, Erik Andersen

On Tue, 10 Jul 2007, Matt Mackall wrote:

> Without the parameter, as the other way doesn't compile in -mm1.

here is the patch that went into mm after mm1 was released.

---
 mm/slub.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6.22-rc6-mm1/mm/slub.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slub.c	2007-07-06 13:28:57.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slub.c	2007-07-06 13:29:01.000000000 -0700
@@ -1868,7 +1868,9 @@ static void init_kmem_cache_node(struct
 	atomic_long_set(&n->nr_slabs, 0);
 	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
+#ifdef CONFIG_SLUB_DEBUG
 	INIT_LIST_HEAD(&n->full);
+#endif
 }

 #ifdef CONFIG_NUMA
@@ -1898,8 +1900,10 @@ static struct kmem_cache_node * __init e
 	page->freelist = get_freepointer(kmalloc_caches, n);
 	page->inuse++;
 	kmalloc_caches->node[node] = n;
+#ifdef CONFIG_SLUB_DEBUG
 	init_object(kmalloc_caches, n, 1);
 	init_tracking(kmalloc_caches, n);
+#endif
 	init_kmem_cache_node(n);
 	atomic_long_inc(&n->nr_slabs);
 	add_partial(n, page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread
Thread overview: 26+ messages
[not found] <20070708034952.022985379@sgi.com>
[not found] ` <p73y7hrywel.fsf@bingen.suse.de>
2007-07-09 15:50 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
2007-07-09 15:59 ` Martin Bligh
2007-07-09 18:11 ` Christoph Lameter
2007-07-09 21:00 ` Martin Bligh
2007-07-09 21:44 ` Mathieu Desnoyers
2007-07-09 21:55 ` Christoph Lameter
2007-07-09 22:58 ` Mathieu Desnoyers
2007-07-09 23:08 ` Christoph Lameter
2007-07-10 5:16 ` [PATCH] x86_64 - Use non locked version for local_cmpxchg() Mathieu Desnoyers
2007-07-10 20:46 ` Christoph Lameter
2007-07-10 0:55 ` [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance Christoph Lameter
2007-07-10 8:27 ` Mathieu Desnoyers
2007-07-10 18:38 ` Christoph Lameter
2007-07-10 20:59 ` Mathieu Desnoyers
2007-08-13 22:18 ` Mathieu Desnoyers
2007-08-13 22:28 ` Christoph Lameter
[not found] ` <20070708035018.074510057@sgi.com>
[not found] ` <20070708075119.GA16631@elte.hu>
[not found] ` <20070708110224.9cd9df5b.akpm@linux-foundation.org>
[not found] ` <4691A415.6040208@yahoo.com.au>
[not found] ` <84144f020707090404l657a62c7x89d7d06b3dd6c34b@mail.gmail.com>
2007-07-09 16:08 ` [patch 09/10] Remove the SLOB allocator for 2.6.23 Christoph Lameter
2007-07-10 8:17 ` Pekka J Enberg
2007-07-10 8:27 ` Nick Piggin
2007-07-10 9:31 ` Pekka Enberg
2007-07-10 10:09 ` Nick Piggin
2007-07-10 12:02 ` Matt Mackall
2007-07-10 12:57 ` Pekka J Enberg
2007-07-10 22:12 ` Christoph Lameter
2007-07-10 22:40 ` Matt Mackall
2007-07-10 22:50 ` Christoph Lameter