From: Andrew Morton <akpm@linux-foundation.org>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Pekka Enberg <penberg@cs.helsinki.fi>
Subject: Re: [patch 0/6] Per cpu structures for SLUB
Date: Fri, 24 Aug 2007 14:38:48 -0700
Message-ID: <20070824143848.a1ecb6bc.akpm@linux-foundation.org>
In-Reply-To: <20070823064653.081843729@sgi.com>

On Wed, 22 Aug 2007 23:46:53 -0700
Christoph Lameter <clameter@sgi.com> wrote:

> The following patchset introduces per cpu structures for SLUB. These
> are very small (and multiples of these may fit into one cacheline)
> and (apart from performance improvements) allow the addressing of
> several issues in SLUB:
> 
> 1. The number of objects per slab is no longer limited to a 16 bit
>    number.
> 
> 2. Room is freed up in the page struct. We can avoid using the
>    mapping field, which allows us to get rid of the #ifdef CONFIG_SLUB
>    in page_mapping().
> 
> 3. We will have an easier time adding new things like Peter Z.'s reserve
>    management.
> 
> The RFC for this patchset was discussed on lkml a while ago:
> 
> http://marc.info/?l=linux-kernel&m=118386677704534&w=2
> 
> (And no this patchset does not include the use of cmpxchg_local that
> we discussed recently on lkml nor the cmpxchg implementation
> mentioned in the RFC)
> 
> Performance
> -----------
> 
> 
> Norm = 2.6.23-rc3
> PCPU = Adds page allocator pass through plus per cpu structure patches
> 
> 
> IA64 8p 4n NUMA Altix
> 
>                     Single threaded            Concurrent Alloc
> 
>               Kmalloc    Alloc/Free       Kmalloc    Alloc/Free
>   Size    Norm   PCPU   Norm   PCPU   Norm   PCPU   Norm   PCPU
> ---------------------------------------------------------------
>      8     132     84     93    104     98     90     95    106
>     16      98     92     93    104    115     98     95    106
>     32     112    105     93    104    146    111     95    106
>     64     119    112     93    104    214    133     95    106
>    128     132    119     94    104    321    163     95    106
>    256+  83255    176    106    115    415    224    108    117
>    512     191    176    106    115    487    341    108    117
>   1024     252    246    106    115    937    609    108    117
>   2048     308    292    107    115   2494   1207    108    117
>   4096     341    319    107    115   2497   1217    108    117
>   8192     402    380    107    115   2367   1188    108    117
>  16384*    560    474    106    434   4464   1904    108    478
> 
> X86_64 2p SMP (Dual Core Pentium 940)
> 
>                     Single threaded            Concurrent Alloc
> 
>               Kmalloc    Alloc/Free       Kmalloc    Alloc/Free
>   Size    Norm   PCPU   Norm   PCPU   Norm   PCPU   Norm   PCPU
> ---------------------------------------------------------------
>      8     313    227    314    324    207    208    314    323
>     16     202    203    315    324    209    211    312    321
>     32     212    207    314    324    251    243    312    321
>     64     240    237    314    326    329    306    312    321
>    128     301    302    314    324    511    416    313    324
>    256     498    554    327    332    970    837    326    332
>    512     532    553    324    332   1025    932    326    335
>   1024     705    718    325    333   1489   1231    324    330
>   2048     764    767    324    334   2708   2175    324    332
>   4096*   1033    476    325    674   4727    782    324    678

I'm struggling a bit to understand these numbers.  Bigger is better, I
assume?  In what units are these numbers?

> Notes:
> 
> Worst case:
> -----------
> We generally lose in the alloc free test (x86_64 3%, IA64 5-10%)
> since the processing overhead increases because we need to look up
> the per cpu structure. Alloc/Free is simply kfree(kmalloc(size, mask)),
> so the objects have the shortest lifetime possible. We would never use
> objects in that way but the measurement is important to show the worst
> case overhead created.
> 
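
For anyone wanting to get a feel for the two access patterns being compared,
here is a rough userspace analogue. It is only a sketch: the benchmark code
itself is not shown in this thread, malloc()/free() stand in for
kmalloc()/kfree(), and the function names and the choice to time only the
allocation phase of the second pattern are my assumptions. The numbers it
produces say nothing about SLUB.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Nanoseconds elapsed from a to b. */
static long long ns_between(struct timespec a, struct timespec b)
{
	return (long long)(b.tv_sec - a.tv_sec) * 1000000000LL
		+ (b.tv_nsec - a.tv_nsec);
}

/*
 * "Alloc/Free" pattern: each object is freed right after it is
 * allocated. size must be at least 1. Returns elapsed ns, or -1 if
 * an allocation fails. (Hypothetical helper, not from the patchset.)
 */
long long time_alloc_free(size_t size, int iters)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++) {
		void *p = malloc(size);

		if (!p)
			return -1;
		/* Touch the object so the pair is not optimized away. */
		*(volatile char *)p = 0;
		free(p);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return ns_between(t0, t1);
}

/*
 * "Kmalloc" pattern: allocate a run of objects and free them only
 * afterwards. Only the allocation phase is timed (an assumption).
 * size and iters must be at least 1.
 */
long long time_alloc_run(size_t size, int iters)
{
	struct timespec t0, t1;
	void **objs = malloc(sizeof(*objs) * (size_t)iters);
	int n = 0;

	if (!objs)
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (n < iters && (objs[n] = malloc(size)) != NULL) {
		*(volatile char *)objs[n] = 0;
		n++;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (int i = 0; i < n; i++)
		free(objs[i]);
	free(objs);
	return n == iters ? ns_between(t0, t1) : -1;
}
```

Calling time_alloc_free(64, 1000000) and time_alloc_run(64, 1000000) from a
small main() and dividing by the iteration count gives a per-operation cost
for each pattern.
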
> Single Threaded:
> ----------------
> The single threaded kmalloc test shows behavior of a continual stream
> of allocation without contention. In the SMP case the losses are minimal.
> In the NUMA case we already have a winner there because the per cpu structure
> is placed local to the processor. So in the single threaded case we already
> win around 5% just by placing things better.
> 
> Concurrent Alloc:
> -----------------
> We have varying gains of up to 50% on NUMA because we are now never updating
> a cacheline used by the other processor and the data structures are local
> to the processor.
> 
> The SMP case shows gains but they are smaller (especially since
> this is the smallest SMP system possible.... 2 CPUs). So only up
> to 25%.
> 
> Page allocator pass through
> ---------------------------
> There is a significant difference in the columns marked with a * because
> of the way that allocations for page sized objects are handled.

OK, but what happened to the third pair of columns (Concurrent Alloc,
Kmalloc) for 1024 and 2048-byte allocations?  They seem to have become
significantly slower?

Thanks for running the numbers, but it's still a bit hard to work out
whether these changes are an aggregate benefit?
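
One way to make the third pair of columns easier to read is to compute the
relative difference between Norm and PCPU row by row. The sketch below does
that for the IA64 Concurrent Alloc / Kmalloc columns, using the values from
the table above. The struct and function names are made up for illustration,
and the code deliberately does not assume whether a bigger or a smaller
number is the better one.

```c
#include <stdio.h>

/* One table row: object size and the Norm and PCPU readings. */
struct sample {
	int size;
	int norm;
	int pcpu;
};

/* IA64, Concurrent Alloc / Kmalloc columns, copied from the table. */
static const struct sample ia64_concurrent_kmalloc[] = {
	{     8,   98,   90 },
	{    16,  115,   98 },
	{    32,  146,  111 },
	{    64,  214,  133 },
	{   128,  321,  163 },
	{   256,  415,  224 },
	{   512,  487,  341 },
	{  1024,  937,  609 },
	{  2048, 2494, 1207 },
	{  4096, 2497, 1217 },
	{  8192, 2367, 1188 },
	{ 16384, 4464, 1904 },
};

/* Change of PCPU relative to Norm, in percent (negative = smaller). */
double pct_change(int norm, int pcpu)
{
	return 100.0 * (pcpu - norm) / norm;
}

/* Print size, Norm, PCPU and the relative change for every row. */
void print_changes(void)
{
	size_t n = sizeof(ia64_concurrent_kmalloc) /
		   sizeof(ia64_concurrent_kmalloc[0]);

	for (size_t i = 0; i < n; i++) {
		const struct sample *s = &ia64_concurrent_kmalloc[i];

		printf("%6d %6d %6d %+7.1f%%\n", s->size, s->norm, s->pcpu,
		       pct_change(s->norm, s->pcpu));
	}
}
```

Calling print_changes() from main() prints one line per object size. The
same helper can be pointed at any other pair of columns by swapping the
array.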

> If we handle
> the allocations in the slab allocator (Norm) then the alloc free test
> results are superb since we can use the per cpu slab to just pass a pointer
> back and forth. The page allocator pass through (PCPU) shows that the page
> allocator may have problems with giving back the same page after a free.
> Or there is something else in the page allocator that creates significant
> overhead compared to slab. Needs to be checked out I guess.
> 
> However, the page allocator pass through is a win in the other cases
> since we can cut out the page allocator overhead. That is the more typical
> load of allocating a sequence of objects and we should optimize for that.
> 
> (+ = Must be some cache artifact here or code crossing a TLB boundary.
> The result is reproducible)
> 

Most Linux machines are uniprocessor.  We should keep an eye on what effect
a change like this has on code size and performance for CONFIG_SMP=n
builds.

