Re: [RFC] another way to speed up fake numa node page_alloc

From: David Rientjes <rientjes@cs.washington.edu>
To: Paul Jackson <pj@sgi.com>
Cc: linux-mm@kvack.org, akpm@osdl.org,
	Nick Piggin <nickpiggin@yahoo.com.au>, Andi Kleen <ak@suse.de>,
	mbligh@google.com, rohitseth@google.com, menage@google.com,
	clameter@sgi.com
Subject: Re: [RFC] another way to speed up fake numa node page_alloc
Date: Mon, 25 Sep 2006 23:08:17 -0700 (PDT)
Message-ID: <Pine.LNX.4.64N.0609252214590.14826@attu4.cs.washington.edu>
In-Reply-To: <20060925091452.14277.9236.sendpatchset@v0>

On Mon, 25 Sep 2006, Paul Jackson wrote:

>  - Some per-node data in the struct zonelist is now modified frequently,
>    with no locking.  Multiple CPU cores on a node could hit and mangle
>    this data.  The theory is that this is just performance hint data,
>    and the memory allocator will work just fine despite any such mangling.
>    The fields at risk are the struct 'zonelist_faster' fields 'fullnodes'
>    (a nodemask_t) and 'last_full_zap' (unsigned long jiffies).  It should
>    all be self correcting after at most a one second delay.
>  

If 'last_full_zap' gets mangled in the scenario with multiple CPUs on 
one node, we might clear 'fullnodes' more often than once every 1*HZ, 
and that clear is always done by one CPU.  Since the only purpose of the 
delay is to allow a period of time to go by during which these hints 
actually serve a purpose, the entire speed-up would then be degraded.  
I agree that adding locking for 'zonelist_faster' is probably going too 
far for performance hint data, but it seems necessary for 
'last_full_zap' if the goal is to preserve this 1*HZ delay.
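
As an illustration only, here is a minimal user-space sketch of one way 
to preserve the 1*HZ delay without taking a lock: the timestamp is 
advanced with a compare-and-swap, so only the CPU that wins the swap 
clears the mask.  This substitutes an atomic update for the locking 
discussed above and is not taken from Paul's patch.  The names 
'struct zl_hint', 'zl_hint_maybe_zap', 'zl_hint_mark_full' and 
'zl_hint_is_full', the 64-node limit, and the HZ value are all 
assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define HZ 1000UL	/* assumed tick rate, for the sketch only */

/* Hypothetical stand-in for the hint fields; assumes at most 64 nodes. */
struct zl_hint {
	_Atomic uint64_t fullnodes;		/* bit n set: node n looked full */
	_Atomic unsigned long last_full_zap;	/* tick count of the last zap */
};

/*
 * Clear 'fullnodes' if at least HZ ticks have passed since the last zap.
 * Returns true only for the caller that actually performed the zap.
 */
static bool zl_hint_maybe_zap(struct zl_hint *h, unsigned long now)
{
	unsigned long last = atomic_load(&h->last_full_zap);

	/* Unsigned subtraction stays correct if the tick counter wraps. */
	if (now - last < HZ)
		return false;		/* interval not over: keep the hints */

	/* Only the caller that advances the timestamp clears the mask. */
	if (!atomic_compare_exchange_strong(&h->last_full_zap, &last, now))
		return false;		/* another CPU zapped first */

	atomic_store(&h->fullnodes, 0);
	return true;
}

/* Record that 'node' looked full on the last allocation attempt. */
static void zl_hint_mark_full(struct zl_hint *h, unsigned int node)
{
	atomic_fetch_or(&h->fullnodes, UINT64_C(1) << node);
}

/* Test whether 'node' is currently hinted as full. */
static bool zl_hint_is_full(struct zl_hint *h, unsigned int node)
{
	return (atomic_load(&h->fullnodes) >> node) & 1;
}
```

In kernel code the same idea would presumably be written with cmpxchg() 
on the timestamp and time_after() for the comparison, but whether the 
extra atomic operation is cheap enough for the allocation fast path 
would have to be measured.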

>  - I pay no attention to the various watermarks and such in this performance
>    hint.  A node could be marked full for one watermark, and then skipped
>    over when searching for a page using a different watermark.  I think
>    that's actually quite ok, as it will tend to slightly increase the
>    spreading of memory over other nodes, away from a memory stressed node.
> 

Since we currently lack support for dynamically allocating nodes with a 
node hotplug API, it actually seems advantageous to have a memory stressed 
node in a pool or cpuset of 'mems'.  Now when another cpuset is facing 
memory pressure I can cherry-pick an untouched node from a less bogged 
down cpuset for my own use.

It seems like an immutable time interval embedded in the page alloc code 
may not be the best way to measure when a full zap should occur.  A more 
appropriate metric might be to do a full zap after a certain threshold of 
pages have been freed.  If it's done that way, the zap would occur in a 
more appropriate place (when pages are freed) as opposed to when pages are 
allocated.  The overhead that we incur of zapping the nodemask every 
second and then being forced to recheck all the nodes again would then be 
eliminated in the case where there's been no change.  Based on the 
benchmarks I ran earlier, that's a common case.  Zapping is more 
appropriate when we're freeing pages, since that is when we know for 
sure that memory is becoming available somewhere.
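
To make the idea concrete, here is a rough user-space sketch of a zap 
driven by freed pages rather than by elapsed time.  Everything in it is 
hypothetical: 'struct zl_free_hint', 'zl_note_pages_freed' and the 
threshold value are invented for the example, and a real version would 
have to hook into the page freeing path.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define ZAP_FREE_THRESHOLD 1024L	/* assumed value, would need tuning */

/* Hypothetical hint data; assumes at most 64 nodes. */
struct zl_free_hint {
	_Atomic uint64_t fullnodes;	/* bit n set: node n looked full */
	atomic_long freed_since_zap;	/* pages freed since the last zap */
};

/*
 * Called from a (hypothetical) free path with the number of pages just
 * freed.  Once enough pages have been freed, the 'fullnodes' hints are
 * stale, so clear them.  Returns true if this call performed a zap.
 */
static bool zl_note_pages_freed(struct zl_free_hint *h, long nr_pages)
{
	long total = atomic_fetch_add(&h->freed_since_zap, nr_pages) + nr_pages;

	if (total < ZAP_FREE_THRESHOLD)
		return false;		/* too little freed: keep the hints */

	/*
	 * Two CPUs can both get here and both zap.  That is harmless for
	 * hint data: the mask simply gets cleared twice.
	 */
	atomic_store(&h->freed_since_zap, 0);
	atomic_store(&h->fullnodes, 0);
	return true;
}
```

With this shape the allocation path only ever reads 'fullnodes', and 
when nothing is being freed the mask is never cleared, so there is no 
periodic rechecking of every node.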

Note to self: in 2.6.18-rc7-mm1, NUMA_BUILD is just a synonym for 
CONFIG_NUMA.  And since both it and CONFIG_NUMA_EMU are defined by 
default on x86_64, we're going to have overhead on a single processor 
system.  In my earlier patch I started extracting a macro that could be 
tested in generic kernel code to determine at least whether NUMA 
emulation was being _used_.  This might need to make a comeback if this 
type of implementation is considered later.

This is a creative solution, especially considering the use of a 
statically-sized zlfast_ptr to find zlfast hidden away in struct zonelist.  
This definitely seems to be headed in the right direction because it works 
in both the real NUMA case and the fake NUMA case.  I would really like to 
run benchmarks on this implementation as I have done for the others but I 
no longer have access to a 64-bit machine.  I don't see how it could cause 
a performance degradation in the non-NUMA case.

		David