Date: Tue, 26 Sep 2006 12:58:57 -0700 (PDT)
From: David Rientjes
To: Paul Jackson
Cc: linux-mm@kvack.org, akpm@osdl.org, nickpiggin@yahoo.com.au, ak@suse.de, mbligh@google.com, rohitseth@google.com, menage@google.com, clameter@sgi.com
Subject: Re: [RFC] another way to speed up fake numa node page_alloc
In-Reply-To: <20060926122445.717c7c11.pj@sgi.com>
References: <20060925091452.14277.9236.sendpatchset@v0> <20060926000612.9db145a9.pj@sgi.com> <20060926122445.717c7c11.pj@sgi.com>

On Tue, 26 Sep 2006, Paul Jackson wrote:

> So what if the average time between zaps is 0.9 seconds instead of 1.0
> seconds?  More realistically, we are talking something like 0.99999
> versus 1.00000 seconds, given that writing a 64 bit word on a 32 bit
> arch offers only a tiny window for lost races.
>
> Lost races that break things are unacceptable, even in tiny windows.
>
> But lost races that just slightly nudge an already arbitrary and not
> particularly fussy performance heuristic are not worth a single line
> of code to avoid.
>

Why is it arbitrary, though?  That one-second interval is hard-coded into
the page allocation code as the performance-enhancement window upon which
your speed-up relies.  If time is the metric used to decide when we should
go back and see whether nodes have regained memory, and I disagree that it
should be, then surely this single fixed window cannot possibly achieve
the most efficient results you can squeeze out of your implementation for
all possible workloads.

In my opinion a more appropriate trigger would be the moment we _know_
the amount of free memory in a zone has changed.  And if you're seeking
to distribute memory among mems as your original post specified, then
you could even get away with a simple counter that zaps the nodemask
after every X page allocations.  That would _not_ be susceptible to race
conditions among multiple CPUs on one node.

> > When we free memory from a specific zone, why is it not better to use
> > zone_to_nid and then zap that _node_ in the nodemask only because we
> > are guaranteed that the status has changed?
>
> It might be better.  And it might not.  More likely, it would be an
> immeasurable difference except on custom microbenchmarks designed to
> highlight this difference one way or the other.
>

If that's the case, then the entire speed-up is broken.  As it stands
right now you're zapping the _entire_ nodemask every second and going
back to recheck all the nodes on which you failed to find free memory in
the past.  Under my suggestion, you only zap a node when it is known,
because of a free, that its free memory has increased.  So when my
process wants to mlock and allocate tons and tons of pages, you're
zapping unnecessarily: the _exact_ same nodemask is going to reproduce
itself, only after unnecessary delay.
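Concretely, something like the sketch below is what I have in mind.  It
is only an illustration; the names fake_nodes_full and note_zone_free
are made up here and are not taken from your patch:

	#include <linux/mmzone.h>
	#include <linux/nodemask.h>

	/*
	 * Hypothetical stand-in for the mask your patch zaps once a
	 * second: a set bit means the node was recently passed over as
	 * too full to allocate on.
	 */
	static nodemask_t fake_nodes_full;

	/*
	 * Called from the page-freeing path: clear only the bit of the
	 * node that just gained free memory, so that node alone is
	 * rechecked on the next allocation.
	 */
	static inline void note_zone_free(struct zone *zone)
	{
		node_clear(zone_to_nid(zone), fake_nodes_full);
	}

And since node_clear() comes down to an atomic clear_bit(), concurrent
frees on different nodes cannot stomp on each other's bits.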
> And unless I locked the bit clear, I'd still have to occasionally zap
> the entire nodemask.  Setting or clearing individual bits in a mask
> opens a bigger critical section to races.  Eventually, after losing
> enough such races, that nodemask would be suitable for donating a
> little bit of entropy to the random number subsystem -- mush.
>

The only such races that exist are among the CPUs on that particular
node, and the node's bit is only zapped when pages are freed from a zone
on that node.  And since the node's bit is only turned on when the node
has been passed by and deemed too full to allocate on, I don't see where
the race exists.  That is what we want: since we aren't sure whether the
free has allowed us to allocate there again, all we are doing is saying
that the node should be rechecked on the next alloc.

> > Four people on the Cc list to this email, however, still have access
> > to my script.
>
> Perhaps you could ping them off-list, and see if they are in a position
> to participate.
>

Done.

		David