Date: Thu, 20 Apr 2006 09:41:11 -0700
From: "Luck, Tony"
Subject: Re: [RFC] - Kernel text replication on IA64
To: Jack Steiner
Cc: linux-ia64@vger.kernel.org, lee.schermerhorn@hp.com,
    clameter@sgi.com, linux-mm@kvack.org

On Thu, Apr 20, 2006 at 08:53:16AM -0500, Jack Steiner wrote:
> There was a question about the effects of kernel text replication last
> month.  I was curious so I resurrected an old trillian patch (Tony
> Luck's) & got it working again.  Here is the preliminary patch & some
> data about the benefit.

It's not *that* old ... just from Atlas days, not from Trillian.  Google
carbon dating shows old versions of this patch from around the August
2002 time frame (against 2.4.19).

> All tests were run on a 12 cpu (Itanium2, 900 MHz, 1.5MB L3), 6 node
> system.  All cpus are idle with the exception of the cpu running the
> test.

Presumably results would be even better on a bigger system, where the
distance from the test node to node 0 is even greater.  On truly huge
systems, does node 0 (or the interconnects leading to it) ever suffer
measurable slowdown from the instruction-fetch traffic coming from the
other 255 (or more) nodes?

> Enabling replication reserves 1 additional DTLB entry for kernel code.
> This reduces the number of DTLB entries that are available for user
> code.  There is the potential that this could impact some applications.
> Additional measurements are still needed.

Ken's recent patch to free up the DTLB entry that is currently used for
per-cpu data would mitigate this (though I'm sure he'll be unamused if I
blow the 1.6% gain he saw on his transaction processing benchmark with
this :-)

> ------------------------------------------------------
> Cold cache.  Running on node 3 of a 6 node system
>
>                      NoRep       Rep    %improvement
> null         :      0.894 :    0.812 :       9.17
> forkexit     :    521.518 :  416.467 :      20.14
> openclose    :    106.683 :   75.000 :      29.70
> pid          :      2.577 :    2.356 :       8.58
> time         :     17.882 :   11.693 :      34.61
> gettimeofday :     17.523 :   11.695 :      33.26

Those are some pretty nice numbers.

> ------------------------------------------------------
> Hot cache.  Running on node 3 of a 6 node system
>
>                      NoRep       Rep    %improvement
> forkexit     :    162.019 :  151.927 :       6.23
> openclose    :      8.445 :    8.128 :       3.75

These ones are good too.  But a bit surprising ... it implies that we
are still seeing significant kernel-code i-cache misses even in a
micro-benchmark tight loop.  Montecito (with its big L2 i-cache) should
be better here (and so see less improvement from replicated text).

> + Say Y if you want to eeplicate kernel text on each node of a NUMA system.

s/eeplicate/replicate/

-Tony
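
P.S. For readers who haven't dug into the patch itself, the core idea
is simple: copy the kernel's text section into memory local to each
node, then point that node's pinned instruction translation register at
the local copy so instruction fetches never cross the interconnect to
node 0.  A very rough sketch follows; map_node_text() is a hypothetical
placeholder for the arch-specific TR insertion, and details such as the
alignment required by the large kernel TR page are glossed over:

    #include <linux/init.h>
    #include <linux/mm.h>
    #include <linux/nodemask.h>
    #include <linux/string.h>
    #include <asm/page.h>
    #include <asm/sections.h>     /* _stext, _etext */

    /* Hypothetical arch hook: repoint this node's pinned instruction
     * TR at the node-local physical copy of the kernel text. */
    extern void map_node_text(int nid, unsigned long text_pa);

    static void __init replicate_kernel_text(void)
    {
            unsigned long size = (unsigned long)_etext -
                                 (unsigned long)_stext;
            int order = get_order(size);
            int nid;

            for_each_online_node(nid) {
                    struct page *page;
                    void *copy;

                    /* Node 0 already holds the master copy. */
                    if (nid == 0)
                            continue;

                    page = alloc_pages_node(nid, GFP_KERNEL, order);
                    if (!page)
                            continue;  /* fall back to the node 0 copy */

                    copy = page_address(page);
                    memcpy(copy, _stext, size);

                    map_node_text(nid, __pa(copy));
            }
    }

Note that only instruction fetches go to the local copy; kernel data
references still resolve through the normal (node 0) mapping, which is
why the feature costs the extra DTLB entry discussed above.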
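
And for context on that last hunk: the help text being corrected
presumably sits in a Kconfig entry along these lines (the option name
KTEXT_REPLICATION and its dependencies are guesses, not taken from the
patch):

    config KTEXT_REPLICATION
            bool "Kernel text replication"
            depends on NUMA && IA64
            help
              Say Y if you want to replicate kernel text on each node
              of a NUMA system.  Enabling replication reserves one
              additional DTLB entry for kernel code, leaving one fewer
              entry available for user mappings.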