From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <16843.19972.17026.69228@cargo.ozlabs.ibm.com>
Date: Fri, 24 Dec 2004 10:00:20 +1100
From: Paul Mackerras
Subject: Re: Prezeroing V2 [0/3]: Why and When it works
In-Reply-To: <20041223133745.1d95bb08.akpm@osdl.org>
References: <41C20E3E.3070209@yahoo.com.au> <16843.13418.630413.64809@cargo.ozlabs.ibm.com> <20041223133745.1d95bb08.akpm@osdl.org>
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Andrew Morton
Cc: clameter@sgi.com, linux-ia64@vger.kernel.org, torvalds@osdl.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
List-ID:

Andrew Morton writes:

> When the workload is a gcc run, the pagefault handler dominates the system
> time. That's the page zeroing.

For a program which uses a lot of heap and doesn't fork, that sounds
reasonable.

> x86's movnta instructions provide a way of initialising memory without
> trashing the caches and it has pretty good bandwidth, I believe. We should
> wire that up to these patches and see if it speeds things up.

Yes. I don't know the movnta instruction, but surely, whatever scheme
is used, there has to be a snoop for every cache line's worth of
memory that is zeroed.

The other point is that having the page hot in the cache may well be a
benefit to the program. Using any sort of cache-bypassing zeroing
might not actually make things faster, when the user time as well as
the system time is taken into account.

> > I did some measurements once on my G5 powermac (running a ppc64 linux
> > kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> > page.
>
> 40GB/s. Is that straight into L1 or does the measurement include writeback?

It is the average elapsed time in clear_page, so it would include the
writeback of any cache lines displaced by the zeroing, but not the
writeback of the newly-zeroed cache lines (which we hope will be
modified by the program before they get written back anyway).

This is using the dcbz (data cache block zero) instruction, which
establishes a cache line in modified state with zero contents without
any memory traffic other than a cache line kill transaction sent to
the other CPUs and possible writeback of a dirty cache line displaced
by the newly-zeroed cache line. The new cache line is established in
the L2 cache, because the L1 is write-through on the G5, and all
stores and dcbz instructions have to go to the L2 cache.

Thus, on the G5 (and POWER4, which is similar) I don't think there
will be much if any benefit from having pre-zeroed cache-cold pages.
We can establish the zero lines in cache much faster using dcbz than
we can by reading them in from main memory. If the program uses only
a few cache lines out of each new page, then reading them from memory
might be faster, but that seems unlikely.

Paul.
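
For anyone who wants to check the arithmetic behind the figures in this
thread, here is a rough sketch in Python. The 4kB page size and the 96ns
average come from the messages above. The 128-byte cache line size is an
assumption about the G5 that the thread does not state, and the function
names are made up for illustration. Since dcbz establishes the zeroed lines
in cache without writing to memory, the result is an effective zeroing rate,
not a measured memory bandwidth.

```python
# Back-of-the-envelope check of the clear_page figures quoted in this thread.
PAGE_BYTES = 4096        # 4kB page, from the thread
CLEAR_PAGE_NS = 96       # average elapsed time in clear_page, from the thread
CACHE_LINE_BYTES = 128   # ASSUMPTION: G5 cache line size, not stated in the thread


def bandwidth_gb_per_s(nbytes: int, elapsed_ns: float) -> float:
    """Effective rate in GB/s (10^9 bytes per second)."""
    # One byte per nanosecond is exactly 10^9 bytes per second.
    return nbytes / elapsed_ns


def lines_per_page(page_bytes: int, line_bytes: int) -> int:
    """Cache lines per page, i.e. one dcbz per line to zero the page."""
    return page_bytes // line_bytes


if __name__ == "__main__":
    rate = bandwidth_gb_per_s(PAGE_BYTES, CLEAR_PAGE_NS)
    lines = lines_per_page(PAGE_BYTES, CACHE_LINE_BYTES)
    print(f"effective zeroing rate: {rate:.1f} GB/s")
    print(f"dcbz per page: {lines}, avg {CLEAR_PAGE_NS / lines:.1f} ns per cache line")
```

Under those assumptions the rate comes out a little above the rounded
40GB/s quoted above, and the average cost works out to a few nanoseconds per
zeroed cache line.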