From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id EACD98D003B for ; Mon, 11 Apr 2011 20:48:10 -0400 (EDT) Received: from d23relay05.au.ibm.com (d23relay05.au.ibm.com [202.81.31.247]) by e23smtp02.au.ibm.com (8.14.4/8.13.1) with ESMTP id p3C0gZN7023988 for ; Tue, 12 Apr 2011 10:42:35 +1000 Received: from d23av03.au.ibm.com (d23av03.au.ibm.com [9.190.234.97]) by d23relay05.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id p3C0m7wJ1728550 for ; Tue, 12 Apr 2011 10:48:07 +1000 Received: from d23av03.au.ibm.com (loopback [127.0.0.1]) by d23av03.au.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id p3C0m6o8030367 for ; Tue, 12 Apr 2011 10:48:07 +1000 Date: Tue, 12 Apr 2011 10:48:18 +1000 From: Christopher Yeoh Subject: Re: [Resend] Cross Memory Attach v3 [PATCH] Message-ID: <20110412104818.59ed8c50@rockpopper> In-Reply-To: <20110325235225.2aa4ebdc@lilo> References: <20110315143547.1b233cd4@lilo> <20110315161623.4099664b.akpm@linux-foundation.org> <20110317154026.61ddd925@lilo> <20110317125427.eebbfb51.akpm@linux-foundation.org> <20110321122018.6306d067@lilo> <20110320185532.08394018.akpm@linux-foundation.org> <20110323125213.69a7a914@lilo> <877hbpcuym.fsf@rustcorp.com.au> <20110325235225.2aa4ebdc@lilo> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rusty Russell , linux-mm@kvack.org, Linus Torvalds On Fri, 25 Mar 2011 23:52:25 +1030 cyeoh@ozlabs.au.ibm.com wrote: > On Thu, 24 Mar 2011 09:20:41 +1030 > Rusty Russell wrote: > > > On Wed, 23 Mar 2011 12:52:13 +1030, Christopher Yeoh > > wrote: > > > On Sun, 20 Mar 2011 18:55:32 -0700 > > > Andrew Morton wrote: > > > > On Mon, 21 Mar 2011 12:20:18 +1030 Christopher Yeoh > > > > wrote: > > > > > > > > > On Thu, 17 Mar 2011 12:54:27 -0700 > > > > > Andrew Morton wrote: > > > > > > On Thu, 17 Mar 2011 15:40:26 +1030 > > > > > > Christopher Yeoh wrote: > > > > > > > > > > > > > > Thinking out loud: if we had a way in which a process > > > > > > > > can add and remove a local anonymous page into > > > > > > > > pagecache then other processes could access that page > > > > > > > > via mmap. If both processes map the file with a > > > > > > > > nonlinear vma they they can happily sit there flipping > > > > > > > > pages into and out of the shared mmap at arbitrary file > > > > > > > > offsets. The details might get hairy ;) We wouldn't > > > > > > > > want all the regular mmap semantics of > > > > > > > > > > > > > > Yea, its the complexity of trying to do it that way that > > > > > > > eventually lead me to implementing it via a syscall and > > > > > > > get_user_pages instead, trying to keep things as simple as > > > > > > > possible. > > > > > > > > > > > > The pagecache trick potentially gives zero-copy access, > > > > > > whereas the proposed code is single-copy. Although the > > > > > > expected benefits of that may not be so great due to TLB > > > > > > manipulation overheads. > > > > > > > > > > > > I worry that one day someone will come along and implement > > > > > > the pagecache trick, then we're stuck with obsolete code > > > > > > which we have to maintain for ever. > > > > Since this is for MPI (ie. message passing), they really want copy > > semantics. If they didn't want copy semantics, they could just > > MAP_SHARED some memory and away they go... > > > > You don't want to implement copy semantics with page-flipping; you > > would need to COW the outgoing pages, so you end up copying *and* > > trapping. > > > > If you are allowed to replace "sent" pages with zeroed ones or > > something then you don't have to COW. Yet even if your messages > > were a few MB, it's still not clear you'd win; in a NUMA world > > you're better off copying into a local page and then working on it. > > > > Copying just isn't that bad when it's cache-hot on the sender and > > you are about to use it on the receiver, as MPI tends to be. And > > it's damn simple. > > > > But we should be able to benchmark an approximation to the > > page-flipping approach anyway, by not copying the data and doing the > > appropriate tlb flushes in the system call. > > I've done some hacking on the naturally ordered and randomly ordered > ring bandwidth tests of hpcc to try to simulate what we'd get with a > page flipping approach. > > - Modified hpcc so it checksums the data on the receiver. normally it > just checks the data in a couple of places but the checksum > simulates the receiver actually using all of the data > > - For the page flipping scenario > - allocate from a shared memory pool for data that is to be > transferred > - instead of sending the data via OpenMPI send some control data > instead which describes where the receiver can read the data in > shared memory. Thus "zero copy" with just checksum > - Adds tlb flushing for sender/receiver processes > > The results are below (numbers are in MB/s, higher the better). Base > is double copy via shared memory, CMA is single copy. > > Num MPI Processes > Naturally Ordered 4 8 16 32 > Base 1152 929 567 370 > CMA 3682 3071 2753 2548 > Zero Copy 4634 4039 3149 2852 > > Num MPI Processes > Randomly Ordered 4 8 16 32 > Base 1154 927 588 389 > CMA 3632 3060 2897 2904 > Zero Copy 4668 3970 3077 2962 > > the benchmarks were run on a 32 way (SMT-off) Power6 machine. > > So we can see that on lower numbers of processes there is a gain in > performance between single and zero copy (though the big jump is > between double and single copy), but this reduces as the number of > processes increases. The difference between the single and zero copy > approach reduces to almost nothing for when the number of MPI > processes is equal to the number of processors (for the randomly > ordered ring bandwidth). Andrew - just wondering if you had any more thoughts about this? Any other information you were looking for? Regards, Chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org