Message-ID: <426FB56B.5000006@pobox.com>
Date: Wed, 27 Apr 2005 11:53:15 -0400
From: Jeff Garzik
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
To: Jes Sorensen
Cc: Christoph Hellwig, Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <16987.39773.267117.925489@jaguar.mkp.net>
 <20050412032747.51c0c514.akpm@osdl.org>
 <20050413204335.GA17012@infradead.org>
 <20050424101615.GA22393@infradead.org>
 <20050425144749.GA10093@infradead.org>

Jes Sorensen wrote:
>>>>>> "Christoph" == Christoph Hellwig writes:
>
> Christoph> http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
> Christoph> which has a nopage routine that calls remap_pfn_range from
> Christoph> ->nopage for uncached memory that's not part of the mem
> Christoph> map.  Because ->nopage wants to return a struct page *, he's
> Christoph> allocating a normal kernel page and actually returns that
> Christoph> one - to get the page he wants into the pagetables, he does
> Christoph> all the pagetable manipulation himself beforehand (see the
> Christoph> gory details of pagetable walks and modification inside a
> Christoph> driver in the patch above).
>
> Christoph> I don't think these hacks are acceptable for a driver,
> Christoph> especially as the problem can easily be solved by calling
> Christoph> remap_pfn_range in ->mmap - except SGI also wants node
> Christoph> locality.
>
> Christoph,
>
> Let me try and provide some more background, then.
>
> Simply doing remap_pfn_range in the mmap call doesn't work for large
> systems.
>
> Take the example of a 2048 CPU system (512 CPUs per partition/machine,
> each machine running its own OS) running an MPI application across all
> 2048 CPUs using cross coherency domain traffic.
>
> A standard application will allocate 56 DDQs per thread (the DDQs are
> used for synchronization and allocated through the mspec driver), which
> translates to 126976 uncached cache lines reserved, or 992 pages, per
> worker thread.  The controlling thread on each partition will mmap the
> entire DDQ space up front and then fork off the workers, who will then
> go and touch their pages.  With the driver's current approach, this
> means that if you have two threads per node you end up with ~32MB of
> uncached memory allocated per node.
>
> Alternatively, doing this at mmap time with 512 worker threads per
> partition, the result is ~8GB (992 * 16K * 512) of uncached memory, all
> allocated by the master thread on each machine.
>
> A typical system configuration is 4GB or 8GB of RAM per node.  This
> means that with the remap_pfn_range-at-mmap-time approach, plus the
> kernel's standard overhead, you end up completely starving the first
> couple of nodes of memory on each partition.
>
> Combine this with the effect of all synchronization traffic hitting the
> same node, and you effectively end up with 512 CPUs all constantly
> hammering the same memory controller to death.
>
> FWIW, an initial implementation of the driver was done by someone
> within SGI, prior to me having anything to do with it.  It was using
> the remap_pfn_range-at-mmap-time approach, and it was noticed then that
> 16 worker threads was pretty much enough to overwhelm a node.
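For reference, the mmap-time variant described above has roughly the
shape below.  This is only a sketch: the mydrv_* names and the
mydrv_alloc_uncached_page() helper are made up for illustration, not
the actual mspec code.  The point is that every page backing the VMA is
allocated and mapped right here, in the context of the one master
thread calling mmap(), before any worker has touched its DDQs:

#include <linux/fs.h>
#include <linux/mm.h>

/* hypothetical helper: reserves one uncached page and returns its
 * physical address, or 0 on failure */
extern unsigned long mydrv_alloc_uncached_page(void);

static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long off, phys;

	vma->vm_flags |= VM_IO | VM_RESERVED;
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	/* allocate and map every page up front, at mmap time */
	for (off = 0; off < size; off += PAGE_SIZE) {
		phys = mydrv_alloc_uncached_page();
		if (!phys)
			return -ENOMEM;
		if (remap_pfn_range(vma, vma->vm_start + off,
				    phys >> PAGE_SHIFT, PAGE_SIZE,
				    vma->vm_page_prot))
			return -EAGAIN;
	}
	return 0;
}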
> Having the page allocations and drop-ins on a first-touch basis is
> consistent with what is done for cached memory and seems a pretty
> reasonable approach to me.  Sure, it isn't particularly pretty to use
> the ->nopage approach - nobody disagrees with you there - but what is
> the alternative?

I don't see anything wrong with a ->nopage approach.  At Linus's
suggestion, I used ->nopage in the implementation of
sound/oss/via82cxxx_audio.c (rough skeleton of the idea below).

	Jeff
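The conventional ->nopage shape in question - defer each page to first
touch, allocate it on the faulting CPU's node, and hand its struct page
back so the core VM installs the pte - looks roughly like this on
2.6.12.  Again a sketch only: the mydrv_* names are made up, locking is
omitted, and this is not the actual via82cxxx_audio or mspec code.
(The mspec uncached pages under discussion have no struct page, which
is exactly why that driver can't use this simple form as-is.)

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/topology.h>

/* hypothetical per-mapping bookkeeping */
struct mydrv_data {
	unsigned long	nr_pages;
	struct page	**pages;	/* one slot per page in the mapping */
};

static struct page *mydrv_nopage(struct vm_area_struct *vma,
				 unsigned long address, int *type)
{
	struct mydrv_data *data = vma->vm_private_data;
	unsigned long idx = (address - vma->vm_start) >> PAGE_SHIFT;
	struct page *page;

	if (idx >= data->nr_pages)
		return NOPAGE_SIGBUS;

	page = data->pages[idx];
	if (!page) {
		/* first touch: allocate on the faulting CPU's node */
		page = alloc_pages_node(numa_node_id(),
					GFP_KERNEL | __GFP_ZERO, 0);
		if (!page)
			return NOPAGE_OOM;
		data->pages[idx] = page;
	}

	get_page(page);		/* reference for the pte about to be installed */
	if (type)
		*type = VM_FAULT_MINOR;
	return page;
}

static struct vm_operations_struct mydrv_vm_ops = {
	.nopage	= mydrv_nopage,
};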