* returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
From: Christoph Hellwig @ 2005-04-25 14:47 UTC
To: Jes Sorensen; +Cc: Andrew Morton, linux-kernel, linux-mm

Jes has this shiny new IA64 uncached foo bar whizbang driver (see the
patch at http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
which has a nopage routine that calls remap_pfn_range from ->nopage for
uncached memory that's not part of the mem map. Because ->nopage wants
to return a struct page * he's allocating a normal kernel page and
actually returns that one - to get the page he wants into the
pagetables he does all the pagetable manipulation himself beforehand
(see the gory details of pagetable walks and modification inside a
driver in the patch above).

I don't think these hacks are acceptable for a driver, especially as
the problem can easily be solved by calling remap_pfn_range in ->mmap -
except SGI also wants node locality..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

* Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
From: Jes Sorensen @ 2005-04-26 22:14 UTC
To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, linux-mm

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
Christoph> which has a nopage routine that calls remap_pfn_range from
Christoph> ->nopage for uncached memory that's not part of the mem
Christoph> map. Because ->nopage wants to return a struct page * he's
Christoph> allocating a normal kernel page and actually returns that
Christoph> one - to get the page he wants into the pagetables he does
Christoph> all the pagetable manipulation himself beforehand (see the
Christoph> gory details of pagetable walks and modification inside a
Christoph> driver in the patch above).

Christoph> I don't think these hacks are acceptable for a driver,
Christoph> especially as the problem can easily be solved by calling
Christoph> remap_pfn_range in ->mmap - except SGI also wants node
Christoph> locality..

Christoph,

Let me try and provide some more background then.

Simply doing remap_pfn_range in the mmap call doesn't work for large
systems.

Take the example of a 2048 CPU system (512 CPUs per partition/machine
- each machine running its own OS) running an MPI application
across all 2048 CPUs using cross coherency domain traffic.

A standard application will allocate 56 DDQs per thread (the DDQs are
used for synchronization and allocated through the mspec driver) which
translates to having 126976 uncached cache lines reserved or 992 pages
per worker thread. The controlling thread on each partition will mmap
the entire DDQ space up front and then fork off the workers who will
then go and touch their pages. With the current approach by the driver
this means that if you have two threads per node you will end up with
~32MB of uncached memory allocated per node.

Alternatively doing this at mmap time having 512 worker threads per
partition, the result is ~8GB (992 * 16K * 512) of uncached memory all
allocated by the master thread on each machine.

A typical system configuration is 4GB or 8GB of RAM per node. This
means that by using the remap_pfn_range at mmap time approach and the
kernel's standard overhead you end up completely starving the first
couple of nodes of memory on each partition.

Combine this with the effect of all synchronization traffic hitting
the same node, you effectively end up with 512 CPUs all constantly
hammering the same memory controller to death.

FWIW, an initial implementation of the driver was done by someone
within SGI, prior to me having anything to do with it. It was using
the remap_pfn_range at mmap time approach and it was noticed then that
16 worker threads was pretty much enough to overwhelm a node.

Having the page allocations and drop ins on a first touch basis is
consistent with what is done for cached memory and seems a pretty
reasonable approach to me. Sure it isn't particularly pretty to use
the ->nopage approach, nobody disagrees with you there, but what is
the alternative?

Is the problem more an issue of the ugliness of allocating a page
just to return it to the nopage handler or the fact that we're trying
to make the allocations node local?

If you have any suggestions for how to do this differently, then I'm
all ears.

Cheers,
Jes

PS: Thanks to Robin Holt for providing more info on MPI application
behavior than I ever wanted to know ;-)

* Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
From: Jeff Garzik @ 2005-04-27 15:53 UTC
To: Jes Sorensen; +Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm

Jes Sorensen wrote:
>>>>>>"Christoph" == Christoph Hellwig <hch@infradead.org> writes:
>
>
> Christoph> http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
> Christoph> which has a nopage routine that calls remap_pfn_range from
> Christoph> ->nopage for uncached memory that's not part of the mem
> Christoph> map. Because ->nopage wants to return a struct page * he's
> Christoph> allocating a normal kernel page and actually returns that
> Christoph> one - to get the page he wants into the pagetables he does
> Christoph> all the pagetable manipulation himself beforehand (see the
> Christoph> gory details of pagetable walks and modification inside a
> Christoph> driver in the patch above).
>
> Christoph> I don't think these hacks are acceptable for a driver,
> Christoph> especially as the problem can easily be solved by calling
> Christoph> remap_pfn_range in ->mmap - except SGI also wants node
> Christoph> locality..
>
> Christoph,
>
> Let me try and provide some more background then.
>
> Simply doing remap_pfn_range in the mmap call doesn't work for large
> systems.
>
> Take the example of a 2048 CPU system (512 CPUs per partition/machine
> - each machine running its own OS) running an MPI application
> across all 2048 CPUs using cross coherency domain traffic.
>
> A standard application will allocate 56 DDQs per thread (the DDQs are
> used for synchronization and allocated through the mspec driver) which
> translates to having 126976 uncached cache lines reserved or 992 pages
> per worker thread. The controlling thread on each partition will mmap
> the entire DDQ space up front and then fork off the workers who will
> then go and touch their pages. With the current approach by the driver
> this means that if you have two threads per node you will end up with
> ~32MB of uncached memory allocated per node.
>
> Alternatively doing this at mmap time having 512 worker threads per
> partition, the result is ~8GB (992 * 16K * 512) of uncached memory all
> allocated by the master thread on each machine.
>
> A typical system configuration is 4GB or 8GB of RAM per node. This
> means that by using the remap_pfn_range at mmap time approach and the
> kernel's standard overhead you end up completely starving the first
> couple of nodes of memory on each partition.
>
> Combine this with the effect of all synchronization traffic hitting
> the same node, you effectively end up with 512 CPUs all constantly
> hammering the same memory controller to death.
>
> FWIW, an initial implementation of the driver was done by someone
> within SGI, prior to me having anything to do with it. It was using
> the remap_pfn_range at mmap time approach and it was noticed then that
> 16 worker threads was pretty much enough to overwhelm a node.
>
> Having the page allocations and drop ins on a first touch basis is
> consistent with what is done for cached memory and seems a pretty
> reasonable approach to me. Sure it isn't particularly pretty to use
> the ->nopage approach, nobody disagrees with you there, but what is
> the alternative?

I don't see anything wrong with a ->nopage approach.

At Linus's suggestion, I used ->nopage in the implementation of
sound/oss/via82cxxx_audio.c.

	Jeff

* Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
From: Christoph Hellwig @ 2005-04-27 15:55 UTC
To: Jeff Garzik
Cc: Jes Sorensen, Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm

On Wed, Apr 27, 2005 at 11:53:15AM -0400, Jeff Garzik wrote:
> I don't see anything wrong with a ->nopage approach.
>
> At Linus's suggestion, I used ->nopage in the implementation of
> sound/oss/via82cxxx_audio.c.

The difference is that you return kernel memory (actually
pci_alloc_consistent memory that has its own set of problems), while
this is memory not in mem_map, so he allocates some regular kernel
memory too to have a struct page and just leaks it.

* Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
From: Jes Sorensen @ 2005-04-27 18:03 UTC
To: Christoph Hellwig; +Cc: Jeff Garzik, Andrew Morton, linux-kernel, linux-mm

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> On Wed, Apr 27, 2005 at 11:53:15AM -0400, Jeff Garzik
Christoph> wrote:
>> I don't see anything wrong with a ->nopage approach.
>>
>> At Linus's suggestion, I used ->nopage in the implementation of
>> sound/oss/via82cxxx_audio.c.

Christoph> The difference is that you return kernel memory (actually
Christoph> pci_alloc_consistent memory that has its own set of
Christoph> problems), while this is memory not in mem_map, so he
Christoph> allocates some regular kernel memory too to have a struct
Christoph> page and just leaks it

Are you suggesting that we change do_no_page to handle this as a
special return value then?

Jes

* Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
From: Russell King @ 2005-04-27 18:55 UTC
To: Jes Sorensen
Cc: Christoph Hellwig, Jeff Garzik, Andrew Morton, linux-kernel, linux-mm

On Wed, Apr 27, 2005 at 02:03:50PM -0400, Jes Sorensen wrote:
> >>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:
>
> Christoph> On Wed, Apr 27, 2005 at 11:53:15AM -0400, Jeff Garzik
> Christoph> wrote:
> >> I don't see anything wrong with a ->nopage approach.
> >>
> >> At Linus's suggestion, I used ->nopage in the implementation of
> >> sound/oss/via82cxxx_audio.c.
>
> Christoph> The difference is that you return kernel memory (actually
> Christoph> pci_alloc_consistent memory that has its own set of
> Christoph> problems), while this is memory not in mem_map, so he
> Christoph> allocates some regular kernel memory too to have a struct
> Christoph> page and just leaks it
>
> Are you suggesting then that we change do_no_page to handle this as a
> special return value then?

If you're looking to mmap dma memory, ARM already supports the API
which was discussed (although not properly imho) on linux-arch.

I previously posted a potential patch for x86, but it has the problem
that remap_pfn_range() will not work on such memory because it isn't
marked reserved.

In addition, if you're mmapping dma memory on x86 as is, you're
providing a potential security hole - the x86 DMA memory allocator
does not extend its zeroing to cover the entire last page of the
allocation.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

* Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3
From: William Lee Irwin III @ 2005-05-03 20:40 UTC
To: Jes Sorensen; +Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm

On Tue, Apr 26, 2005 at 06:14:02PM -0400, Jes Sorensen wrote:
> Having the page allocations and drop ins on a first touch basis is
> consistent with what is done for cached memory and seems a pretty
> reasonable approach to me. Sure it isn't particularly pretty to use
> the ->nopage approach, nobody disagrees with you there, but what is
> the alternative?
> Is the problem more an issue of the ugliness of allocating a page
> just to return it to the nopage handler or the fact that we're trying
> to make the allocations node local?
> If you have any suggestions for how to do this differently, then I'm
> all ears.
> Cheers,
> Jes
> PS: Thanks to Robin Holt for providing more info on MPI application
> behavior than I ever wanted to know ;-)

This and several other issues all fall down when instead of ->nopage(),
the vma's fault handling method takes a vma, a virtual address, and an
access type, and returns a VM_FAULT_* code.

Yes, I remember how I got heavily criticized the last time I
wrote/suggested/whatever this.

-- wli