* One idea to free up page flags on NUMA
From: Christoph Lameter @ 2006-09-23 3:02 UTC
To: linux-mm; +Cc: haveblue

Andrew asked for a way to free up page flags and I think there is a way to
get rid of both the node information and the sparse field in the page flags.

For that we adopt some ideas from VIRTUAL_MEM_MAP. Virtual memmap has the
disadvantage that it needs a page table and thus is wasting TLB entries
on i386 and x86_64.

But it has the advantage that page_address() and virt_to_page() are simple
add / subtract / shift operations. Plus one can use the super fast hardware
table walker on i386 and x86_64. Plus the cpu is using its special optimized
TLB caches to store the mappings.

Sparse has the ability to configure larger and smaller chunk sizes. Let's say
we do that with VMEMMAP. Say we use huge pages as the basic unit to map the
memmap array. Then one block of 2M can map 32768 pages (assuming 128 byte
struct page size) which is around 128 Megabytes. The TLB pressure is
significantly reduced.

So we would have a 3 level page table to index into that array that
is comparable to a sparsemem tree.

We do not need all bits of the virtual memmap address to index since we
shift the address by PAGE_SHIFT. We could use the higher portion to store
the node number (Hmm... Not all bits are supported for virtual mappings,
right? But that would also reduce the number of bits that need to be
mapped through pfns.)

Then the node number could be retrieved from the address of the page struct
without even having to touch the page flags. page_zone would avoid yet
another lookup.

The section ID and the section tables are replaced by the page table and
the hardware walker.

The memory plugin / plugout could still work like under sparse. It is just
a matter of managing the page table to add and remove sections of memmap.
The page table is actually very much like the sparse tree. The code could
likely be made to work with a page table.

By that scheme we would win 6 bits on NUMAQ (32bit) and would save around
20-30 bits on a 64 bit machine.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
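A minimal sketch of the huge-page arithmetic in the message above. The 4 KB base page size is an assumption (the message does not state one), and the function and constant names are illustrative only. Under that assumption, the quoted figures of 32768 pages and 128 Megabytes correspond to a 64 byte struct page; a 128 byte struct page would give half of each.

```python
PAGE_SIZE = 4 * 1024            # assumed base page size: 4 KB
HUGE_PAGE_SIZE = 2 * 1024 ** 2  # the 2M mapping unit from the message


def memmap_coverage(struct_page_size):
    """Return (struct pages per huge page, bytes of memory they describe)."""
    entries = HUGE_PAGE_SIZE // struct_page_size
    return entries, entries * PAGE_SIZE


for size in (64, 128):
    entries, covered = memmap_coverage(size)
    print(f"{size:3d} byte struct page: {entries} pages, {covered >> 20} MB")
```

Either way, a single huge-page TLB entry describes tens of megabytes of memory, which is the point being made about reduced TLB pressure.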
* Re: One idea to free up page flags on NUMA
From: Andi Kleen @ 2006-09-23 16:04 UTC
To: Christoph Lameter; +Cc: linux-mm, haveblue

> By that scheme we would win 6 bits on NUMAQ (32bit)

NUMAsaurus is total legacy and I'm just waiting for the last one to die
to remove the code ;-)

> and would save around
> 20-30 bits on 64 bit machine.

And what would we use them for?

-Andi
* Re: One idea to free up page flags on NUMA
From: Christoph Lameter @ 2006-09-23 16:39 UTC
To: Andi Kleen; +Cc: linux-mm, haveblue

On Sat, 23 Sep 2006, Andi Kleen wrote:

> And what would we use them for?

Maybe a container number?

Anyways the scheme also would reduce the number of lookups needed and
thus the general footprint of the VM using sparse.

I just looked at the arch code for i386 and x86_64 and it seems that both
already have page tables for all of memory. It seems that a virtual memmap
like this would just eliminate sparse overhead and not add any additional
page table overhead.
* Re: One idea to free up page flags on NUMA
From: Andi Kleen @ 2006-09-23 18:43 UTC
To: Christoph Lameter; +Cc: linux-mm, haveblue

On Saturday 23 September 2006 18:39, Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
>
> > And what would we use them for?
>
> Maybe a container number?
>
> Anyways the scheme also would reduce the number of lookups needed and
> thus the general footprint of the VM using sparse.

So far most users (distributions) are not using sparse yet anyways.

> I just looked at the arch code for i386 and x86_64 and it seems that both
> already have page tables for all of memory.

i386 doesn't map all of memory.

> It seems that a virtual memmap
> like this would just eliminate sparse overhead and not add any additional
> page table overhead.

You would have new mappings with new overhead, no?

-Andi
* Re: One idea to free up page flags on NUMA
From: Christoph Lameter @ 2006-09-24 1:57 UTC
To: Andi Kleen; +Cc: linux-mm, haveblue

On Sat, 23 Sep 2006, Andi Kleen wrote:

> > I just looked at the arch code for i386 and x86_64 and it seems that both
> > already have page tables for all of memory.
>
> i386 doesn't map all of memory.

Hmmm... It only maps the kernel text segment?

> > It seems that a virtual memmap
> > like this would just eliminate sparse overhead and not add any additional
> > page table overhead.
>
> You would have new mappings with new overhead, no?

If mappings already exist then this would just mean using the existing
mappings to implement a virtual memmap array. If 386 has no mappings for
the kernel mappings then this may add more overhead. However, we
would be using the MMU which would be faster than manually simulating
MMU like lookups as sparse does now.

I think sparsemem could be modified to use the page table format. The
sparsemem infrastructure would still work.
* Re: One idea to free up page flags on NUMA
From: Andi Kleen @ 2006-09-24 7:24 UTC
To: Christoph Lameter; +Cc: linux-mm, haveblue

On Sunday 24 September 2006 03:57, Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
> > > I just looked at the arch code for i386 and x86_64 and it seems that
> > > both already have page tables for all of memory.
> >
> > i386 doesn't map all of memory.
>
> Hmmm... It only maps the kernel text segment?

Only lowmem (normally up to ~900MB).

But virtual memory is very scarce so I don't know where a new map for mem_map
would come from. Ok you could try to move the physical location of mem_map to
somewhere not in lowmem I suppose.

-Andi
* Re: One idea to free up page flags on NUMA
From: Christoph Lameter @ 2006-09-25 0:31 UTC
To: Andi Kleen; +Cc: linux-mm, haveblue

On Sun, 24 Sep 2006, Andi Kleen wrote:

> > Hmmm... It only maps the kernel text segment?
> Only lowmem (normally upto ~900MB)
>
> But virtual memory is very scarce so I don't know where a new map for mem_map
> would come from. Ok you could try to move the physical location of mem_map to
> somewhere not in lowmem I suppose.

Right, could be in highmem and thus would free up around 20 Megabytes of
low memory.
* Re: One idea to free up page flags on NUMA
From: Andi Kleen @ 2006-09-25 3:04 UTC
To: Christoph Lameter; +Cc: linux-mm, haveblue

On Monday 25 September 2006 02:31, Christoph Lameter wrote:
> On Sun, 24 Sep 2006, Andi Kleen wrote:
>
> > > Hmmm... It only maps the kernel text segment?
> > Only lowmem (normally upto ~900MB)
> >
> > But virtual memory is very scarce so I don't know where a new map for mem_map
> > would come from. Ok you could try to move the physical location of mem_map to
> > somewhere not in lowmem I suppose.
>
> Right could be in highmem and thus would free up around 20 Megabytes of
> low memory.

But won't the vmemmap need more than the 20MB?

-Andi
* Re: One idea to free up page flags on NUMA
From: Christoph Lameter @ 2006-09-25 3:46 UTC
To: Andi Kleen; +Cc: linux-mm, haveblue

On Mon, 25 Sep 2006, Andi Kleen wrote:

> > Right could be in highmem and thus would free up around 20 Megabytes of
> > low memory.
> But won't the vmemmap need more than the 20MB?

It will need the same as the regular memmap + one / two page table pages
pointing to the huge pages of the virtual memmap.

So if one goes from regular memmap to virtual memmap one pays with a few
page table pages and the need for additional TLB entries for lookup. But one
can remove the memmap entirely from the low memory area.

If we upgrade sparse to be able to use vmemmap then we trade the
existing sparse structures against the few page table pages plus the TLB
overhead.
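The cost claim above can be sketched as simple arithmetic. This is a rough illustration only: the 2M mapping unit comes from the first message in the thread, while the 512 entries per page table page (a 4 KB table of 8 byte entries) and all names are assumptions. It counts only the table pages that point directly at the huge pages, not any upper levels.

```python
HUGE_PAGE_SIZE = 2 * 1024 ** 2  # 2M mapping unit for the virtual memmap
ENTRIES_PER_TABLE = 512         # assumed: 4 KB table page of 8 byte entries


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return (a + b - 1) // b


def vmemmap_cost(memmap_bytes):
    """Return (huge pages backing the memmap, table pages pointing at them)."""
    huge_pages = ceil_div(memmap_bytes, HUGE_PAGE_SIZE)
    table_pages = ceil_div(huge_pages, ENTRIES_PER_TABLE)
    return huge_pages, table_pages
```

For the roughly 20 Megabyte memmap mentioned in the thread, `vmemmap_cost(20 * 1024 ** 2)` gives 10 huge pages and a single table page under these assumptions.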
* Re: One idea to free up page flags on NUMA
From: Dave Hansen @ 2006-09-23 19:24 UTC
To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, Andy Whitcroft

On Sat, 2006-09-23 at 09:39 -0700, Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
> > And what would we use them for?
>
> Maybe a container number?

I have a feeling this is better done at the more coarse objects like
address_spaces and vmas.

> Anyways the scheme also would reduce the number of lookups needed and
> thus the general footprint of the VM using sparse.
>
> I just looked at the arch code for i386 and x86_64 and it seems that both
> already have page tables for all of memory. It seems that a virtual memmap
> like this would just eliminate sparse overhead and not add any additional
> page table overhead.

I'm not sure to what sparse overhead you are referring. Its only
storage overhead is one pointer per SECTION_SIZE bytes of memory. The
worst case scenario is 16MB sections on ppc64 with 16TB of memory.

2^20 sections * 2^3 bytes/pointer = 2^23 bytes of sparse overhead, which
is 8MB. That's pretty little overhead no matter how you look at it,
cache footprint, tlb load, etc... Add to that the fact that we get some
extra things from sparsemem like pfn_valid() and the bookkeeping for
whether or not the memory is there (before the mem_map is actually
allocated), and it doesn't look too bad.

If someone can actually demonstrate some actual, measurable performance
problem with it, then I'm all ears. I worry that anything else is just
potential overzealous micro-optimization trying to solve problems that
don't really exist. Remember, sparsemem slightly beats discontigmem on
x86 NUMA hardware, so it isn't much of a dog to begin with.

Sparsemem is a ~100 line patch to port to a new architecture. That code
is virtually all #defines and hooking into the pfn_to_page() mechanisms.
There's virtually no logic in there. That's going to be hard to beat
with any kind of vmem_map[] approach.

-- Dave
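The worst-case figure in the message above can be checked with a few lines of arithmetic. This is only a sketch of the calculation as stated (one pointer per section); the function name and the 8 byte default pointer size are illustrative assumptions.

```python
def sparse_table_bytes(memory_bytes, section_bytes, pointer_bytes=8):
    """Return (number of sections, bytes of per-section pointers)."""
    sections = memory_bytes // section_bytes
    return sections, sections * pointer_bytes


# 16TB of memory in 16MB sections, as in the ppc64 worst case above.
sections, overhead = sparse_table_bytes(16 * 2 ** 40, 16 * 2 ** 20)
print(sections, overhead >> 20, "MB")
```

With those inputs the result is 2^20 sections and 2^23 bytes, matching the 8MB quoted in the message.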
* Re: One idea to free up page flags on NUMA
From: Christoph Lameter @ 2006-09-24 1:56 UTC
To: Dave Hansen; +Cc: Andi Kleen, linux-mm, Andy Whitcroft

On Sat, 23 Sep 2006, Dave Hansen wrote:

> I'm not sure to what sparse overhead you are referring. Its only
> storage overhead is one pointer per SECTION_SIZE bytes of memory. The
> worst case scenario is 16MB sections on ppc64 with 16TB of memory.

The problem is that these arrays are frequently referenced. They increase
the VM overhead and if we already have page tables in place then it's easy
to just use the format of the page tables for sparse like memory
functionality.

> 2^20 sections * 2^3 bytes/pointer = 2^23 bytes of sparse overhead, which
> is 8MB. That's pretty little overhead no matter how you look at it,
> cache footprint, tlb load, etc... Add to that the fact that we get some
> extra things from sparsemem like pfn_valid() and the bookkeeping for
> whether or not the memory is there (before the mem_map is actually
> allocated), and it doesn't look too bad.

Page tables also provide the same functionality. There is a present bit
etc. Simulation of core MMU functionality is certainly not faster than
using the cpu MMU engines.

> If someone can actually demonstrate some actual, measurable performance
> problem with it, then I'm all ears. I worry that anything else is just
> potential overzealous micro-optimization trying to solve problems that
> don't really exist. Remember, sparsemem slightly beats discontigmem on
> x86 NUMA hardware, so it isn't much of a dog to begin with.

Yes it may beat it if you use 4k page sizes for it and if you are wasting
additional TLB entries for it. If we are already using a page table for
memory then this can only be better than managing tables on your own.

> Sparsemem is a ~100 line patch to port to a new architecture. That code
> is virtually all #defines and hooking into the pfn_to_page() mechanisms.
> There's virtually no logic in there. That's going to be hard to beat
> with any kind of vmem_map[] approach.

Well we already have page tables there. It's just a matter of reserving a
virtual memory area for the virtual memmap and changing some page table
entries. Then one can get rid of the sparse tables and simply use existing
non sparse virt_to_page and page_address() (have a look how ia64 does it).

The main problem with sparsemem in that situation is that we uselessly
have additional tables that waste cachelines plus we use a series of bits
in page flags that could be used for better purposes.

If sparse would use the native page table format then you can use
that to plug memory in and out. From what I can tell there is the same
information in those tables.

virt_to_page and page_address are really fast without table lookups.
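The contrast being argued here can be modeled in a few lines. This is a toy model, not kernel code: the base address, the 64 byte struct page size, the section shift, and all function names are assumptions made for illustration, and the table lookup is a simplified stand-in for a sparse-style section table rather than the kernel's actual encoding.

```python
STRUCT_PAGE_SIZE = 64                 # assumed size of struct page
VMEMMAP_BASE = 0xFFFF_E000_0000_0000  # hypothetical base of the virtual memmap
PFN_SECTION_SHIFT = 12                # assumed: 2**12 pages per section


def pfn_to_page(pfn):
    """Virtual memmap: address of the struct page is one multiply and one add."""
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE


def page_to_pfn(page_addr):
    """Inverse mapping: one subtract and one divide, no table access."""
    return (page_addr - VMEMMAP_BASE) // STRUCT_PAGE_SIZE


def sparse_pfn_to_page(section_table, pfn):
    """Sparse-style model: read a per-section base from a table, then index."""
    section_base = section_table[pfn >> PFN_SECTION_SHIFT]
    offset_in_section = pfn & ((1 << PFN_SECTION_SHIFT) - 1)
    return section_base + offset_in_section * STRUCT_PAGE_SIZE
```

In the first pair the only memory access is performed by the MMU when the resulting address is dereferenced; in the sparse-style model the software reads `section_table` itself, which is the extra reference the message objects to.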
Thread overview: 11+ messages

2006-09-23  3:02 One idea to free up page flags on NUMA Christoph Lameter
2006-09-23 16:04 ` Andi Kleen
2006-09-23 16:39   ` Christoph Lameter
2006-09-23 18:43     ` Andi Kleen
2006-09-24  1:57       ` Christoph Lameter
2006-09-24  7:24         ` Andi Kleen
2006-09-25  0:31           ` Christoph Lameter
2006-09-25  3:04             ` Andi Kleen
2006-09-25  3:46               ` Christoph Lameter
2006-09-23 19:24     ` Dave Hansen
2006-09-24  1:56       ` Christoph Lameter