One idea to free up page flags on NUMA

linux-mm.kvack.org archive mirror

* One idea to free up page flags on NUMA
@ 2006-09-23  3:02 Christoph Lameter
  2006-09-23 16:04 ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2006-09-23  3:02 UTC
  To: linux-mm; +Cc: haveblue

Andrew asked for a way to free up page flags and I think there is a way to 
get rid of both the node information and the sparse field in the page 
flags.

For that we adopt some ideas from VIRTUAL_MEM_MAP. Virtual memmap has the 
disadvantage that it needs a page table and thus wastes TLB entries on 
i386 and x86_64. But it has the advantage that page_address() and
virt_to_page() are simple add / subtract / shift operations. Plus one 
can use the super fast hardware table walker on i386 and x86_64, and the 
cpu uses its specially optimized TLB caches to store the mappings.
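
A minimal sketch of why these conversions reduce to an add, a subtract and
a shift once all struct pages live in one virtually contiguous array. The
names vmemmap_base and page_to_phys, the reduced struct page and the
PAGE_SHIFT value are assumptions for illustration, not the code of any
particular architecture.

```c
#include <stdint.h>

#define PAGE_SHIFT 12                 /* assumption: 4K base pages */

struct page {
    unsigned long flags;              /* other fields omitted in this sketch */
};

/* Hypothetical base of the virtual memmap; assumed to be set up at boot. */
static struct page *vmemmap_base;

/* pfn -> struct page: a single add on the array base. */
static inline struct page *pfn_to_page(unsigned long pfn)
{
    return vmemmap_base + pfn;
}

/* struct page -> pfn: a single pointer subtract. */
static inline unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - vmemmap_base);
}

/* struct page -> physical address of the frame: subtract, then shift. */
static inline uint64_t page_to_phys(const struct page *page)
{
    return (uint64_t)page_to_pfn(page) << PAGE_SHIFT;
}
```

No table is consulted in software here; whether the struct page itself is
mapped is left to the page table and the hardware walker.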

Sparse has the ability to configure larger and smaller chunk sizes. Let's 
say we do that with VMEMMAP. Say we use huge pages as the basic unit
to map the memmap array.

Then one block of 2M can map 32768 pages (assuming 64 byte struct page 
size) which is around 128 Megabytes with 4K pages. The TLB pressure is 
significantly reduced.
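
The coverage of one huge page of memmap can be checked with a few lines of
arithmetic. This is a sketch assuming 4K base pages (not stated in the
mail); under that assumption 32768 pages and 128 Megabytes correspond to a
64 byte struct page, while a 128 byte struct page gives 16384 pages and 64
Megabytes.

```c
#include <stdint.h>

#define HUGE_PAGE_SIZE (2ULL << 20)   /* 2M huge page, as in the text */
#define BASE_PAGE_SIZE (4ULL << 10)   /* assumption: 4K base pages */

/* Number of struct page entries that fit into one huge page. */
static uint64_t pages_per_huge_page(uint64_t struct_page_size)
{
    return HUGE_PAGE_SIZE / struct_page_size;
}

/* Bytes of physical memory described by one huge page of memmap. */
static uint64_t memory_covered(uint64_t struct_page_size)
{
    return pages_per_huge_page(struct_page_size) * BASE_PAGE_SIZE;
}
```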

So we would have a 3 level page table to index into that array that is 
comparable to a sparsemem tree.

We do not need all bits of the virtual memmap address to index
since we shift the address by PAGE_SHIFT. We could use the 
higher portion to store the node number (Hmm... Not all bits
are supported for virtual mappings, right? But that would also reduce the 
number of bits that need to be mapped through pfns.)

Then the node number could be retrieved from the address of the page 
struct without even having to touch the page flags. page_zone would 
avoid yet another lookup. The section ID and the section 
tables are replaced by the page table and the hardware walker.
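
One way to picture this is a layout in which the upper bits of the memmap
index carry the node number. The sketch below works on plain integer
addresses; VMEMMAP_START, STRUCT_PAGE_SIZE and NODE_PFN_BITS are
hypothetical values chosen for illustration, and the mail does not fix any
of them.

```c
#include <stdint.h>

#define VMEMMAP_START    0xffffe20000000000ULL  /* hypothetical base address */
#define STRUCT_PAGE_SIZE 64ULL                  /* assumption */
#define NODE_PFN_BITS    24                     /* hypothetical: 2^24 pfns per node */

/* Address of the struct page for (node, node-local pfn). */
static uint64_t page_addr(unsigned int nid, uint64_t node_pfn)
{
    uint64_t index = ((uint64_t)nid << NODE_PFN_BITS) | node_pfn;
    return VMEMMAP_START + index * STRUCT_PAGE_SIZE;
}

/* Node number recovered from the address alone, without page flags. */
static unsigned int page_addr_to_nid(uint64_t addr)
{
    uint64_t index = (addr - VMEMMAP_START) / STRUCT_PAGE_SIZE;
    return (unsigned int)(index >> NODE_PFN_BITS);
}

/* Node-local pfn recovered from the same address. */
static uint64_t page_addr_to_node_pfn(uint64_t addr)
{
    uint64_t index = (addr - VMEMMAP_START) / STRUCT_PAGE_SIZE;
    return index & ((1ULL << NODE_PFN_BITS) - 1);
}
```

The cost of such a layout is virtual address space: every node reserves a
full 2^NODE_PFN_BITS window of memmap whether or not it is populated.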

Memory plug-in / plug-out could still work as it does under sparse. It is 
just a matter of managing the page table to add and remove sections of 
memmap. The page table is actually very much like the sparse tree. The code 
could likely be made to work with a page table.

By that scheme we would win 6 bits on NUMAQ (32bit) and would save around 
20-30 bits on a 64 bit machine.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: One idea to free up page flags on NUMA
  2006-09-23  3:02 One idea to free up page flags on NUMA Christoph Lameter
@ 2006-09-23 16:04 ` Andi Kleen
  2006-09-23 16:39   ` Christoph Lameter
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2006-09-23 16:04 UTC
  To: Christoph Lameter; +Cc: linux-mm, haveblue

 
> By that scheme we would win 6 bits on NUMAQ (32bit) 

NUMAsaurus is total legacy and I'm just waiting for the last one to die to 
remove the code ;-)

> and would save around  
> 20-30 bits on 64 bit machine.

And what would we use them for?

-Andi


* Re: One idea to free up page flags on NUMA
  2006-09-23 16:04 ` Andi Kleen
@ 2006-09-23 16:39   ` Christoph Lameter
  2006-09-23 18:43     ` Andi Kleen
  2006-09-23 19:24     ` Dave Hansen
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Lameter @ 2006-09-23 16:39 UTC
  To: Andi Kleen; +Cc: linux-mm, haveblue

On Sat, 23 Sep 2006, Andi Kleen wrote:

> And what would we use them for?

Maybe a container number?

Anyways the scheme also would reduce the number of lookups needed and 
thus the general footprint of the VM using sparse.

I just looked at the arch code for i386 and x86_64 and it seems that both 
already have page tables for all of memory. It seems that a virtual memmap 
like this would just eliminate sparse overhead and not add any additional 
page table overhead.


* Re: One idea to free up page flags on NUMA
  2006-09-23 16:39   ` Christoph Lameter
@ 2006-09-23 18:43     ` Andi Kleen
  2006-09-24  1:57       ` Christoph Lameter
  2006-09-23 19:24     ` Dave Hansen
  1 sibling, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2006-09-23 18:43 UTC
  To: Christoph Lameter; +Cc: linux-mm, haveblue

On Saturday 23 September 2006 18:39, Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
> 
> > And what would we use them for?
> 
> Maybe a container number?
> 
> Anyways the scheme also would reduce the number of lookups needed and 
> thus the general footprint of the VM using sparse.

So far most users (distributions) are not using sparse yet anyways.
  
> I just looked at the arch code for i386 and x86_64 and it seems that both 
> already have page tables for all of memory. 

i386 doesn't map all of memory.

> It seems that a virtual memmap  
> like this would just eliminate sparse overhead and not add any additional 
> page table overhead.

You would have new mappings with new overhead, no?

-Andi



* Re: One idea to free up page flags on NUMA
  2006-09-23 16:39   ` Christoph Lameter
  2006-09-23 18:43     ` Andi Kleen
@ 2006-09-23 19:24     ` Dave Hansen
  2006-09-24  1:56       ` Christoph Lameter
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2006-09-23 19:24 UTC
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, Andy Whitcroft

On Sat, 2006-09-23 at 09:39 -0700, Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
> > And what would we use them for?
> 
> Maybe a container number?

I have a feeling this is better done at the more coarse objects like
address_spaces and vmas.

> Anyways the scheme also would reduce the number of lookups needed and 
> thus the general footprint of the VM using sparse.
> 
> I just looked at the arch code for i386 and x86_64 and it seems that both 
> already have page tables for all of memory. It seems that a virtual memmap 
> like this would just eliminate sparse overhead and not add any additional 
> page table overhead.

I'm not sure to what sparse overhead you are referring.  Its only
storage overhead is one pointer per SECTION_SIZE bytes of memory.  The
worst case scenario is 16MB sections on ppc64 with 16TB of memory.  

2^20 sections * 2^3 bytes/pointer = 2^23 bytes of sparse overhead, which
is 8MB.  That's pretty little overhead no matter how you look at it,
cache footprint, tlb load, etc...  Add to that the fact that we get some
extra things from sparsemem like pfn_valid() and the bookkeeping for
whether or not the memory is there (before the mem_map is actually
allocated), and it doesn't look too bad.
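
The worst-case figure above can be reproduced as follows. This is a sketch
of the arithmetic only; the function name is hypothetical and the
parameters are the ones quoted in the paragraph (16TB of memory, 16MB
sections, 8 byte pointers).

```c
#include <stdint.h>

/* Bytes of section table needed: one pointer per section of memory. */
static uint64_t sparse_table_bytes(uint64_t mem_bytes,
                                   uint64_t section_bytes,
                                   uint64_t pointer_bytes)
{
    uint64_t sections = mem_bytes / section_bytes;
    return sections * pointer_bytes;
}
```

With 16TB / 16MB = 2^20 sections and 8 bytes per pointer this yields 2^23
bytes, i.e. 8MB.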

If someone can actually demonstrate some actual, measurable performance
problem with it, then I'm all ears.  I worry that anything else is just
potential overzealous micro-optimization trying to solve problems that
don't really exist.  Remember, sparsemem slightly beats discontigmem on
x86 NUMA hardware, so it isn't much of a dog to begin with.

Sparsemem is a ~100 line patch to port to a new architecture.  That code
is virtually all #defines and hooking into the pfn_to_page() mechanisms.
There's virtually no logic in there.  That's going to be hard to beat
with any kind of vmem_map[] approach.

-- Dave


* Re: One idea to free up page flags on NUMA
  2006-09-23 19:24     ` Dave Hansen
@ 2006-09-24  1:56       ` Christoph Lameter
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2006-09-24  1:56 UTC
  To: Dave Hansen; +Cc: Andi Kleen, linux-mm, Andy Whitcroft

On Sat, 23 Sep 2006, Dave Hansen wrote:

> I'm not sure to what sparse overhead you are referring.  Its only
> storage overhead is one pointer per SECTION_SIZE bytes of memory.  The
> worst case scenario is 16MB sections on ppc64 with 16TB of memory.  

The problem is that these arrays are frequently referenced. They increase
the VM overhead, and if we already have page tables in place then it's easy
to just use the format of the page tables for sparse-like memory 
functionality.

> 2^20 sections * 2^3 bytes/pointer = 2^23 bytes of sparse overhead, which
> is 8MB.  That's pretty little overhead no matter how you look at it,
> cache footprint, tlb load, etc...  Add to that the fact that we get some
> extra things from sparsemem like pfn_valid() and the bookkeeping for
> whether or not the memory is there (before the mem_map is actually
> allocated), and it doesn't look too bad.

Page tables also provide the same functionality. There is a present bit
etc. Simulation of core MMU functionality is certainly not faster than
using the cpu MMU engines.

> If someone can actually demonstrate some actual, measurable performance
> problem with it, then I'm all ears.  I worry that anything else is just
> potential overzealous micro-optimization trying to solve problems that
> don't really exist.  Remember, sparsemem slightly beats discontigmem on
> x86 NUMA hardware, so it isn't much of a dog to begin with.

Yes it may beat it if you use 4k page sizes for it and if you are
wasting additional TLB entries for it. If we are already using a page
table for memory then this can only be better than managing tables on your 
own.

> Sparsemem is a ~100 line patch to port to a new architecture.  That code
> is virtually all #defines and hooking into the pfn_to_page() mechanisms.
> There's virtually no logic in there.  That's going to be hard to beat
> with any kind of vmem_map[] approach.

Well we already have page tables there. It's just a matter of reserving
a virtual memory area for the virtual memmap and changing some page table
entries. Then one can get rid of the sparse tables and simply use the
existing non sparse virt_to_page() and page_address() (have a look at how 
ia64 does it). The main problem with sparsemem in that situation is that we 
uselessly have additional tables that waste cachelines, plus we use a 
series of bits in page flags that could be used for better purposes.

If sparse would use the native page table format then you can use that to 
plug memory in and out. From what I can tell there is the same information 
in those tables. virt_to_page and page_address are really fast without 
table lookups.
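
For contrast, a sparse-style lookup done in software can be sketched as
below. This is a simplified user-space model, not the kernel's sparsemem
code: the table size, the section size, the name section_table and the
validity check inside pfn_to_page() are assumptions made for illustration.

```c
#include <stddef.h>

#define PFN_SECTION_SHIFT 4                       /* hypothetical: 16 pages per section */
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_SECTIONS       8                       /* hypothetical table size */

struct page {
    unsigned long flags;                          /* other fields omitted */
};

/* One entry per section; a NULL pointer models "section not present". */
struct mem_section {
    struct page *section_mem_map;
};

static struct mem_section section_table[NR_SECTIONS];

/* Presence check done in software via the section table. */
static int pfn_valid(unsigned long pfn)
{
    unsigned long nr = pfn >> PFN_SECTION_SHIFT;

    return nr < NR_SECTIONS && section_table[nr].section_mem_map != NULL;
}

/* Two steps: load the section entry, then index within the section. */
static struct page *pfn_to_page(unsigned long pfn)
{
    struct mem_section *ms;

    if (!pfn_valid(pfn))
        return NULL;
    ms = &section_table[pfn >> PFN_SECTION_SHIFT];
    return ms->section_mem_map + (pfn & (PAGES_PER_SECTION - 1));
}
```

The load from section_table is the extra memory reference under
discussion; with a virtual memmap the equivalent step is performed by the
MMU when the struct page address is translated.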


* Re: One idea to free up page flags on NUMA
  2006-09-23 18:43     ` Andi Kleen
@ 2006-09-24  1:57       ` Christoph Lameter
  2006-09-24  7:24         ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2006-09-24  1:57 UTC
  To: Andi Kleen; +Cc: linux-mm, haveblue

On Sat, 23 Sep 2006, Andi Kleen wrote:

> > I just looked at the arch code for i386 and x86_64 and it seems that both 
> > already have page tables for all of memory. 
> 
> i386 doesn't map all of memory.

Hmmm... It only maps the kernel text segment?

> > It seems that a virtual memmap  
> > like this would just eliminate sparse overhead and not add any additional 
> > page table overhead.
> 
> You would have new mappings with new overhead, no?

If mappings already exist then this would just mean using the existing 
mappings to implement a virtual memmap array. If i386 has no such mappings 
for the memory in question then this may add more overhead. However, we 
would be using the MMU, which would be faster than manually simulating 
MMU-like lookups as sparse does now. I think sparsemem could be modified to 
use the page table format. The sparsemem infrastructure would still work.


* Re: One idea to free up page flags on NUMA
  2006-09-24  1:57       ` Christoph Lameter
@ 2006-09-24  7:24         ` Andi Kleen
  2006-09-25  0:31           ` Christoph Lameter
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2006-09-24  7:24 UTC
  To: Christoph Lameter; +Cc: linux-mm, haveblue

On Sunday 24 September 2006 03:57, Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
> > > I just looked at the arch code for i386 and x86_64 and it seems that
> > > both already have page tables for all of memory.
> >
> > i386 doesn't map all of memory.
>
> Hmmm... It only maps the kernel text segment?

Only lowmem (normally up to ~900MB)

But virtual memory is very scarce so I don't know where a new map for mem_map
would come from. Ok you could try to move the physical location of mem_map to 
somewhere not in lowmem I suppose.

-Andi


* Re: One idea to free up page flags on NUMA
  2006-09-24  7:24         ` Andi Kleen
@ 2006-09-25  0:31           ` Christoph Lameter
  2006-09-25  3:04             ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2006-09-25  0:31 UTC
  To: Andi Kleen; +Cc: linux-mm, haveblue

On Sun, 24 Sep 2006, Andi Kleen wrote:

> > Hmmm... It only maps the kernel text segment?
> Only lowmem (normally upto ~900MB)
> 
> But virtual memory is very scarce so I don't know where a new map for mem_map
> would come from. Ok you could try to move the physical location of mem_map to 
> somewhere not in lowmem I suppose.

Right, it could be in highmem and thus would free up around 20 Megabytes of 
low memory.



* Re: One idea to free up page flags on NUMA
  2006-09-25  0:31           ` Christoph Lameter
@ 2006-09-25  3:04             ` Andi Kleen
  2006-09-25  3:46               ` Christoph Lameter
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2006-09-25  3:04 UTC
  To: Christoph Lameter; +Cc: linux-mm, haveblue

On Monday 25 September 2006 02:31, Christoph Lameter wrote:
> On Sun, 24 Sep 2006, Andi Kleen wrote:
> 
> > > Hmmm... It only maps the kernel text segment?
> > Only lowmem (normally upto ~900MB)
> > 
> > But virtual memory is very scarce so I don't know where a new map for mem_map
> > would come from. Ok you could try to move the physical location of mem_map to 
> > somewhere not in lowmem I suppose.
> 
> Right could be in highmem and thus would free up around 20 Megabytes of 
> low memory.

But won't the vmemmap need more than the 20MB?

-Andi


* Re: One idea to free up page flags on NUMA
  2006-09-25  3:04             ` Andi Kleen
@ 2006-09-25  3:46               ` Christoph Lameter
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2006-09-25  3:46 UTC
  To: Andi Kleen; +Cc: linux-mm, haveblue

On Mon, 25 Sep 2006, Andi Kleen wrote:

> > Right could be in highmem and thus would free up around 20 Megabytes of 
> > low memory.
> But won't the vmemmap need more than the 20MB?

It will need the same as the regular memmap + one / two page table pages 
pointing to the huge pages of the virtual memmap. So if one goes from 
regular memmap to virtual memmap one pays with a few page table pages and 
the need for additional TLB entries for lookup. But one can remove the 
memmap entirely from the low memory area.

If we upgrade sparse to be able to use vmemmap then we trade the 
existing sparse structures against the few page table pages plus the 
TLB overhead.
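
The trade-off can be put into rough numbers with the sketch below. All
concrete figures in the usage note (4GB of memory, 32 byte struct page, 2M
huge pages, 512 entries per page table page) are assumptions for
illustration and do not come from the thread; the function names are
hypothetical.

```c
#include <stdint.h>

#define BASE_PAGE_SIZE (4ULL << 10)   /* assumption: 4K base pages */

/* Size of the memmap array describing mem_bytes of memory. */
static uint64_t memmap_bytes(uint64_t mem_bytes, uint64_t struct_page_size)
{
    return (mem_bytes / BASE_PAGE_SIZE) * struct_page_size;
}

/* Huge pages needed to back a memmap of map_bytes (rounded up). */
static uint64_t huge_pages_needed(uint64_t map_bytes, uint64_t huge_page_size)
{
    return (map_bytes + huge_page_size - 1) / huge_page_size;
}

/* Page table pages needed to hold that many huge page entries (rounded up). */
static uint64_t table_pages_needed(uint64_t entries, uint64_t entries_per_table)
{
    return (entries + entries_per_table - 1) / entries_per_table;
}
```

Under those assumptions, 4GB of memory needs a 32MB memmap, which is
backed by 16 huge pages whose entries fit into a single page table page.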

