* virtual mmap basics
From: Christoph Lameter @ 2006-09-24 16:59 UTC
To: linux-mm
Let's say we have memory of MAX_PFN pages.

Then we need an array of MAX_PFN page structs, called mem_map, to manage
that memory.

For memmap processing without virtualization (FLATMEM), simplified,
we have:
#define pfn_valid(pfn) (pfn < max_pfn)
#define pfn_to_page(pfn) &mem_map[pfn]
#define page_to_pfn(page) (page - mem_map)
which is then used to build the commonly used functions:
#define virt_to_page(kaddr) pfn_to_page(kaddr >> PAGE_SHIFT)
#define page_address(page) (page_to_pfn(page) << PAGE_SHIFT)
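The unsimplified versions also translate between kernel virtual and
physical addresses; roughly:

#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
#define page_address(page) __va(page_to_pfn(page) << PAGE_SHIFT)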
Virtual Memory Map
-------------------
For a virtual memory map we reserve a virtual memory area

VMEMMAP_START ... VMEMMAP_START + max_pfn * sizeof(struct page)

vmem_map is defined to be a pointer to struct page. It is a constant
pointing to VMEMMAP_START.

We use page tables to manage the virtual memory map. Page tables
may be sparse. Pages in the area used for page structs may be missing.
Software may dynamically add new page table entries to make new
ranges of pfns valid. It's like sparsemem.
The basic functions then become:
#define pfn_valid(pfn) (pfn < max_pfn && valid_page_table_entry(pfn))
#define pfn_to_page(pfn) &vmem_map[pfn]
#define page_to_pfn(page) (page - vmem_map)
We only lose (apart from additional TLB use if this memory was not
already using page tables) on pfn_valid, where we have to traverse the page
table via valid_page_table_entry() if the processor does not have an
instruction to check that condition. We could avoid the page table
traversal by having the page fault handler deal with it somehow. But then
pfn_valid is not that frequent an operation.
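A minimal sketch of such a walk (hypothetical helper, assuming the
generic four-level page table accessors; an arch with a hardware probe
instruction would replace all of this):

static int valid_page_table_entry(unsigned long pfn)
{
	unsigned long addr = (unsigned long)(vmem_map + pfn);
	pgd_t *pgd = pgd_offset_k(addr);
	pud_t *pud;
	pmd_t *pmd;

	if (pgd_none(*pgd))
		return 0;
	pud = pud_offset(pgd, addr);
	if (pud_none(*pud))
		return 0;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd))
		return 0;
	return pte_present(*pte_offset_kernel(pmd, addr));
}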
virt_to_page and page_address remain unchanged.
Sparse
------
Sparse currently does troublesome lookups for virt_to_page
and page_address.
#define page_to_pfn(pg) \
	((pg) - section_mem_map_addr(nr_to_section(page_to_section(pg))))

#define pfn_to_page(pfn) \
	(section_mem_map_addr(pfn_to_section(pfn)) + (pfn))
page_to_section extracts the section number from page->flags.
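Roughly:

static inline unsigned long page_to_section(struct page *page)
{
	return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
}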
static inline struct mem_section *nr_to_section(unsigned long nr)
{
	if (!mem_section[SECTION_NR_TO_ROOT(nr)])
		return NULL;
	return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
}

static inline struct page *section_mem_map_addr(struct mem_section *section)
{
	unsigned long map = section->section_mem_map;

	map &= SECTION_MAP_MASK;
	return (struct page *)map;
}
So we have a minimum of a couple of table lookups and one page->flags
retrieval (okay, that may be argued to be in cache) in virt_to_page, versus
*none* in the virtual memory map case. Similarly troublesome code exists
for the reverse case.

pfn_valid requires at least 3 lookups, which may be equivalent
to walking the page table over 3 levels if the processor has no instruction
to make the hardware do it.
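For comparison, sparse's pfn_valid (roughly, from the same era) is a
section number extraction, the table lookup above, and a flags test on
section_mem_map:

static inline int valid_section(struct mem_section *section)
{
	return section && (section->section_mem_map & SECTION_HAS_MEM_MAP);
}

static inline int pfn_valid(unsigned long pfn)
{
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	return valid_section(nr_to_section(pfn_to_section_nr(pfn)));
}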
* Re: virtual mmap basics
From: Andy Whitcroft @ 2006-09-25 12:28 UTC
To: Christoph Lameter; +Cc: linux-mm
Christoph Lameter wrote:
> We only lose (apart from additional TLB use if this memory was not
> already using page tables) on pfn_valid, where we have to traverse the page
> table via valid_page_table_entry() if the processor does not have an
> instruction to check that condition. We could avoid the page table
> traversal by having the page fault handler deal with it somehow. But then
> pfn_valid is not that frequent an operation.
pfn_valid is most commonly required on virtual mem_map setups as their
implementation (currently) violates the 'contiguous and present out to
MAX_ORDER' constraint that the buddy allocator expects. So we have
additional frequent pfn_valid checks in the allocator when there
are holes within zones (which are virtual memmaps in all but name).

We also need to consider the size of the mem_map. The reason we have a
problem with smaller machines is that virtual space in zone NORMAL is
limited. The mem_map here has to be contiguous and sparse in KVA; this
is exactly the resource we are short of.
-apw
* Re: virtual mmap basics
From: Christoph Lameter @ 2006-09-25 16:27 UTC
To: Andy Whitcroft; +Cc: linux-mm
On Mon, 25 Sep 2006, Andy Whitcroft wrote:
> pfn_valid is most commonly required on virtual mem_map setups as their
> implementation (currently) violates the 'contiguous and present out to
> MAX_ORDER' constraint that the buddy allocator expects. So we have
> additional frequent pfn_valid checks in the allocator when there
> are holes within zones (which are virtual memmaps in all but name).
Why would the page allocator require frequent calls to pfn_valid? Once
you have the free lists set up there is no need for it, AFAIK.

Still, pfn_valid with a virtual memmap is comparable to sparse's
current implementation. If the cpu has an instruction to check the
validity of an address then it will be superior.
> We also need to consider the size of the mem_map. The reason we have a
> problem with smaller machines is that virtual space in zone NORMAL is
> limited. The mem_map here has to be contiguous and sparse in KVA; this
> is exactly the resource we are short of.
The point of the virtual memmap is that it does not have to be (physically)
contiguous and it is sparse. Sparsemem could use that format and then we
would be able to optimize important VM functions such as virt_to_page() and
page_address().
* Re: virtual mmap basics
From: Christoph Lameter @ 2006-09-25 17:11 UTC
To: Andy Whitcroft; +Cc: linux-mm, ak
Hmmm... Some more thoughts on virtual memory requirements.

A page struct is 8 words on 32 bit platforms = 32 bytes.
On 64 bit we have 7 words... let's just go with 64 bytes.

The memory requirements for a memmap structure covering all of memory
(and also the virtual memory requirements for a virtual memmap) are:
32 bit, 4K page size:

Regular:
	4GB of addressable memory = 1 million page structs = 32MB.

PAE mode:
	64GB of memory = 16 million page structs = 512MB.

Hmm.... So without PAE mode we are fine on i386. The 512MB
virtual space requirement to support all of 64GB of memory with
HIGHMEM64G may be difficult to fulfill. This is 1/8th of the address space!
Sparse's ability to avoid virtual memory use comes in handy if memory is
actually larger than supported by the processor. But then these
configurations are becoming rarer with the advent of 64 bit processors.
On 64 bit platforms we need 64 bytes per potential page.
The maximum on IA64 is 16TB. With a page size of 16k this gets you 2^30
page structs, which take up 64GB of address space. This is the
implementation that IA64 has used from the beginning.
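The arithmetic, as a quick standalone check (struct page sizes assumed
as above):

#include <stdio.h>

int main(void)
{
	/* 32 bit PAE: 64GB of memory, 4K pages, 32 byte page structs */
	unsigned long long pae_pages = (64ULL << 30) / (4 << 10);
	printf("PAE memmap: %lluMB\n", pae_pages * 32 >> 20);	/* 512 */

	/* IA64: 16TB maximum, 16K pages, 64 byte page structs */
	unsigned long long ia64_pages = (16ULL << 40) / (16 << 10);
	printf("IA64 memmap: %lluGB\n", ia64_pages * 64 >> 30);	/* 64 */
	return 0;
}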
* Re: virtual mmap basics
From: Christoph Lameter @ 2006-09-25 21:05 UTC
To: Andy Whitcroft; +Cc: linux-mm, ak
On Mon, 25 Sep 2006, Christoph Lameter wrote:
> PAE mode:
> 64GB of memory = 16 million page structs = 512MB.
>
> Hmm.... So without PAE mode we are fine on i386. The 512MB
> virtual space requirement to support all of 64GB of memory with
> HIGHMEM64G may be difficult to fulfill. This is 1/8th of the address space!
> Sparse's ability to avoid virtual memory use comes in handy if memory is
> actually larger than supported by the processor. But then these
> configurations are becoming rarer with the advent of 64 bit processors.
On the other hand the PAE sparse approach is not that good for
i386 with 64GB. The sparse memmap must be in regular memory and thus we
are forced to use 512MB of the available ~900MB of lowmem for the
memmap.

Using a virtual memmap there would allow relocation of the memmap array
into high memory and would double the available low memory. So it may be
worth it even on this 32 bit platform to sacrifice 1/8th of the virtual
address space for the memmap.
So far I am not seeing any convincing case for the current sparsemem table
lookups. But there must have been some reason that such an implementation
was chosen. What was it?
* Re: virtual mmap basics
From: Andy Whitcroft @ 2006-09-25 22:22 UTC
To: Christoph Lameter; +Cc: linux-mm, ak
Christoph Lameter wrote:
> On Mon, 25 Sep 2006, Christoph Lameter wrote:
>
>> PAE mode:
>> 64GB of memory = 16 million page structs = 512MB.
>>
>> Hmm.... So without PAE mode we are fine on i386. The 512MB
>> virtual space requirement to support all of 64GB of memory with
>> HIGHMEM64G may be difficult to fulfill. This is 1/8th of the address space!
>> Sparse's ability to avoid virtual memory use comes in handy if memory is
>> actually larger than supported by the processor. But then these
>> configurations are becoming rarer with the advent of 64 bit processors.
>
> On the other hand the PAE sparse approach is not that good for
> i386 with 64GB. The sparse memmap must be in regular memory and thus we
> are forced to use 512MB of the available ~900MB of lowmem for the
> memmap.
>
> Using a virtual memmap there would allow relocation of the memmap array
> into high memory and would double the available low memory. So it may be
> worth it even on this 32 bit platform to sacrifice 1/8th of the virtual
> address space for the memmap.
How does moving to a virtual memmap help here? The virtual mem_map also
has to be allocated in KVA; any KVA used for it is not available to, and
thereby shrinks, zone NORMAL. The size of NORMAL on x86 is
defined by the addressable space in kernel mode (by KVA size): 1GB less
the other things we have mapped. The virtual map would be one of those.
> So far I am not seeing any convincing case for the current sparsemem table
> lookups. But there must have been some reason that such an implementation
> was chosen. What was it?
As I said the problem is not memory but KVA space. Zone NORMAL is all
the pages we can map into the kernel address space; it's 1GB less the
kernel itself, less vmap space. In the current NUMA scheme it's then
less the mem_map, allocated out of HIGHMEM but mapped into KVA. In
vmem_map it's allocated out of HIGHMEM but mapped into KVA. The loss is
the same.
-apw
* Re: virtual mmap basics
From: Christoph Lameter @ 2006-09-25 23:37 UTC
To: Andy Whitcroft; +Cc: linux-mm, ak
On Mon, 25 Sep 2006, Andy Whitcroft wrote:
> > Using a virtual memmap there would allow relocation of the memmap array
> > into high memory and would double the available low memory. So it may be
> > worth it even on this 32 bit platform to sacrifice 1/8th of the virtual
> > address space for the memmap.
>
> How does moving to a virtual memmap help here? The virtual mem_map also
> has to be allocated in KVA; any KVA used for it is not available to, and
> thereby shrinks, zone NORMAL. The size of NORMAL on x86 is
> defined by the addressable space in kernel mode (by KVA size): 1GB less
> the other things we have mapped. The virtual map would be one of those.
Hmmm... Strange architecture, and I may be a bit ignorant on this one. You
could reserve the 1st GB for the kernel 1-1 mapping, the 2nd GB for VMALLOC /
the virtual memmap, and the remaining 2GB for user space? Probably won't
work since you would have to decrease user space from 3GB to 2GB.

Having the virtual memmap in high memory also allows you to place the
sections of the memmap that cover the memory of a node into the memory of
the node itself. This alone would get you a nice performance boost.
> > So far I am not seeing any convincing case for the current sparsemem table
> > lookups. But there must have been some reason that such an implementation
> > was chosen. What was it?
>
> As I said the problem is not memory but KVA space. Zone normal is all
> the pages we can map into the kernel address space, its 1Gb less the
> kernel itself, less vmap space. In the current NUMA scheme its then
> less the mem_map allocated out of HIGHMEM but mapped into KVA. In
> vmem_map its allocated out of HIGHMEM but mapped into KVA. The loss is
> the same.
Yup, the only way around it would be to decrease user space sizes.

But then we are talking about a rare breed of NUMA machine, right?
* Re: virtual mmap basics
From: Andy Whitcroft @ 2006-09-26 12:06 UTC
To: Christoph Lameter; +Cc: linux-mm, ak
Christoph Lameter wrote:
> On Mon, 25 Sep 2006, Andy Whitcroft wrote:
>
>>> Using a virtual memmap there would allow relocation of the memmap array
>>> into high memory and would double the available low memory. So it may be
>>> worth it even on this 32 bit platform to sacrifice 1/8th of the virtual
>>> address space for the memmap.
>> How does moving to a virtual memmap help here? The virtual mem_map also
>> has to be allocated in KVA; any KVA used for it is not available to, and
>> thereby shrinks, zone NORMAL. The size of NORMAL on x86 is
>> defined by the addressable space in kernel mode (by KVA size): 1GB less
>> the other things we have mapped. The virtual map would be one of those.
>
> Hmmm... Strange architecture, and I may be a bit ignorant on this one. You
> could reserve the 1st GB for the kernel 1-1 mapping, the 2nd GB for VMALLOC /
> the virtual memmap, and the remaining 2GB for user space? Probably won't
> work since you would have to decrease user space from 3GB to 2GB.
Not strange, just limited. Any 32bit architecture has this limitation:
it can only map a limited amount of memory into the kernel at the same
time.

Yes, some users already change their U/K split on 32bit to do this, but
as you say it's not a general solution here.
> Having the virtual memmap in high memory also allows you to place the
> sections of the memmap that cover the memory of a node into the memory of
> the node itself. This alone would get you a nice performance boost.
We already do this. We don't allocate mem_map out of zone NORMAL; we
pull it from the end of each node. This is then mapped after the end of
zone NORMAL and before vmap space. mem_map is physically node local,
but mapped into KVA.
>>> So far I am not seeing any convincing case for the current sparsemem table
>>> lookups. But there must have been some reason that such an implementation
>>> was chosen. What was it?
>> As I said the problem is not memory but KVA space. Zone NORMAL is all
>> the pages we can map into the kernel address space; it's 1GB less the
>> kernel itself, less vmap space. In the current NUMA scheme it's then
>> less the mem_map, allocated out of HIGHMEM but mapped into KVA. In
>> vmem_map it's allocated out of HIGHMEM but mapped into KVA. The loss is
>> the same.
>
> Yup, the only way around it would be to decrease user space sizes.
>
> But then we are talking about a rare breed of NUMA machine, right?

We are talking about any 32bit NUMA. Getting rarer, yes.
-apw
* Re: virtual mmap basics
From: Andy Whitcroft @ 2006-09-25 18:09 UTC
To: Christoph Lameter; +Cc: linux-mm
Christoph Lameter wrote:
> On Mon, 25 Sep 2006, Andy Whitcroft wrote:
>
>> pfn_valid is most commonly required on virtual mem_map setups as their
>> implementation (currently) violates the 'contiguous and present out to
>> MAX_ORDER' constraint that the buddy allocator expects. So we have
>> additional frequent pfn_valid checks in the allocator when there
>> are holes within zones (which are virtual memmaps in all but name).
>
> Why would the page allocator require frequent calls to pfn_valid? Once
> you have the free lists set up there is no need for it, AFAIK.
>
> Still, pfn_valid with a virtual memmap is comparable to sparse's
> current implementation. If the cpu has an instruction to check the
> validity of an address then it will be superior.
If you are not guaranteeing contiguity of mem_map out to MAX_ORDER you
have to add additional checks. These are only enabled on ia64, see
CONFIG_HOLES_IN_ZONE, and only if we have VIRTUAL_MEM_MAP defined. As a
key example, when this is defined we have to add a
pfn_valid(page_to_pfn()) stanza to page_is_buddy(), which is used heavily
on page free. This is a problem when this check is not cheap, as
appears to be true on ia64 where we do a number of checks on segment
boundaries, then we try to read the first word of the entry. This is
done as a user access, and if my reading is correct we take and handle a
fault if the page is missing. This on top of the fetches required to
load the MMU sounds like it increases, not decreases, the complexity of
this operation?
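For reference, the ia64 check looks roughly like this (quoting from
memory, so treat it as a sketch): it reads a byte of the struct page via
__get_user() and relies on the exception tables to catch the fault when
the backing vmem_map page is missing.

int ia64_pfn_valid(unsigned long pfn)
{
	char byte;
	struct page *pg = pfn_to_page(pfn);

	/* readable at the start, and at the last byte if the
	   struct page straddles a page boundary */
	return (__get_user(byte, (char __user *)pg) == 0)
		&& ((((u64)pg & PAGE_MASK) == (((u64)(pg + 1) - 1) & PAGE_MASK))
		    || (__get_user(byte, (char __user *)(pg + 1) - 1) == 0));
}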
>> We also need to consider the size of the mem_map. The reason we have a
>> problem with smaller machines is that virtual space in zone NORMAL is
>> limited. The mem_map here has to be contiguous and sparse in KVA; this
>> is exactly the resource we are short of.
>
> The point of the virtual memmap is that it does not have to be (physically)
> contiguous and it is sparse. Sparsemem could use that format and then we
> would be able to optimize important VM functions such as virt_to_page() and
> page_address().
The point I am making here is that it's not the cost of storage of the active
segments of the mem_map that is the issue. We have GBs of memory in
highmem we can use to back it. The problem is the kernel virtual
address space we need to use to represent the mem_map, which includes
the holes; on 32bit it is this KVA which is in short supply. We cannot
reuse the holes as they are needed by the implementation.

The problem we have is that 32bit needs sparsemem to be truly sparse in
KVA terms. So we need a sparse implementation which keeps the KVA
footprint down; the virtual mem_map cannot cater to that usage model.

It may have value for 64bit systems, but I'd like to see some
comparative numbers showing the benefit, as to my eye at least you are
hiding much of the work to be done, not eliminating it. And at least in
some cases adding significant overhead.
-apw
* Re: virtual mmap basics
From: Christoph Lameter @ 2006-09-25 21:00 UTC
To: Andy Whitcroft; +Cc: linux-mm
On Mon, 25 Sep 2006, Andy Whitcroft wrote:
> If you are not guaranteeing contiguity of mem_map out to MAX_ORDER you
> have to add additional checks. These are only enabled on ia64, see
> CONFIG_HOLES_IN_ZONE, and only if we have VIRTUAL_MEM_MAP defined. As a
> key example, when this is defined we have to add a
> pfn_valid(page_to_pfn()) stanza to page_is_buddy(), which is used heavily
> on page free. This is a problem when this check is not cheap, as
> appears to be true on ia64 where we do a number of checks on segment
> boundaries, then we try to read the first word of the entry. This is
> done as a user access, and if my reading is correct we take and handle a
> fault if the page is missing. This on top of the fetches required to
> load the MMU sounds like it increases, not decreases, the complexity of
> this operation?
Ahh, the buddy checks. The node structure contains the pfn boundaries,
which could be checked. The check can be implemented in a cheap way on IA64
because we have an instruction to check the validity of a mapping.
> > The point of the virtual memmap is that it does not have to be (physically)
> > contiguous and it is sparse. Sparsemem could use that format and then we
> > would be able to optimize important VM functions such as virt_to_page() and
> > page_address().
>
> The point I am making here is that it's not the cost of storage of the active
> segments of the mem_map that is the issue. We have GBs of memory in
> highmem we can use to back it. The problem is the kernel virtual
> address space we need to use to represent the mem_map, which includes
> the holes; on 32bit it is this KVA which is in short supply. We cannot
> reuse the holes as they are needed by the implementation.
I just talked with Martin and he told me that the address space on 32 bit
systems must be mostly linear due to the scarcity of it. So I cannot see
any issue there.
> The problem we have is that 32bit needs sparsemem to be truly sparse in
> KVA terms. So we need a sparse implementation which keeps the KVA
> footprint down; the virtual mem_map cannot cater to that usage model.
Huh? I have given some numbers in another thread that contradict this.
> It may have value for 64bit systems, but I'd like to see some
> comparative numbers showing the benefit, as to my eye at least you are
> hiding much of the work to be done, not eliminating it. And at least in
> some cases adding significant overhead.
Multiple lookups in virt_to_page and page_address compared to none is not
enough? Are you telling me that multiple table lookups are
performance-wise better than a simple address calculation?

I really wish you could show one case in which the virtual memmap approach
would not be advantageous. It looks as if this may be somehow possible
with sparse on 32 bit, but I do not understand how this could be possible
given the lack of sparsity of a 32 bit address space.
* virtual memmap sparsity: Dealing with fragmented MAX_ORDER blocks
From: Christoph Lameter @ 2006-09-25 23:54 UTC
To: Andy Whitcroft; +Cc: linux-mm
Regarding buddy checks that reach outside the memmap:

1. This problem only occurs if we allow fragments of MAX_ORDER sized
segments. The default needs to be not to allow that. Then we do not
need any checks like right now on IA64. Why would one want smaller
granularity than 2M/4M in hotplugging?

2. If you must have these fragments then we need to check the validity
of the buddy pointers before dereferencing them to see if pages can
be combined. If fragments are permitted then a
special function needs to be called to check if the address we are
accessing is legit. Preferably this would be done with an instruction
that can use the MMU to verify that the address is valid.

On IA64 this is done with the "probe" instruction.
Looking through the i386 instructions I see a VERR mnemonic that
I guess will do what you need on i386 and x86_64, in order to do
what we need without a page table walk.
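Back to the IA64 probe instruction: a hypothetical wrapper might look
like the sketch below. The inline asm form and the privilege-level
operand are assumptions, not tested code.

static inline int probe_readable(void *addr)
{
	unsigned long ok;

	/* probe.r: nonzero if addr is readable at privilege level 0 */
	asm volatile ("probe.r %0 = %1, 0" : "=r" (ok) : "r" (addr));
	return ok != 0;
}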
* Re: virtual memmap sparsity: Dealing with fragmented MAX_ORDER blocks
From: Christoph Lameter @ 2006-09-26 0:31 UTC
To: Andy Whitcroft; +Cc: linux-mm
On Mon, 25 Sep 2006, Christoph Lameter wrote:
> Looking through the i386 instructions I see a VERR mnemonic that
> I guess will do what you need on i386 and x86_64, in order to do
> what we need without a page table walk.
I think I guessed wrong: VERR does something with segments???

We could sidestep the issue by not marking the huge page
non-present but pointing it to a pte page with all pointers to the
zero page.
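A sketch of filling such a pte page (hypothetical helper, assuming the
generic pte_alloc_kernel()/set_pte_at() helpers and a read-only kernel
protection):

static void map_memmap_hole_to_zero_page(pmd_t *pmd, unsigned long addr)
{
	unsigned long end = addr + PMD_SIZE;
	pte_t *pte = pte_alloc_kernel(pmd, addr);

	if (!pte)
		return;
	for (; addr < end; addr += PAGE_SIZE, pte++)
		set_pte_at(&init_mm, addr, pte,
			   mk_pte(ZERO_PAGE(addr), PAGE_KERNEL_RO));
}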
All page flags will be cleared in the page structs in the zero page and
thus we cannot reference an invalid address. Therefore:
static inline int page_is_buddy(struct page *page, struct page *buddy,
				int order)
{
#ifdef CONFIG_HOLES_IN_ZONE
	if (!pfn_valid(page_to_pfn(buddy)))
		return 0;
#endif

	if (page_zone_id(page) != page_zone_id(buddy))
		return 0;

	if (PageBuddy(buddy) && page_order(buddy) == order) {
		BUG_ON(page_count(buddy) != 0);
		return 1;
	}
	return 0;
}
can become
static inline int page_is_buddy(struct page *page, struct page *buddy,
				int order)
{
	if (page_zone_id(page) != page_zone_id(buddy))
		return 0;

	if (PageBuddy(buddy) && page_order(buddy) == order) {
		BUG_ON(page_count(buddy) != 0);
		return 1;
	}
	return 0;
}
for all cases.
Also note that page_zone_id(page) would then no longer need a lookup of
page->flags. We just need to ensure that both pages are in the same
MAX_ORDER group. For that to be true the upper portion of the addresses
must match.

int page_zone_id(struct page *page)
{
	return page_to_pfn(page) >> MAX_ORDER;
}
* Re: virtual memmap sparsity: Dealing with fragmented MAX_ORDER blocks
From: Andy Whitcroft @ 2006-09-26 12:11 UTC
To: Christoph Lameter; +Cc: linux-mm
Christoph Lameter wrote:
> On Mon, 25 Sep 2006, Christoph Lameter wrote:
>
>> Looking through the i386 instructions I see a VERR mnemonic that
>> I guess will do what you need on i386 and x86_64, in order to do
>> what we need without a page table walk.
>
> I think I guessed wrong: VERR does something with segments???
>
> We could sidestep the issue by not marking the huge page
> non-present but pointing it to a pte page with all pointers to the
> zero page.
Well, we'd really want it to be a page of page structs marked
PG_reserved and probably in an invalid zone or some such, to prevent
them coalescing with logically adjacent buddies.
> Also note that page_zone_id(page) would then no longer need a lookup of
> page->flags. We just need to ensure that both pages are in the same
> MAX_ORDER group. For that to be true the upper portion of the addresses
> must match.
>
> int page_zone_id(struct page *page)
> {
> 	return page_to_pfn(page) >> MAX_ORDER;
> }
I don't think this is correct. Zones are not guaranteed to be MAX_ORDER
aligned, and we cannot allow pages in different zones to be coalesced.
Indeed, as often they are properly aligned, we looked at taking this test
out very recently. The complexity and risk was deemed too great. It is
after all a simple check against a cache line which is about to be used
and checked in the very next statement, so the cost is very small.
-apw
* Re: virtual memmap sparsity: Dealing with fragmented MAX_ORDER blocks
From: Christoph Lameter @ 2006-09-26 15:23 UTC
To: Andy Whitcroft; +Cc: linux-mm
On Tue, 26 Sep 2006, Andy Whitcroft wrote:
> Well, we'd really want it to be a page of page structs marked
> PG_reserved and probably in an invalid zone or some such, to prevent
> them coalescing with logically adjacent buddies.
If we use a zero page for the memory map then we have a series of
struct pages with the PageBuddy flag cleared. No merging will occur.
* Re: virtual memmap sparsity: Dealing with fragmented MAX_ORDER blocks
From: Mel Gorman @ 2006-09-26 8:16 UTC
To: Christoph Lameter; +Cc: Andy Whitcroft, linux-mm
On Mon, 25 Sep 2006, Christoph Lameter wrote:
> Regarding buddy checks that reach outside the memmap:
>
> 1. This problem only occurs if we allow fragments of MAX_ORDER sized
> segments. The default needs to be not to allow that. Then we do not
> need any checks like right now on IA64. Why would one want smaller
> granularity than 2M/4M in hotplugging?
>
On a local IA64 machine, the MAX_ORDER block of pages is not 2M or 4M but
1GB: that is a base page size of 16K and a MAX_ORDER of 17, so a block
spans PAGE_SIZE << (MAX_ORDER - 1) = 16K << 16 = 1GB. At best,
MAX_ORDER could be fixed to present 256MB but there would be wastage.
> 2. If you must have these fragments then we need to check the validity
> of the buddy pointers before dereferencing them to see if pages can
> be combined.
i.e. pfn_valid()
> If fragments are permitted then a
> special function needs to be called to check if the address we are
> accessing is legit. Preferably this would be done with an instruction
> that can use the MMU to verify that the address is valid.
>
> On IA64 this is done with the "probe" instruction.
>
Why does IA64 not use this then? Currently, it uses __get_user() and
catches faults when they occur.
> Looking through the i386 instructions I see a VERR mnemonic that
> I guess will do what you need on i386 and x86_64, in order to do
> what we need without a page table walk.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab