[RFC] 2-pointer PTE chaining idea

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC] 2-pointer PTE chaining idea
@ 2001-01-19  5:08 Rik van Riel
  2001-01-19  6:57 ` David S. Miller
  2001-01-19 11:37 ` Ingo Molnar
  0 siblings, 2 replies; 10+ messages in thread
From: Rik van Riel @ 2001-01-19  5:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: davem, linux-mm, Stephen C. Tweedie, Matthew Dillon

Hi,

I think I have come up with an idea that would allow us to
do pte-chaining with an overhead of just 2 pointers (8 bytes
on x86) per pte ...

This idea depends on the fact that each page table is only
used by one mm_struct, which is probably a safe assumption
since shared page tables are a big can of worms on some
architectures (shared TLB entries on machines with virtually
indexed caches, etc.).

	Basics:

The pte chain entries will look like this:

struct pte_chain {
	struct pte_chain * next;
	struct pte_t * pte;
};

... where *next points to the next pte_chain entry in the list
of pte_chains for this page and *pte points directly to the
page table entry that maps this page.

>From the pte_chain pointer, refill_inactive_scan can directly
do test_and_clear_referenced(pte) to implement the page aging.

In order to find the vma and the mm_struct each pte belongs to,
we can use the ->mapping and ->index fields in the page_struct
of the page table, with the ->mapping pointing to the mm_struct
and the ->index containing the offset within the mm_struct, so
we can support non-aligned or even page-sized VMAs ... this means
walking 2 or 3 pointers extra when unmapping a page, but the page
scanning is really cheap.

	Allocation trick / overhead:

In order to avoid failing allocations for pte_chain structs and
nasty things like that, we can simply unmap a pte whenever we run
out of reverse mapping space or when an allocation fails.

	DIRTY trick you don't want to read:

We could reduce the overhead further by using the dirty trick of
having a set of 2^16 struct pte_chains per N pages of physical
memory.

This way we can remove the *next pointer and turn that into an
array offset for this smaller local array, shaving off 2 bytes
in the struct pte_chain. It also means that we don't need to
have a pointer in the page_struct, but can subdivide the word
that's currently taken by page->age and put a starting array
offset into 16 bits out of this word.

	(as I said .. DIRTY, please ignore)

Would this main idea (with the 2-pointer struct pte_chain) work
for pte chaining and do everything we want ?

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-19  5:08 [RFC] 2-pointer PTE chaining idea Rik van Riel
@ 2001-01-19  6:57 ` David S. Miller
  2001-01-19  7:17   ` Linus Torvalds
  2001-01-19 11:37 ` Ingo Molnar
  1 sibling, 1 reply; 10+ messages in thread
From: David S. Miller @ 2001-01-19  6:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, linux-mm, Stephen C. Tweedie, Matthew Dillon

Rik van Riel writes:
 > In order to find the vma and the mm_struct each pte belongs to,
 > we can use the ->mapping and ->index fields in the page_struct
 > of the page table, with the ->mapping pointing to the mm_struct
 > and the ->index containing the offset within the mm_struct

Anonymous pages have no page->mapping, how can this work?

Later,
David S. Miller
davem@redhat.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-19  6:57 ` David S. Miller
@ 2001-01-19  7:17   ` Linus Torvalds
  2001-01-19  7:55     ` Rik van Riel
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Linus Torvalds @ 2001-01-19  7:17 UTC (permalink / raw)
  To: David S. Miller
  Cc: Rik van Riel, linux-mm, Stephen C. Tweedie, Matthew Dillon

On Thu, 18 Jan 2001, David S. Miller wrote:

> 
> Rik van Riel writes:
>  > In order to find the vma and the mm_struct each pte belongs to,
>  > we can use the ->mapping and ->index fields in the page_struct
>  > of the page table, with the ->mapping pointing to the mm_struct
>  > and the ->index containing the offset within the mm_struct
> 
> Anonymous pages have no page->mapping, how can this work?

Note the "in the page struct of the page table".
                                    ^^^^^^^^^^

What Rik is saying is that if your page tables themselves are full pages
(which is not true everywhere, but hey, close enough), you can use the
"struct page *" of the _page_table_ page to save off the "struct
mm_struct" pointer, along with the base in the mm_struct. You can then lok
up the vma the normal way (get the mm->page_table_lock, and search for it,
you know where the page table entry is in the page table, and you know
where the page table itself is virtually).

It doesn't help us, though. 2 or 3 pointers doesn't make any difference on
x86, at least: the 3-pointer-scheme had a "next, prev, mm" pointer triple,
and there is an _implied_ pointer pointing to the page table entry itself,
that Rik probably forgot about.

The only sane way I can think of to do the "implied pointer" is to do an
order-2 allocation when you allocate a page directory: you get 16kB of
memory, and you use the low 4kB for the hardware page table stuff, with
the upper 3kB for the pointers associated with the page table entries. You
do this for two reasons: 
 - still only one allocation
 - this way you can get from the pointers to the page table entry (and
   the other way around) by arithmetic rather than having to have a
   pointer to the page table entry.

But this also means that If you need 2 pointers, you might as well use 3
extra words anyway, because otherwise you'd just have an unused 32-bit
word for every page table entry anyway.

Whatever. Maybe it can be done other ways. The fact that the way I thought
to implement it was with an order-2 allocation to do this efficiently is
what really killed it for me. Maybe Rik has other ideas. I don't think
order-2 allocations are acceptable.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-19  7:17   ` Linus Torvalds
@ 2001-01-19  7:55     ` Rik van Riel
  2001-01-20  5:22       ` Linus Torvalds
  2001-01-19 11:49     ` Ingo Molnar
  2001-01-20  6:58     ` Rik van Riel
  2 siblings, 1 reply; 10+ messages in thread
From: Rik van Riel @ 2001-01-19  7:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, linux-mm, Stephen C. Tweedie, Matthew Dillon

On Thu, 18 Jan 2001, Linus Torvalds wrote:
> On Thu, 18 Jan 2001, David S. Miller wrote:
> > Rik van Riel writes:
> >  > In order to find the vma and the mm_struct each pte belongs to,
> >  > we can use the ->mapping and ->index fields in the page_struct
> >  > of the page table, with the ->mapping pointing to the mm_struct
> >  > and the ->index containing the offset within the mm_struct
> >
> > Anonymous pages have no page->mapping, how can this work?
>
> Note the "in the page struct of the page table".
>                                     ^^^^^^^^^^
>
> What Rik is saying is that if your page tables themselves are full pages
> (which is not true everywhere, but hey, close enough), you can use the
> "struct page *" of the _page_table_ page to save off the "struct
> mm_struct" pointer, along with the base in the mm_struct.

> It doesn't help us, though. 2 or 3 pointers doesn't make any difference on
> x86, at least: the 3-pointer-scheme had a "next, prev, mm" pointer triple,
> and there is an _implied_ pointer pointing to the page table entry itself,
> that Rik probably forgot about.

Actually, the pointer is to the page table entry ... on systems
where the page table is a multiple of the full page we know that
the page table itself has address:

page_table = pte_t & ~(PAGE_TABLE_SIZE - 1);

And from there we can easily get the struct page *.

> The only sane way I can think of to do the "implied pointer" is to do an
> order-2 allocation when you allocate a page directory: you get 16kB of

How about doing an order-1 allocation and having a singly linked
list ?

The structure would then look like this (on x86)

struct bidir_page_table {
	struct pte_t pte[1024];
	void * next[1024];
};

With next[400]:
- indicating that pte[400] is in the pte chain we're currently
  searching
- pointing to the next pointer in the pte chain, much like used
  block listed in the FAT filesystem

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-19  7:55     ` Rik van Riel
@ 2001-01-20  5:22       ` Linus Torvalds
  2001-01-20  5:31         ` David S. Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2001-01-20  5:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David S. Miller, linux-mm, Stephen C. Tweedie, Matthew Dillon

On Fri, 19 Jan 2001, Rik van Riel wrote:
> 
> > The only sane way I can think of to do the "implied pointer" is to do an
> > order-2 allocation when you allocate a page directory: you get 16kB of
> 
> How about doing an order-1 allocation and having a singly linked
> list ?

The thing is, that _whatever_ you do, I think it's going to suck.

I'll tell you why: I think you're trying to optimize the uncommon case. 

I realize that you think that page table scanning is slow etc. I happen to
think it's acceptable, but never mind that. More important is the fact
that NOT scanning the page tables is what is the normal case BY FAR.

Do you actually have any profiles showing that scanning the page tables is
a problem? I realize that you can create loads that scan the page tables a
lot, but have you really understood and internalized the fact that those
same loads thend to have a CPU usage of just a few percent? The bad loads
tend to spend more time waiting for IO to complete because everybody is
busy SWAPPING.

And you have to realize, that it doesn't MATTER if we spend even 25% of
the CPU power on scanning the page tables (and I want to point out that
I've never heard of such a load), if we spend 50% idle just waiting for
the disk (and the rest of the time mayb eworking or in other VM routines).

This is why I don't think this "try to be clever to avoid work when
swapping" approach is really all that relevant.

There are two IMPORTANT things to do in the VM layer:

 - select the right pages. Don't worry too much about CPU at this point:
   if you have to do IO it's ok to waste some cycles per page. You'll win
   bigger from selecting the right page, than from trying to make the
   infrastructure really cheap.

 - DO NOT WASTE TIME IF YOU HAVE MEMORY!

Th esecond point is important. You have to really think about how Linux
handles anonymous pages, and understand that that is not just an accident.
It's really important to NOT do extra work for the case where an
application just wants a page. Don't allocate swap backing store early.
Don't add it to the page cache if it doesn't need to be there. Don't do
ANYTHING.

This, btw, also implies: don't make the page tables more complex.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-20  5:22       ` Linus Torvalds
@ 2001-01-20  5:31         ` David S. Miller
  2001-01-20  7:05           ` Rik van Riel
  0 siblings, 1 reply; 10+ messages in thread
From: David S. Miller @ 2001-01-20  5:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, linux-mm, Stephen C. Tweedie, Matthew Dillon

Linus Torvalds writes:
 >  - DO NOT WASTE TIME IF YOU HAVE MEMORY!
 > 
 > The second point is important. You have to really think about how Linux
 > handles anonymous pages, and understand that that is not just an accident.
 > It's really important to NOT do extra work for the case where an
 > application just wants a page. Don't allocate swap backing store early.
 > Don't add it to the page cache if it doesn't need to be there. Don't do
 > ANYTHING.
 > 
 > This, btw, also implies: don't make the page tables more complex.

I have to concur.  The more I think about the whole idea of
pte-chaining the more I dislike it and think work on it is a waste of
time.  I can say this, being that I actually tried once to fully make
such a scheme work.

My old gripe and incentive to do such things has honestly
disappeared.  The big issue was cache aliasing problems on
some silly cpus, but the current recommented APIs described
in Documentation/cachetlb.txt can handle such situations quite
acceptably.

Basically, that would leave us with the issue of choosing anonymous
pages to tap out correctly.  I see nothing that prevents our page
table scanning from being fundamentally unable to do quite well in
this area.  Sure, true LRU aging of anonymous pages alongside all the
other reclaimable pages in the system is not possible now, but I
cannot provably show that this is actually required for good behavior.

-DaveM

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-20  5:31         ` David S. Miller
@ 2001-01-20  7:05           ` Rik van Riel
  0 siblings, 0 replies; 10+ messages in thread
From: Rik van Riel @ 2001-01-20  7:05 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, linux-mm, Stephen C. Tweedie, Matthew Dillon

On Fri, 19 Jan 2001, David S. Miller wrote:
> Linus Torvalds writes:
>  >  - DO NOT WASTE TIME IF YOU HAVE MEMORY!
	[snip]
>  > This, btw, also implies: don't make the page tables more complex.
>
> I have to concur.

> Basically, that would leave us with the issue of choosing anonymous
> pages to tap out correctly.  I see nothing that prevents our page
> table scanning from being fundamentally unable to do quite well in
> this area.

I agree with this. However, having more uniform page aging
could lead to better page replacement and this pte chaining
thing is something I'd still like to try. ;)

If it turns out to be a win (with no measurable losses) I
may even submit a patch, but if it turns out to be a loss
I'll just drop the idea...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-19  7:17   ` Linus Torvalds
  2001-01-19  7:55     ` Rik van Riel
@ 2001-01-19 11:49     ` Ingo Molnar
  2001-01-20  6:58     ` Rik van Riel
  2 siblings, 0 replies; 10+ messages in thread
From: Ingo Molnar @ 2001-01-19 11:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, Rik van Riel, linux-mm, Stephen C. Tweedie,
	Matthew Dillon

On Thu, 18 Jan 2001, Linus Torvalds wrote:

> Whatever. Maybe it can be done other ways. The fact that the way I
> thought to implement it was with an order-2 allocation to do this
> efficiently is what really killed it for me. [...]

it can be done by 'implicitly' linking the soft and hard table via putting
it on the same 8K-aligned order-2 page, but linking them through their
mem_map[] entries. The fields of mem_map[] entries of ordinary pagetables
are largely unused, they are privately allocated pages.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-19  7:17   ` Linus Torvalds
  2001-01-19  7:55     ` Rik van Riel
  2001-01-19 11:49     ` Ingo Molnar
@ 2001-01-20  6:58     ` Rik van Riel
  2 siblings, 0 replies; 10+ messages in thread
From: Rik van Riel @ 2001-01-20  6:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, linux-mm, Stephen C. Tweedie, Matthew Dillon

On Thu, 18 Jan 2001, Linus Torvalds wrote:

> The only sane way I can think of to do the "implied pointer" is
> to do an order-2 allocation when you allocate a page directory:

While this idea seemed the best one at first glance, after
thinking about it a bit more I think your idea may actually
have _higher_ overhead than my idea of keeping the pte chain
structures external.

The reason for this is three-fold. Firstly, a lot of the page
tables will only be "occupied" for a small percentage. I don't
know the numbers, but I wouldn't be surprised if the page table
"occupation" is well under 50% for programs that are fully
resident ... probably less for programs which are partly swapped
out.

Secondly, if we do "dynamic" pte chaining, we can free up or
re-use the pte_chain structure as soon as we unmap a page, so
swapping out a page will free up the pte chain structure, which
is a big improvement compared to the unswappable page tables.

Thirdly, this idea doesn't suffer from memory fragmentation and
also works efficiently on architectures where the page table size
isn't equal to the page size.

Ideas ?

(btw, if I'm unlucky I won't be online again until the 26th)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] 2-pointer PTE chaining idea
  2001-01-19  5:08 [RFC] 2-pointer PTE chaining idea Rik van Riel
  2001-01-19  6:57 ` David S. Miller
@ 2001-01-19 11:37 ` Ingo Molnar
  1 sibling, 0 replies; 10+ messages in thread
From: Ingo Molnar @ 2001-01-19 11:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, David S. Miller, linux-mm, Stephen C. Tweedie,
	Matthew Dillon

On Fri, 19 Jan 2001, Rik van Riel wrote:

> The pte chain entries will look like this:
>
> struct pte_chain {
> 	struct pte_chain * next;
> 	struct pte_t * pte;
> };

why not just use a 'shadow' pagetable for every pagetable. The 'shadow'
pagetable has the same physical structure but has soft data. So for every
pte one has 32 bits (well, sizeof(pte_t) bytes) worth of extra
information.

the most obvious implementation would be to use an order-2 allocation, but
that is problematic on low memory systems (which we are trying to optimize
...). BUT, maybe it's not all that problematic, since we have reverse ptes
already :-) [catch-22]

a variation of this scheme that avoids the order-2 allocation is to use an
explicit (not implicit), per-pagetable pointer, by (ab)using the
pagetable's page->mapping or page->list pointer. This way the 'soft' part
of the pagetable can be allocated anywhere, and can be found via
page->list.next. [and the soft table points to the hardware table via
page->list.next as well] (The pagetable's page->list is an unused field.)

traversing the pte list (chain) of alias mappings goes like this:

 pte_t * get_next_pte(pte_t *pte)
 {
	soft_table = (mem_map + MAP_NR(pte))->list.next;
	next_pte = soft_table + ((pte & ~PAGE_MASK) >> PTE_SHIFT);
 }

it's fast, O(1) and has a 1-pointer overhead per pte and uses PAGE_SIZE
allocations only. Important: there is no extra allocation overhead while
establishing mappings. It works on every architecture, because the
allocation 'mirrors' that of the real pagetable's allocation.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2001-01-20  7:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-01-19  5:08 [RFC] 2-pointer PTE chaining idea Rik van Riel
2001-01-19  6:57 ` David S. Miller
2001-01-19  7:17   ` Linus Torvalds
2001-01-19  7:55     ` Rik van Riel
2001-01-20  5:22       ` Linus Torvalds
2001-01-20  5:31         ` David S. Miller
2001-01-20  7:05           ` Rik van Riel
2001-01-19 11:49     ` Ingo Molnar
2001-01-20  6:58     ` Rik van Riel
2001-01-19 11:37 ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox