* reverse pte lookups and anonymous private mappings; avl trees?
@ 1998-03-02 19:04 Benjamin C.R. LaHaise
1998-03-02 23:03 ` Stephen C. Tweedie
0 siblings, 1 reply; 9+ messages in thread
From: Benjamin C.R. LaHaise @ 1998-03-02 19:04 UTC (permalink / raw)
To: linux-mm; +Cc: torvalds
Hello,
Okay, I've been ripping my hair out this weekend trying to get my reverse
pte lookups working using the inode/vma walking scheme, and I'm about to
go mad because of it. Here are the alternatives I've come up with.
Please give me some comments as to which one people think is the best from
a design standpoint (hence the cc to Linus) and performance wise, too.
Note that I'm ignoring the problem of overlapping inode/offset pairs given
the new swap cache code. (this is probably riddled with typos/thinkos)
a) add a struct vm_area_struct pointer and vm_offset to struct
page. To each vm_area_struct, add a pair of pointers so that there's a
list of vm_area_structs which are private, but can be sharing pages.
Hence we end up with something like:
page--> vma -> vm_next_private -> ... (circular list)
         |                |
       vm_mm            vm_mm
         |                ...
        ...    (use page->vm_offset - vma->vm_offset + vma->vm_start)
         |
        pte
This, I think, is the cleanest approach. It makes life easy for finding
shared anonymous pages, but has the downside of adding 8 bytes to the
page map (16 on 64-bit machines), and the same to the vm_area_struct.
Perhaps the vma's vm_next_share/vm_pprev_share pointers could be reused
(although the separate private-mapping share list would make searching for
non-anonymous but private mappings faster).
b) per-vma inodes. This scheme is a headache. It would involve
linking the inodes in order to maintain a chain of the anonymous pages.
At fork() time, each private anonymous vma would need to have two new
inodes allocated (one for each task), which would point to the old inode.
Ptes are found using the inode, offset pair already in struct page, plus
walking the inode tree. Has the advantage that we can use the inode,
offset pair already in struct page, no need to grow struct vm_area_struct;
disadvantage: hairy, conflicts with the new swap cache code - requires
the old swap_entry to return to struct page.
c) per-mm inodes, shared on fork. Again, this one is a bit
painful, although less so than (b). This one requires that we allow
multiple pages in the page cache to exist with the same offset (ugly). Each
anonymous page gets the mm_struct's inode attached to it, with the virtual
address within the process' memory space being the offset. The ptes are
found in the same manner as for normal shared pages (inode->i_mmap->vmas).
Aliased page-cache entries are created on COW and mremap().
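For reference, the "normal shared pages" lookup that (c) would reuse is the
i_mmap walk. A rough sketch, using 2.1-era names, with the helper function
itself invented for illustration and all locking and error handling omitted:

	/* Visit every pte that might map a page-cache page by walking the
	 * vmas hanging off its inode.  Not every vma on i_mmap actually maps
	 * this physical page; a private COW copy may shadow it. */
	static void for_each_mapping_of(struct page *page)
	{
		struct vm_area_struct *vma;

		for (vma = page->inode->i_mmap; vma; vma = vma->vm_next_share) {
			unsigned long addr = page->offset - vma->vm_offset
						+ vma->vm_start;
			pte_t *pte;

			if (addr < vma->vm_start || addr >= vma->vm_end)
				continue;	/* page lies outside this mapping */

			pte = pte_offset(pmd_offset(pgd_offset(vma->vm_mm, addr),
						    addr), addr);
			if (pte_present(*pte) &&
			    MAP_NR(pte_page(*pte)) == page->map_nr)
				;	/* this vma really does map the page */
		}
	}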
My thoughts are that the only real options are (a) and (c). (a) seems to
be the cleanest conceptually, while (c) has very little overhead, but,
again, conflicts with the new swap cache...
On another note, is there any particular reason why the AVL tree for vma's
was removed in 2.1? Because of recent changes to use the struct file * in
the vma, vma's aren't going to be coalesced as much, and some of the
private/anon changes I'm suggesting could contribute to that even further.
I seem to remember someone suggesting the use of red-black trees as an
alternative, and methinks a friend has some code we can borrow. Just as a
note, with the introduction of PAM, quite a few daemons have ~30 vma's.
If each time we want to steal a page we have to do a 30 element list walk,
the complexity of the swapper remains a bit high in my opinion (as kswapd
is now).
-ben
* Re: reverse pte lookups and anonymous private mappings; avl trees?
1998-03-02 19:04 reverse pte lookups and anonymous private mappings; avl trees? Benjamin C.R. LaHaise
@ 1998-03-02 23:03 ` Stephen C. Tweedie
1998-03-02 23:37 ` Rik van Riel
1998-03-03 6:29 ` Benjamin C.R. LaHaise
0 siblings, 2 replies; 9+ messages in thread
From: Stephen C. Tweedie @ 1998-03-02 23:03 UTC (permalink / raw)
To: Benjamin C.R. LaHaise; +Cc: linux-mm, torvalds
Hi Ben,
On Mon, 2 Mar 1998 14:04:01 -0500 (EST), "Benjamin C.R. LaHaise"
<blah@kvack.org> said:
> Okay, I've been ripping my hair out this weekend trying to get my
> reverse pte lookups working using the inode/vma walking scheme, and
> I'm about to go mad because of it.
I'm not in the slightest bit surprised. I've been giving it a lot of
thought, and it just gets uglier and uglier. :)
The problem --- of course --- is that the inheritance relationship of
private pages is a tree. If we have a process which forks off several
children then in any given privately mapped vma, some pages can be
shared between all children. Other children may have their own
private copies of other pages. If _those_ children fork, we get a
subtree amongst which certain pages are shared only amongst that
subtree; other pages in the leaf processes are not shared anywhere at
all; and still other pages are shared by the whole tree.
Ouch. Basically, what it comes down to is that after forking, the vma
is not really a good unit by which to account for individual pages.
> Here are the alternatives I've come up with. Please give me some
> comments as to which one people think is the best from a design
> standpoint (hence the cc to Linus) and performance wise, too. Note
> that I'm ignoring the problem of overlapping inode/offset pairs
> given the new swap cache code. (this is probably riddled with
> typos/thinkos)
> a) add a struct vm_area_struct pointer and vm_offset to struct
> page. To each vm_area_struct, add a pair of pointers so that there's a
> list of vm_area_structs which are private, but can be sharing pages.
What about privately mapped images, such as shared libraries? Will
those vmas end up with one inode for the backing file and a
separate inode for the private pages? Such private vmas still have
anonymous pages in them.
> Hence we end up with something like:
> page--> vma -> vm_next_private -> ... (circular list)
>          |                |
>        vm_mm            vm_mm
>          |                ...
>         ...    (use page->vm_offset - vma->vm_offset + vma->vm_start)
>          |
>         pte
> This, I think, is the cleanest approach.
What happens to these vmas and inodes on fork()? On mremap()?
> It makes life easy for finding shared anonymous pages,
Shared anonymous pages are a very special case, and they are much much
much easier than the general case. With shared anonymous vmas, _all_
pages in the entire vma are guaranteed to be shared; there is no
mixing of shared and private pages as there is with a private map
(anonymous or not). My original plan with the swap cache was to
introduce a basis inode for vmas only for this special case; all other
cases are dealt with adequately by the swap cache scheme (not counting
the performance issues of being able to swap out arbitrary pages).
> b) per-vma inodes. This scheme is a headache. It would involve
> linking the inodes in order to maintain a chain of the anonymous
> pages.
> At fork() time, each private anonymous vma would need to have two new
> inodes allocated (one for each task), which would point to the old inode.
> Ptes are found using the inode, offset pair already in struct page, plus
> walking the inode tree. Has the advantage that we can use the inode,
> offset pair already in struct page, no need to grow struct vm_area_struct;
> disadvantage: hairy, conflicts with the new swap cache code - requires
> the old swap_entry to return to struct page.
This is getting much closer to the FreeBSD scheme, which has similar
stacks of inodes for each anonymous vma. It is a bit cumbersome, but
ought to work pretty well.
> c) per-mm inodes, shared on fork. Again, this one is a bit
> painful, although less so than b. This one requires that we allow
> multiple pages in the page cache to exist with the same offset (ugly).
Very. :)
> Each anonymous page gets the mm_struct's inode attached to it, with
> the virtual address within the process' memory space being the
> offset. The ptes are found in the same manner as for normal shared
> pages (inode->i_mmap->vmas). Aliased page-cache entries are created
> on COW and mremap().
> My thoughts are that the only real options are (a) and (c). (a) seems to
> be the cleanest conceptually, while (c) has very little overhead, but,
> again, conflicts with the new swap cache...
Yes. The conclusion I came to a while ago is that (c) is the only way
to go if we do this.
This leaves us with one main question --- how do we select pages for
swapout? We either do it via a scan of the page tables, as currently
happens in vmscan, or with a scan of physical pages as shrink_mmap()
does. The changes described here are only useful if we take the
second choice.
My personal feeling at the moment is that the second option is going
to give us lower overhead swapping and a much fairer balance between
private data and the page cache. However, there are other
arguments; in particular, it is much easier to do RSS quotas and
guarantees if we are swapping on a per page-table basis.
If we do make this change, then some of the new swap cache code
becomes redundant. The next question is, how do we cache swap pages
in this new scenario? We still need a swap cache mechanism, both to
support proper readahead for swap and to allow us to defer the freeing
of swapped out pages until the last possible moment. Now, if we know
we can atomically remove a page from all ptes at once, then it is
probably sufficient to unhook the struct page from the (inode-vma,
offset) hash and rehash it as (swapper-inode, entry). It gets harder
if we want per-process control of RSS, since in that case we want to
use the same physical page for both vma and swap, and we will really
need a quick lookup from both indexes.
I'm not at all averse to growing the struct page if we absolutely have
to, however. If we can effectively keep several hundred more pages in
memory as a result of not having to throw data away just to cope with
the worst-case interrupt memory requirements, then we will still see a
great overall performance improvement.
Cheers,
Stephen.
* Re: reverse pte lookups and anonymous private mappings; avl trees?
1998-03-02 23:03 ` Stephen C. Tweedie
@ 1998-03-02 23:37 ` Rik van Riel
1998-03-03 6:29 ` Benjamin C.R. LaHaise
1 sibling, 0 replies; 9+ messages in thread
From: Rik van Riel @ 1998-03-02 23:37 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Benjamin C.R. LaHaise, linux-mm, torvalds
On Mon, 2 Mar 1998, Stephen C. Tweedie wrote:
[second option == linear memory scanning à la shrink_mmap()]
> My personal feeling at the moment is that the second option is going
> to give us lower overhead swapping and a much fairer balance between
> private data and the page cache. However, there are other
> arguments; in particular, it is much easier to do RSS quotas and
> guarantees if we are swapping on a per page-table basis.
RSS quotas and guarantees are quite easy. The guarantee
can be looked up when we do the swapout scan, and the RSS
limit can be enforced at allocation time (once we have an
inactive list, we can simply enforce a maximum number of
active pages per-process. Normally, no extra thrashing
will occur, because those pages will be reactivated soon).
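A minimal sketch of that allocation-time check; every name here is
hypothetical, as 2.1.89 has neither a per-mm RSS limit nor an inactive list
yet:

	/* Hypothetical: called on the fault path after a new page is mapped in.
	 * If the mm is over its RSS limit, demote one of its own active pages
	 * to the inactive list rather than letting the RSS keep growing. */
	static void enforce_rss_limit(struct mm_struct *mm, unsigned long rss_limit)
	{
		while (mm->rss > rss_limit) {
			struct page *victim = find_active_page_of(mm);	/* hypothetical */

			if (!victim)
				break;			/* nothing of ours left to demote */
			deactivate_page(victim);	/* hypothetical: active -> inactive */
			mm->rss--;
		}
	}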
> If we do make this change, then some of the new swap cache code
> becomes redundant. The next question is, how do we cache swap pages
> in this new scenario? We still need a swap cache mechanism, both to
Personally, I would use NRU paging on the active pages, and
LRU paging on the inactive list. That is, if a page hasn't
been touched since the last time, it's deactivated, and once
on the inactive list, it'll be deallocated in a LRU fashion
(unless we reactivate it or we're short on big-order pages).
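Roughly, per scan pass and using the queue names from Ben's proposed mm.h
(the move_page_to_queue() helper is invented here):

	/* NRU on the active queue: a page untouched since the previous scan is
	 * demoted; reclaim then frees from the tail of the inactive queue in
	 * oldest-first (LRU) order unless the page is referenced again. */
	static void age_page(struct page *page)
	{
		if (test_and_clear_bit(PG_referenced, &page->flags))
			move_page_to_queue(page, PgQ_Active);	/* touched: (re)activate */
		else
			move_page_to_queue(page, PgQ_Inactive);	/* untouched: demote */
	}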
> support proper readahead for swap and to allow us to defer the freeing
> of swapped out pages until the last possible moment. Now, if we know
> we can atomically remove a page from all ptes at once, then it is
> probably sufficient to unhook the struct page from the (inode-vma,
> offset) hash and rehash it as (swapper-inode, entry). It gets harder
> if we want per-process control of RSS, since in that case we want to
> use the same physical page for both vma and swap, and we will really
> need a quick lookup from both indexes.
>
> I'm not at all averse to growing the struct page if we absolutely have
> to, however. If we can effectively keep several hundred more pages in
> memory as a result of not having to throw data away just to cope with
> the worst-case interrupt memory requirements, then we will still see a
> great overall performance improvement.
And better aging / minimum RSS / maximum RSS will help
quite a lot too. I'll be working on a swapper daemon RSN;
once the free_memory_available() thingy is sorted, I'll start
on the page_struct and the scheduler (to add sleep_time
registration) to do the support work.
Rik.
+-----------------------------+------------------------------+
| For Linux mm-patches, go to | "I'm busy managing memory.." |
| my homepage (via LinuxHQ). | H.H.vanRiel@fys.ruu.nl |
| ...submissions welcome... | http://www.fys.ruu.nl/~riel/ |
+-----------------------------+------------------------------+
* Re: reverse pte lookups and anonymous private mappings; avl trees?
1998-03-02 23:03 ` Stephen C. Tweedie
1998-03-02 23:37 ` Rik van Riel
@ 1998-03-03 6:29 ` Benjamin C.R. LaHaise
1998-03-03 23:49 ` Stephen C. Tweedie
1998-03-04 21:26 ` Stephen C. Tweedie
1 sibling, 2 replies; 9+ messages in thread
From: Benjamin C.R. LaHaise @ 1998-03-03 6:29 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: linux-mm, torvalds
Hello Stephen!
On Mon, 2 Mar 1998, Stephen C. Tweedie wrote:
> Hi Ben,
>
> On Mon, 2 Mar 1998 14:04:01 -0500 (EST), "Benjamin C.R. LaHaise"
> <blah@kvack.org> said:
>
> > Okay, I've been ripping my hair out this weekend trying to get my
> > reverse pte lookups working using the inode/vma walking scheme, and
> > I'm about to go mad because of it.
Monday's brought some enlightenment now.
> I'm not in the slightest bit surprised. I've been giving it a lot of
> thought, and it just gets uglier and uglier. :)
The more I've thought about it this evening, the more I'm liking the type
(a) approach (aka the vma->vm_private list)... It was only as I wrote the
original message that I came up with the idea (as inodes were all that came
to mind over the weekend), so it was a bit fuzzy at the time of writing.
> The problem --- of course --- is that the inheritance relationship of
> private pages is a tree. If we have a process which forks off several
> children then in any given privately mapped vma, some pages can be
> shared between all children. Other children may have their own
> private copies of other pages. If _those_ children fork, we get a
> subtree amongst which certain pages are shared only amongst that
> subtree; other pages in the leaf processes are not shared anywhere at
> all; and still other pages are shared by the whole tree.
vm_next/prev_private (aka vm_private list)
  vma  ->  vma  ->  vma  ->  vma
   |        |        |        |
  ...      ...      ...      ...
   |        |        |        |
  pte      pte      pte      pte
   |       /          \       /
    page                page
   |->vma == vma1      |->vma == vma3
(Numbering the vmas 1 to 4, left to right.) The values for page->vma are
arbitrary, and really only need to be one of the vmas on the (doubly
linked) vm_private list. Somewhere below in this message is the code for
finding a single pte, trivially extended to all cases as the early January
patch/idea suggested. It basically creates the same thing that exists for
inode-backed pages: a list to traverse that contains all *possible*
mappings to the page. As with the i_mmap list, not all mappings point to
the same page, but they can be found. One thing this would allow us to do
is to pre-fill all pointers on swapin. Not necessarily a good idea, but
possible; swap_map could be replaced by a bitmap if we pre-filled (which
makes sense, as the sharing of swapped pages is rare).
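A sketch of that walk with the proposed fields (helper name invented,
locking omitted); it is just the loop form of the single-pte lookup quoted
further down:

	/* Visit every pte that might map an anonymous page by walking the
	 * circular vm_private list, starting from the arbitrary page->vma.
	 * Vmas holding their own COW copy simply fail the comparison. */
	static void for_each_private_mapping(struct page *page)
	{
		struct vm_area_struct *vma = page->vma;

		do {
			unsigned long addr = page->vm_offset - vma->vm_offset
						+ vma->vm_start;
			pte_t *pte = pte_offset(pmd_offset(pgd_offset(vma->vm_mm,
						addr), addr), addr);

			if (pte_present(*pte) &&
			    MAP_NR(pte_page(*pte)) == page->map_nr)
				;	/* this vma really does map the page */
			vma = vma->vm_next_private;
		} while (vma != page->vma);
	}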
> Ouch. Basically, what it comes down to is that after forking, the vma
> is not really a good unit by which to account for individual pages.
This is where the type (a) approach really shines. At fork time, each
non-shared vma created gets inserted into the vm_next_private list. The
more it's swirled around in my mind, the more doing a get_empty_inode for
each vma on fork() makes me cringe -- we can't slow down fork! ;-)
...snip...
> What about privately mapped images, such as shared libraries? Will
> those vmas will end up with one inode for the backing file and a
> separate inode for the private pages? Such private vmas still have
> anonymous pages in them.
The type 'a' approach can basically be considered to be exactly the same
as what is done now: if the page is still being shared, it's in the page
cache for the inode, hence we can find the pte by means of the i_mmap
list. If it's anonymous, then we can find each and every pte via the
vm_private list (the vma of the page will get set upon its creation - COW
or swapin) by walking vm_(prev|next)_private in either direction, skipping
over the ptes which do *not* point to the page -- exactly the same as we'd
do for shared mappings.
> > Hence we end up with something like:
>
> > page--> vma -> vm_next_private -> ... (circular list)
> >          |                |
> >        vm_mm            vm_mm
> >          |                ...
> >         ...    (use page->vm_offset - vma->vm_offset + vma->vm_start)
> >          |
> >         pte
>
> > This, I think, is the cleanest approach.
>
> What happens to these vmas and inodes on fork()? On mremap()?
On mremap(), nothing really happens -- unless the area is shrunk, in which
case any vma that disappears is removed from its vm_private list. Note
that we will need to pass the vma being zapped down to zap_page_range,
since if that is the vma to which the page belongs, the page->vma pointer
will need to be set to one of the other members of the private list
(unless the page isn't used by any other vma, in which case it can be
thrown onto the discard list if it's page-cached and still referenced by
someone else, or the free list if it's not used by anyone). This applies
to unmapping as well.
This all translates into making the merging of vm_area_structs much rarer.
If we want to keep being able to merge vm_area_structs during the
single-mapping (normal) case, then life would get hairy (basically it
could only work when the vma is the sole member of its vm_private list,
and any pages present would require their vma->vm_offset be adjusted).
One of the things that I didn't describe earlier is that page->vm_offset
is relative to vma->vm_offset (same as page->offset, that is pte vaddr =
page->vm_offset - vma->vm_offset + vma->vm_start). If the vma is shrunk,
vma->vm_offset is increased as it would be for shared pages.
In order to maintain the mergability (wonderful word ;) of the 'usual
case' (eg sbrk, or extending the size of a mapping), the vma->vm_offset
value will need to be set to the virtual address of the mapping. That way
everything lines up nicely.
> > It makes life easy for finding shared anonymous pages,
>
> Shared anonymous pages are a very special case, and they are much much
> much easier than the general case. With shared anonymous vmas, _all_
> pages in the entire vma are guaranteed to be shared; there is no
> mixing of shared and private pages as there is with a private map
> (anonymous or not). My original plan with the swap cache was to
> introduce a basis inode for vmas only for this special case; all other
> cases are dealt with adequately by the swap cache scheme (not counting
> the performance issues of being able to swap out arbitrary pages).
Thankfully this scheme coexists with the new swap cache. More important
than shared anonymous pages is the case of a single, non-shared anonymous
page:
unsigned long addr = page->vm_offset - page->vma->vm_offset + page->vma->vm_start;
pte_t *pte_p = pte_offset(pmd_offset(pgd_offset(page->vma->vm_mm, addr), addr), addr);
> > b) per-vma inodes. This scheme is a headache. It would involve
> > linking the inodes in order to maintain a chain of the anonymous
> > pages.
...
> This is getting much closer to the FreeBSD scheme, which has similar
> stacks of inodes for each anonymous vma. It is a bit cumbersome, but
> ought to work pretty well.
Cumbersome indeed - I don't want to have to throw out your new swap cache
for this one, which looks like a *huge* loser on fork(); typical
processes on my machine have 10-12 (normal) or 30-32 (using the PAM
libraries, or X) vmas apiece. Almost all of them are private writable
mappings, and get_empty_inode is not the fastest code in the kernel. Add
to that the fact that people running 512+ processes now need an additional
5000->15000 inodes and things begin to look pretty bad.
> > c) per-mm inodes, shared on fork. Again, this one is a bit
> > painful, although less so than b. This one requires that we allow
> > multiple pages in the page cache to exist with the same offset (ugly).
>
> Very. :)
>
> > Each anonymous page gets the mm_struct's inode attached to it, with
> > the virtual address within the process' memory space being the
> > offset. The ptes are found in the same manner as for normal shared
> > pages (inode->i_mmap->vmas). Aliased page-cache entries are created
> > on COW and mremap().
>
> > My thoughts are that the only real options are (a) and (c). (a) seems to
> > be the cleanest conceptually, while (c) has very little overhead, but,
> > again, conflicts with the new swap cache...
>
> Yes. The conclusion I came to a while ago is that (c) is the only way
> to go if we do this.
I hope I've cleared up my explanation of (a) enough that it is beginning
to look viable. It seems to be what Linus had in mind when the
discussions came up a while ago (he suggested adding a pointer to the
vma in struct page - the offset and vm_private list follow naturally).
> This leaves us with one main question --- how do we select pages for
> swapout? We either do it via a scan of the page tables, as currently
> happens in vmscan, or with a scan of physical pages as shrink_mmap()
> does. The changes described here are only useful if we take the
> second choice.
> My personal feeling at the moment is that the second option is going
> to give us lower overhead swapping and a much fairer balance between
> private data and the page cache. However, there are other
> arguments; in particular, it is much easier to do RSS quotas and
> guarantees if we are swapping on a per page-table basis.
Ummm, not really. I think that because of the mm scheme chosen to support
clone, these quotas are really only applicable on a per-mm basis.
Sticking a quota on a process, only to have it exceeded by its clone()
twin doesn't make terribly much sense to me. In the pte_list patch,
rather than kill a single task when a fault occurred during swapin/out, I
walked the task table and killed all which shared the mm.
...snipped concerns over (c) - methinks to avoid (c) until (a) is
undoable...
> I'm not at all averse to growing the struct page if we absolutely have
> to, however. If we can effectively keep several hundred more pages in
> memory as a result of not having to throw data away just to cope with
> the worst-case interrupt memory requirements, then we will still see a
> great overall performance improvement.
That's what I thought. Also, the people who are starting to run machines
with huge amounts of memory (512megs+) will benefit greatly. What do you
think about having all the page stealing done through get_free_page? I'd
like to see get_free_pages (non-atomic) block until enough pages have been
reclaimed to the dumpable list; that way we can avoid the system ever
deadlocking on swap. But the network stack would need the patches Pavel
wrote to drop packets when low on memory -- don't drop packets destined
for the swapping socket, but all others are game. Wow, that would make
Linux much more stable indeed!
Oh, below are my current changes to mm.h to give you a hint of where I'm
going. Note that despite the cvs directory name, it is against 2.1.89-5.
I'm actually proposing the use of 6 possible queues for pages to be on,
mostly so that we can keep track of and grab what we want when we want it;
this could be pruned if it matters.
-ben
Index: ../include/linux/mm.h
===================================================================
RCS file: /home/dot1/blah/cvs/linux-2.1.86-mm/include/linux/mm.h,v
retrieving revision 1.1.1.3
diff -u -u -r1.1.1.3 mm.h
--- mm.h 1998/03/03 03:23:20 1.1.1.3
+++ mm.h 1998/03/03 04:47:59
@@ -46,6 +46,11 @@
struct vm_area_struct *vm_next_share;
struct vm_area_struct **vm_pprev_share;
+ /* Mappings that are private, yet shared. Derived from fork()ing.
+ */
+ struct vm_area_struct *vm_next_private;
+ struct vm_area_struct *vm_prev_private;
+
struct vm_operations_struct * vm_ops;
unsigned long vm_offset;
struct file * vm_file;
@@ -100,6 +105,7 @@
unsigned long page);
int (*swapout)(struct vm_area_struct *, unsigned long, pte_t *);
pte_t (*swapin)(struct vm_area_struct *, unsigned long, unsigned long);
+ int (*unuse)(struct vm_area_struct *, struct page *, pte_t *); /* not finalized */
};
/*
@@ -116,21 +122,48 @@
struct page *prev;
struct inode *inode;
unsigned long offset;
+
struct page *next_hash;
atomic_t count;
unsigned int age;
unsigned long flags; /* atomic flags, some possibly updated asynchronously */
+
struct wait_queue *wait;
struct page **pprev_hash;
struct buffer_head * buffers;
unsigned long map_nr; /* page->map_nr == page - mem_map */
+
+ /* used for private mappings -> inode == NULL or &swapper_inode */
+ struct vm_area_struct *vma;
+ unsigned long vma_offset;
+
+ /* page on one of the circular page_queues */
+ struct page *pgq_next;
+ struct page *pgq_prev;
} mem_map_t;
+/* uses a dummy struct page so we get next & prev for beginning/end of lists */
+extern struct page page_queues[];
+extern atomic_t page_queues_cnt[];
+
+#define PgQ_Locked 0 /* page is unswappable - mlock()'d */
+#define PgQ_Active 1 /* page is mapped and active -> young */
+#define PgQ_Inactive 2 /* page is mapped, but hasn't been referenced recently -> old */
+#define PgQ_Swappable 3 /* page has no mappings, is dirty */
+#define PgQ_Swapping 4 /* page is being swapped */
+#define PgQ_Dumpable 5 /* page has no mappings, is not dirty, but is still in the page cache */
+
+#define NR_PAGE_QUEUE (PgQ_Dumpable+1)
+
+/* The low 3 bits of page->flag have been snarfed to index into page_queues */
+#define PGmask_pgq 0x7
+
/* Page flag bit values */
-#define PG_locked 0
-#define PG_error 1
-#define PG_referenced 2
-#define PG_uptodate 3
+#define PG_on_queue 3
+#define PG_locked 10
+#define PG_error 11
+#define PG_referenced 12
+#define PG_uptodate 13
#define PG_free_after 4
#define PG_decr_after 5
/* Unused 6 */
@@ -140,6 +173,7 @@
#define PG_reserved 31
/* Make it prettier to test the above... */
+#define PageOnQueue(page) (test_bit(PG_on_queue, &(page)->flags))
#define PageLocked(page) (test_bit(PG_locked, &(page)->flags))
#define PageError(page) (test_bit(PG_error, &(page)->flags))
#define PageReferenced(page) (test_bit(PG_referenced, &(page)->flags))
* Re: reverse pte lookups and anonymous private mappings; avl trees?
1998-03-03 6:29 ` Benjamin C.R. LaHaise
@ 1998-03-03 23:49 ` Stephen C. Tweedie
1998-03-04 21:26 ` Stephen C. Tweedie
1 sibling, 0 replies; 9+ messages in thread
From: Stephen C. Tweedie @ 1998-03-03 23:49 UTC (permalink / raw)
To: Benjamin C.R. LaHaise; +Cc: Stephen C. Tweedie, linux-mm, torvalds
Hi again,
On Tue, 3 Mar 1998 01:29:41 -0500 (EST), "Benjamin C.R. LaHaise"
<blah@kvack.org> said:
> Monday's brought some enlightenment now.
> vm_next/prev_private (aka vm_private list)
>   vma  ->  vma  ->  vma  ->  vma
>    |        |        |        |
>   ...      ...      ...      ...
>    |        |        |        |
>   pte      pte      pte      pte
>    |       /          \       /
>     page                page
>    |->vma == vma1      |->vma == vma3
> (Numbering the vmas 1 to 4, left to right.) The values for page->vma are
> arbitrary, and really only need to be one of the vmas on the (doubly linked)
> vm_private list. Somewhere below in this message is the code for finding
> a single pte, trivially extended for all cases as the early January
> patch/idea suggested. It basically creates the same thing that exists for
> inode backed pages - a list to traverse that contains all *possible*
> mappings to the page.
> This is where the type (a) approach really shines. At fork time, each
> non-shared vma created gets inserted into the vm_next_private list. The
> more it's swirled around in my mind, the more doing a get_empty_inode for
> each vma on fork() makes me cringe -- we can't slow down fork! ;-)
A different but largely equivalent approach occurred to me a while ago.
On exec(), create a new "mm_family" context, and link the mm_struct to
it. On fork, add the mm_struct to the original mm_family. That gives
us a quick way to find every mm_struct (and, by implication, every vma)
which can share any given page.
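A rough sketch of that anchor; the structure and every field and helper
name here are hypothetical, and error handling is omitted:

	/* One mm_family per exec()ed image; every fork()ed mm joins its
	 * parent's family, so all mms that could share a private page hang
	 * off a single anchor. */
	struct mm_family {
		struct mm_struct *members;	/* chained through mm->family_next */
		int count;
	};

	static void mm_family_new(struct mm_struct *mm)		/* on exec() */
	{
		struct mm_family *fam = kmalloc(sizeof(*fam), GFP_KERNEL);

		fam->members = mm;
		fam->count = 1;
		mm->family = fam;
		mm->family_next = NULL;
	}

	static void mm_family_join(struct mm_struct *child,	/* on fork() */
				   struct mm_struct *parent)
	{
		child->family = parent->family;
		child->family_next = parent->family->members;
		parent->family->members = child;
		parent->family->count++;
	}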
>> What about privately mapped images, such as shared libraries? Will
>> those vmas end up with one inode for the backing file and a
>> separate inode for the private pages? Such private vmas still have
>> anonymous pages in them.
> The type 'a' approach can basically be considered to be exactly the
> same as what is done now: if the page is still being shared, it's in
> the page cache for the inode, hence we can find the pte by means of
> the i_mmap list. If it's anonymous, then we can find each and every
> pte via the vm_private list (the vma of the page will get set upon its
> creation - COW or swapin) by walking vm_(prev|next)_private in either
> direction, skipping over the ptes which do *not* point to the page --
> exactly the same as we'd do for shared mappings.
OK. Next point --- what about RSS limits and quotas? The big problem
in this case is that we may want an anonymous page to be present in some
page tables but not in others. The reason that this is such a problem
is that in this case, the page is marked by swap entry in some ptes, and
by a present pte in others. The way we deal with that is the swap
cache, which implies that either we disallow this situation, we deal
with it by some mechanism other than the swap cache, or we cannot use
the page->inode and ->offset fields for the vma indexing.
That's not necessarily a bad thing --- we *don't* need the full
functionality of the page cache for the vma linkage. In particular, we
won't ever want to look a page up by its (vma, offset) name --- that's what
page tables themselves are for. All we need is to get the (vma, offset)
from the struct page, so we don't necessarily need all of the page hash
linkages to support this.
Also, any page which has been swapped in readonly does still need a swap
cache entry if we are to avoid freeing the space on disk.
Note that the swap cache functionality _does_ need the page hash
linkages, since we do need to find the physical page by its
(swapper-inode, entry) name. That implies that if we combine these two
mechanisms, we need to keep the swap cache as is and add separate
page->vma and page->vm_offset fields independent of the page->inode and
page->offset fields.
> On mremap(), nothing really happens -- unless the area is shrunk, in
> which case any vma that disappears is removed from its vm_private
> list.
Hmm. How do we find all of the vmas which reference a specific physical
page if the virtual address of that page is not a constant? Consider a
forked process having a page range mremap()ed in a child --- we still
have only the one page, but it is now at a different VA in each process.
We need to record the _original_ starting virtual address in each
private vma to cope with this --- in other words, when we remap(), we
cannot afford to change the offset we are using to label the page, since
other forked children will continue to use the original offset. Note
also that we may have many different vmas with the same starting VA even
in a single process --- we can allocate private pages, remap them, then
reallocate the original address range again. That gives two distinct
vmas with the same original starting VA.
> One of the things that I didn't describe earlier is that page->vm_offset
> is relative to vma->vm_offset (same as page->offset, that is pte vaddr =
> page->vm_offset - vma->vm_offset + vma->vm_start). If the vma is shrunk,
> vma->vm_offset is increased as it would be for shared pages.
Hmm, that makes a difference --- keeping vma->vm_offset the same over an
mremap, and initialising vm_offset to the VA on a new mmap and
page->vm_offset to (VA - vma->vm_start + vma->vm_offset) on a new page
fault, will sort out a lot of the above problems.
>> Yes. The conclusion I came to a while ago is that (c) is the only way
>> to go if we do this.
> I hope I've cleared up my explanation of (a) enough that it is
> beginning to look viable. It seems to be what Linus had in mind when
> the discussions came up a while ago (he suggested adding a pointer to
> the vma in struct page - the offset and vm_private list follow
> naturally).
Actually, what you describe as (a) above is closer to what I had in
mind for (c) --- a single anchor point (the private vma list in this
case) which is shared between all forked processes. This is definitely
workable.
>> This leaves us with one main question --- how do we select pages for
>> swapout? We either do it via a scan of the page tables, as currently
>> happens in vmscan, or with a scan of physical pages as shrink_mmap()
>> does. The changes described here are only useful if we take the
>> second choice.
>> My personal feeling at the moment is that the second option is going
>> to give us lower overhead swapping and a much fairer balance between
>> private data and the page cache. However, there are other
>> arguments; in particular, it is much easier to do RSS quotas and
>> guarantees if we are swapping on a per page-table basis.
> Ummm, not really. I think that because of the mm scheme chosen to
> support clone, these quotas are really only applicable on a per-mm
> basis.
Yes, absolutely.
> Sticking a quota on a process, only to have it exceeded by it's
> clone() twin doesn't make terribly much sense to me. In the pte_list
> patch, rather than kill a single task when a fault occurred during
> swapin/out, I walked the task table and killed all which shared the
> mm.
But even imposing an mm_struct RSS limit means that we have to unlink a
bunch of pages by walking ptes, not physical pages. It effectively
gives us two separate swapping mechanisms.
>> I'm not at all averse to growing the struct page if we absolutely
>> have to, however. If we can effectively keep several hundred more
>> pages in memory as a result of not having to throw data away just to
>> cope with the worst-case interrupt memory requirements, then we will
>> still see a great overall performance improvement.
> I'd like to see get_free_pages (non-atomic) block until enough pages
> have been reclaimed to the dumpable list; that way we can avoid the
> system ever deadlocking on swap. But the network stack would need the
> patches Pavel wrote to drop packets when low on memory -- don't drop
> packets destined for the swapping socket, but all others are game.
> Wow, that would make Linux much more stable indeed!
Yes. The other part of this is to allow kswapd to continue looking for
free pages even when there is insufficient memory to write to swap ---
just let it look for non-dirty pages to discard.
> Oh, below are my current changes to mm.h to give you a hint of where
> I'm going. Note that despite the cvs directory name, it is against
> 2.1.89-5. I'm actually proposing the use of 6 possible queues for
> pages to be on, mostly so that we can keep track of and grab what we
> want when we want it; this could be pruned if it matters.
OK, it's the dumpable queue which matters to my own code. I'm nearly
done making the page cache IRQ- and SMP-safe, so page unhook from IRQ
will be possible soon. I'll give you patches when it's working, and I
can also mark out hooks for adding pages to that queue at the
appropriate time.
Cheers,
Stephen.
* Re: reverse pte lookups and anonymous private mappings; avl trees?
1998-03-03 6:29 ` Benjamin C.R. LaHaise
1998-03-03 23:49 ` Stephen C. Tweedie
@ 1998-03-04 21:26 ` Stephen C. Tweedie
1998-03-04 23:20 ` Rik van Riel
1 sibling, 1 reply; 9+ messages in thread
From: Stephen C. Tweedie @ 1998-03-04 21:26 UTC (permalink / raw)
To: Benjamin C.R. LaHaise; +Cc: Stephen C. Tweedie, linux-mm, torvalds
Hi,
Another observation on the subject of new page lists.
You suggested adding:
> + /* used for private mappings -> inode == NULL or &swapper_inode */
> + struct vm_area_struct *vma;
> + unsigned long vma_offset;
> +
> + /* page on one of the circular page_queues */
> + struct page *pgq_next;
> + struct page *pgq_prev;
to the struct page. However, note that the vma and vma_offset fields
are irrelevant to pages which are not mapped (the fields can be
initialised on first map, and a page which is purely in the page or swap
cache is not attached to any particular vma and so doesn't need these
lookups). Detecting such unmapped pages reliably may require separate
unmapped and mapped page counts.
Given this, can we not overload the two new fields and reduce the
expansion of the struct page? The answer is yes, if and only if we
restrict the new page queues to unmapped pages. For my own code, the
only queue which is really necessary is the list of pages ready to be
reclaimed at interrupt time, and those pages will never be mapped. The
other queues you mention:
> +#define PgQ_Locked 0 /* page is unswappable - mlock()'d */
> +#define PgQ_Active 1 /* page is mapped and active -> young */
> +#define PgQ_Inactive 2 /* page is mapped, but hasn't been referenced recently -> old */
> +#define PgQ_Swappable 3 /* page has no mappings, is dirty */
> +#define PgQ_Swapping 4 /* page is being swapped */
> +#define PgQ_Dumpable 5 /* page has no mappings, is not dirty, but is still in the page cache */
don't seem to give us all that much extra, since we probably never want
to go out and explicitly search for all pages on such lists. (That's
assuming that the page aging and swapping scanner is working by walking
pages in physical address order, not by traversing any other lists.)
Other than that, I like this idea more and more. Overloading these two
sets of fields gives us huge extra functionality over the 2.0 vm, and at
the cost of only one extra longword per page.
Cheers,
Stephen.
* Re: reverse pte lookups and anonymous private mappings; avl trees?
1998-03-04 21:26 ` Stephen C. Tweedie
@ 1998-03-04 23:20 ` Rik van Riel
1998-03-05 4:21 ` Benjamin C.R. LaHaise
1998-03-05 22:25 ` Stephen C. Tweedie
0 siblings, 2 replies; 9+ messages in thread
From: Rik van Riel @ 1998-03-04 23:20 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Benjamin C.R. LaHaise, linux-mm, torvalds
On Wed, 4 Mar 1998, Stephen C. Tweedie wrote:
> > +#define PgQ_Locked 0 /* page is unswappable - mlock()'d */
> > +#define PgQ_Active 1 /* page is mapped and active -> young */
> > +#define PgQ_Inactive 2 /* page is mapped, but hasn't been referenced recently -> old */
> > +#define PgQ_Swappable 3 /* page has no mappings, is dirty */
> > +#define PgQ_Swapping 4 /* page is being swapped */
> > +#define PgQ_Dumpable 5 /* page has no mappings, is not dirty, but is still in the page cache */
>
> don't seem to give us all that much extra, since we probably never want
> to go out and explicitly search for all pages on such lists. (That's
> assuming that the page aging and swapping scanner is working by walking
> pages in physical address order, not by traversing any other lists.)
We just might want to do that. If we can _guarantee_
a certain number of free+(inactive&clean) pages, we
can keep the number of free pages lower, and we can
keep more pages longer in memory, giving more speed
to the overall system.
Rik.
+-----------------------------+------------------------------+
| For Linux mm-patches, go to | "I'm busy managing memory.." |
| my homepage (via LinuxHQ). | H.H.vanRiel@fys.ruu.nl |
| ...submissions welcome... | http://www.fys.ruu.nl/~riel/ |
+-----------------------------+------------------------------+
* Re: reverse pte lookups and anonymous private mappings; avl trees?
1998-03-04 23:20 ` Rik van Riel
@ 1998-03-05 4:21 ` Benjamin C.R. LaHaise
1998-03-05 22:25 ` Stephen C. Tweedie
1 sibling, 0 replies; 9+ messages in thread
From: Benjamin C.R. LaHaise @ 1998-03-05 4:21 UTC (permalink / raw)
To: Rik van Riel; +Cc: Stephen C. Tweedie, linux-mm, torvalds
On Thu, 5 Mar 1998, Rik van Riel wrote:
> On Wed, 4 Mar 1998, Stephen C. Tweedie wrote:
>
> > > +#define PgQ_Locked 0 /* page is unswappable - mlock()'d */
> > > +#define PgQ_Active 1 /* page is mapped and active -> young */
> > > +#define PgQ_Inactive 2 /* page is mapped, but hasn't been referenced recently -> old */
> > > +#define PgQ_Swappable 3 /* page has no mappings, is dirty */
> > > +#define PgQ_Swapping 4 /* page is being swapped */
> > > +#define PgQ_Dumpable 5 /* page has no mappings, is not dirty, but is still in the page cache */
> >
> > don't seem to give us all that much extra, since we probably never want
> > to go out and explicitly search for all pages on such lists. (That's
> > assuming that the page aging and swapping scanner is working by walking
> > pages in physical address order, not by traversing any other lists.)
>
> We just might want to do that. If we can _guarantee_
> a certain number of free+(inactive&clean) pages, we
> can keep the number of free pages lower, and we can
> keep more pages longer in memory, giving more speed
> to the overall system.
That's 'xactly what I had in mind (there's an extern atomic_t
page_queues_cnt[]; in my proposed mm.h ;-). The other aspect of the
queues is to replace the page->age scheme.
Not only that, but I've just realized how we can get the queues for free
with some hackery...
In another message, Stephen C. Tweedie wrote:
> Given this, can we not overload the two new fields and reduce the
> expansion of the struct page? The answer is yes, if and only if we
> restrict the new page queues to unmapped pages. For my own code, the
> only queue which is really necessary is the list of pages ready to be
> reclaimed at interrupt time, and those pages will never be mapped.
You're absolutely right. And by going all the way, we can even get the
queues in place, and still have a struct page that's the same size as in 2.0.
struct page {
	union {
		struct {
			struct page *next;
			struct page *prev;
		} normal;
		struct {
			struct vm_area_struct *vma;
			unsigned long vm_offset;
		} private;
	} u;
	struct inode *inode;
	unsigned long offset;
	struct page *next_hash;
	atomic_t count;
	unsigned long flags;
	struct wait_queue *wait;
	struct page **pprev_hash;
	struct buffer_head *buffers;
	struct page *pgq_next;
	struct page *pgq_prev;
};
What happened? Well, both age and map_nr are gone. With struct page
being 48 bytes on 32 bit machines, defining map_nr as:
((unsigned long)page - (unsigned long)mem_map) / sizeof(struct page)
is sufficiently cheap. I checked with egcs 1.0 and it generates awful
code for map_nr = page - mem_map (why?), whereas the above is fine (so
defining a map_nr(page) macro/inline should be okay). Even if we keep
map_nr, it's still about the same size as in 2.0 (52 bytes).
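Spelled out as a macro (the name map_nr(page) is the one suggested above;
the byte-difference form is the one egcs compiled well):

	/* Replacement for the removed page->map_nr field: derive a page's
	 * index from its position in the mem_map array. */
	#define map_nr(page) \
		(((unsigned long)(page) - (unsigned long)mem_map) / sizeof(struct page))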
Anyhoo, overlapping vma/vm_offset with next/prev works nicely, as next/prev
are only used for the per-inode page list (to discard pages in
invalidate_inode_pages) if the page is shared, or for the page's position
in the free list. If the page belongs to the swapper inode,
invalidate_inode_pages makes no sense, and the page certainly isn't on the
free list. This looks to be cleaner than the suggestion of overlapping the
page-queue pointers.
Another issue that Stephen brought to my attention, RSS limits, seems to
have a reasonable approach: when the RSS limit is lowered/exceeded, walk
the inactive/active lists looking for pages that are used by the process.
For the normal case (a private page used only by the mm in question), the
test is a simple check whether page->u.private.vma->vm_mm == mm. If the
page is shared, we have to do the expensive walk-lots check until we've
dropped the RSS of the process sufficiently. However, we should be able
to avoid that most of the time by having an amount of 'slack' for the RSS,
which will allow the normal movement of pages from
active->inactive->swappable to reduce the process' RSS. If it really is
an issue, we can always walk the page tables looking for inactive pages
to toss.
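A sketch of that trim against the proposed page queues (deactivate_page()
and the per-mm limit are hypothetical, and the expensive shared-page walk
is left out):

	/* Walk the active queue and demote pages whose only user is this mm
	 * until its RSS drops back to the target. */
	static void trim_rss(struct mm_struct *mm, unsigned long target)
	{
		struct page *page = page_queues[PgQ_Active].pgq_next;

		while (page != &page_queues[PgQ_Active] && mm->rss > target) {
			struct page *next = page->pgq_next;

			if (atomic_read(&page->count) == 1 &&	/* crude single-user test */
			    page->u.private.vma->vm_mm == mm) {
				deactivate_page(page);		/* hypothetical */
				mm->rss--;
			}
			page = next;
		}
	}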
Oh, it's almost working... =)
-ben
* Re: reverse pte lookups and anonymous private mappings; avl trees?
1998-03-04 23:20 ` Rik van Riel
1998-03-05 4:21 ` Benjamin C.R. LaHaise
@ 1998-03-05 22:25 ` Stephen C. Tweedie
1 sibling, 0 replies; 9+ messages in thread
From: Stephen C. Tweedie @ 1998-03-05 22:25 UTC (permalink / raw)
To: Rik van Riel
Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, linux-mm, torvalds
Hi,
On Thu, 5 Mar 1998 00:20:39 +0100 (MET), Rik van Riel
<H.H.vanRiel@fys.ruu.nl> said:
> On Wed, 4 Mar 1998, Stephen C. Tweedie wrote:
>> don't seem to give us all that much extra, since we probably never
>> want to go out and explicitly search for all pages on such lists.
>> (That's assuming that the page aging and swapping scanner is
>> working by walking pages in physical address order, not by
>> traversing any other lists.)
> We just might want to do that. If we can _guarantee_ a certain
> number of free+(inactive&clean) pages, we can keep the number of
> free pages lower, and we can keep more pages longer in memory,
> giving more speed to the overall system.
I know --- that's precisely the point I was trying to make! The
"inactive plus clean" pages are exactly the pages we need to keep on a
queue for rapid reclamation, and these pages are guaranteed not to be
mapped. So, we can overload the queue links with the vma,vm_offset
values (which are only needed for mapped pages, which cannot be so
rapidly reclaimed).
We only have to keep the two structures separate if we ever need to
locate lists of mapped pages, and I was only planning to use unmapped
pages for the fast-reclaim lists (I don't like the idea of twiddling
process page tables from inside interrupts!).
Cheers,
Stephen
Thread overview: 9+ messages
1998-03-02 19:04 reverse pte lookups and anonymous private mappings; avl trees? Benjamin C.R. LaHaise
1998-03-02 23:03 ` Stephen C. Tweedie
1998-03-02 23:37 ` Rik van Riel
1998-03-03 6:29 ` Benjamin C.R. LaHaise
1998-03-03 23:49 ` Stephen C. Tweedie
1998-03-04 21:26 ` Stephen C. Tweedie
1998-03-04 23:20 ` Rik van Riel
1998-03-05 4:21 ` Benjamin C.R. LaHaise
1998-03-05 22:25 ` Stephen C. Tweedie