linux-mm.kvack.org archive mirror
* RE: broken VM in 2.4.10-pre9
@ 2001-09-19 22:15 Rob Fuller
  2001-09-19 22:21 ` David S. Miller
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Rob Fuller @ 2001-09-19 22:15 UTC (permalink / raw)
  To: David S. Miller, ebiederm; +Cc: alan, phillips, linux-kernel, linux-mm

In my one contribution to this thread I wrote:

"One argument for reverse mappings is distributed shared memory or
distributed file systems and their interaction with memory mapped files.
For example, a distributed file system may need to invalidate a specific
page of a file that may be mapped multiple times on a node."

I believe reverse mappings are an essential feature for memory mapped
files in order for Linux to support sophisticated distributed file
systems or distributed shared memory.  In general, this memory is NOT
anonymous.  As such, it should not affect the performance of a
fork/exec/exit.

I suppose I confused the issue when I offered a supporting argument for
reverse mappings.  It's not reverse mappings for anonymous pages I'm
advocating, but reverse mappings for mapped file data.

> -----Original Message-----
> From: David S. Miller [mailto:davem@redhat.com]
> Sent: Wednesday, September 19, 2001 4:56 PM
> To: ebiederm@xmission.com
> Cc: alan@lxorguk.ukuu.org.uk; phillips@bonn-fries.net; Rob Fuller;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: broken VM in 2.4.10-pre9
>
>
>    From: ebiederm@xmission.com (Eric W. Biederman)
>    Date: 19 Sep 2001 15:37:26 -0600
>
>    That I think is a significant cost.
>
> My own personal feeling, after having tried to implement a much
> lighter weight scheme involving "anon areas", is that reverse maps or
> something similar should be looked at as a last-ditch effort.
>
> We are tons faster than anyone else in fork/exec/exit precisely
> because we keep track of so little state for anonymous pages.
>
> Later,
> David S. Miller
> davem@redhat.com
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/


* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 broken VM in 2.4.10-pre9 Rob Fuller
@ 2001-09-19 22:21 ` David S. Miller
  2001-09-19 22:30 ` Alan Cox
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: David S. Miller @ 2001-09-19 22:21 UTC (permalink / raw)
  To: rfuller; +Cc: ebiederm, alan, phillips, linux-kernel, linux-mm

   I suppose I confused the issue when I offered a supporting argument for
   reverse mappings.  It's not reverse mappings for anonymous pages I'm
   advocating, but reverse mappings for mapped file data.

We already have reverse mappings for files, via the VMA chain off the
inode.

Later,
David S. Miller
davem@redhat.com


* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 broken VM in 2.4.10-pre9 Rob Fuller
  2001-09-19 22:21 ` David S. Miller
@ 2001-09-19 22:30 ` Alan Cox
  2001-09-19 22:48 ` Eric W. Biederman
  2001-09-19 22:51 ` Bryan O'Sullivan
  3 siblings, 0 replies; 38+ messages in thread
From: Alan Cox @ 2001-09-19 22:30 UTC (permalink / raw)
  To: Rob Fuller
  Cc: David S. Miller, ebiederm, alan, phillips, linux-kernel, linux-mm

> "One argument for reverse mappings is distributed shared memory or
> distributed file systems and their interaction with memory mapped files.
> For example, a distributed file system may need to invalidate a specific
> page of a file that may be mapped multiple times on a node."

Wouldn't it be better for the file system itself to be doing that work? Also,
do real-world file systems that actually perform usably do this, or do they
just zap the cached mappings like OpenGFS does?

Alan


* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 broken VM in 2.4.10-pre9 Rob Fuller
  2001-09-19 22:21 ` David S. Miller
  2001-09-19 22:30 ` Alan Cox
@ 2001-09-19 22:48 ` Eric W. Biederman
  2001-09-19 22:51 ` Bryan O'Sullivan
  3 siblings, 0 replies; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-19 22:48 UTC (permalink / raw)
  To: Rob Fuller; +Cc: David S. Miller, alan, phillips, linux-kernel, linux-mm

"Rob Fuller" <rfuller@nsisoftware.com> writes:

> In my one contribution to this thread I wrote:
>
> "One argument for reverse mappings is distributed shared memory or
> distributed file systems and their interaction with memory mapped files.
> For example, a distributed file system may need to invalidate a specific
> page of a file that may be mapped multiple times on a node."
>
> I believe reverse mappings are an essential feature for memory mapped
> files in order for Linux to support sophisticated distributed file
> systems or distributed shared memory.  In general, this memory is NOT
> anonymous.  As such, it should not affect the performance of a
> fork/exec/exit.
>
> I suppose I confused the issue when I offered a supporting argument for
> reverse mappings.  It's not reverse mappings for anonymous pages I'm
> advocating, but reverse mappings for mapped file data.

The reverse mapping issue is not whether we have a way to find where in the
page tables a page is mapped, but whether we keep track of it in a data
structure that allows us to do so extremely quickly.  The worst case for our
current data structures to unmap one page is O(page mappings).

For distributed filesystems contention sucks.  No matter how you play it
contention for file data will never be a fast case.  Not if you have
very many people contending for the data.  So this isn't a fast case.

Additionally, our current data structures are optimized for unmapping
page ranges.  If your contention case is sane you will be grabbing
more than 4k at a time, so looping through the vm_areas of a mapping
once should be more efficient than doing that loop once for each page
that needs to be unmapped.
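
The range-versus-page cost can be sketched with a toy model in Python. This
is not kernel code: `Vma`, `unmap_range` and `unmap_page_by_page` are names
invented for this illustration, and the only thing taken from the argument
above is that unmapping file pages means walking the vm_areas of a mapping.

```python
from dataclasses import dataclass


@dataclass
class Vma:
    """Hypothetical stand-in for a vm_area covering file pages [first, last]."""
    first: int  # first file page offset covered
    last: int   # last file page offset covered (inclusive)


def unmap_range(vmas, first, last):
    """Walk the VMA list once; return (vmas_visited, pages_unmapped)."""
    visited = 0
    unmapped = 0
    for vma in vmas:
        visited += 1
        lo = max(vma.first, first)
        hi = min(vma.last, last)
        if lo <= hi:  # this VMA overlaps the requested range
            unmapped += hi - lo + 1
    return visited, unmapped


def unmap_page_by_page(vmas, first, last):
    """Repeat the whole VMA walk for every single page in the range."""
    visited = 0
    unmapped = 0
    for page in range(first, last + 1):
        v, u = unmap_range(vmas, page, page)
        visited += v
        unmapped += u
    return visited, unmapped


vmas = [Vma(0, 15), Vma(8, 31), Vma(100, 120)]
range_cost = unmap_range(vmas, 8, 15)
page_cost = unmap_page_by_page(vmas, 8, 15)
```

Both calls unmap the same 16 page mappings, but the per-page variant visits
the VMA list once per page (24 visits here) where the range walk visits it
once (3 visits).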

Eric





* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 broken VM in 2.4.10-pre9 Rob Fuller
                   ` (2 preceding siblings ...)
  2001-09-19 22:48 ` Eric W. Biederman
@ 2001-09-19 22:51 ` Bryan O'Sullivan
  3 siblings, 0 replies; 38+ messages in thread
From: Bryan O'Sullivan @ 2001-09-19 22:51 UTC (permalink / raw)
  To: Rob Fuller; +Cc: linux-kernel, linux-mm

r> I believe reverse mappings are an essential feature for memory
r> mapped files in order for Linux to support sophisticated
r> distributed file systems or distributed shared memory.

You already have the needed mechanisms for memory-mapped files in the
distributed FS case.  Distributed shared memory is much less
convincing, as DSM types have their heads irretrievably stuck up their
ar^Hcademia.

        <b

* Re: broken VM in 2.4.10-pre9
  2001-09-26 23:44               ` Pavel Machek
  2001-09-27 13:52                 ` Eric W. Biederman
@ 2001-10-01 11:37                 ` Marcelo Tosatti
  1 sibling, 0 replies; 38+ messages in thread
From: Marcelo Tosatti @ 2001-10-01 11:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Eric W. Biederman, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm


On Thu, 27 Sep 2001, Pavel Machek wrote:

> Hi!
>
> > > > > So my suggestion was to look at getting anonymous pages backed by what
> > > > > amounts to a shared memory segment.  In that vein.  By using an extent
> > > > > based data structure we can get the cost down under the current 8 bits
> > > > > per page that we have for the swap counts, and make allocating swap
> > > > > pages faster.  And we want to cluster related swap pages anyway so
> > > > > an extent based system is a natural fit.
> > > >
> > > > Much of this goes away if you get rid of both the swap and anonymous page
> > > > special cases. Back anonymous pages with the "whoops everything I write here
> > > > vanishes mysteriously" file system and swap with a swapfs
> > >
> > > What exactly is anonymous memory? I thought it is what you do when you
> > > want to malloc(), but you want to back that up by swap, not /dev/null.
> >
> > Anonymous memory is memory which is not backed by a filesystem or a
> > device. eg: malloc()ed memory, shmem, mmap(MAP_PRIVATE) on a file (which
> > will create anonymous memory as soon as the program which did the mmap
> > writes to the mapped memory (COW)), etc.
>
> So... how can alan propose to back anonymous memory with /dev/null?

I guess that by anonymous memory backed by /dev/null he means anonymous
memory backed by nothing.

> [see above] It should be backed by swap, no?

Not necessarily. As soon as we need to swap out anon memory, we have to
back it with swap (that is the job of try_to_swap_out() in mm/vmscan.c).


* Re: broken VM in 2.4.10-pre9
  2001-09-26 23:44               ` Pavel Machek
@ 2001-09-27 13:52                 ` Eric W. Biederman
  2001-10-01 11:37                 ` Marcelo Tosatti
  1 sibling, 0 replies; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-27 13:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Marcelo Tosatti, Alan Cox, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm

Pavel Machek <pavel@suse.cz> writes:

> Hi!
>
> > > > > So my suggestion was to look at getting anonymous pages backed by what
> > > > > amounts to a shared memory segment.  In that vein.  By using an extent
> > > > > based data structure we can get the cost down under the current 8 bits
> > > > > per page that we have for the swap counts, and make allocating swap
> > > > > pages faster.  And we want to cluster related swap pages anyway so
> > > > > an extent based system is a natural fit.
> > > >
> > > > Much of this goes away if you get rid of both the swap and anonymous page
> > > > special cases. Back anonymous pages with the "whoops everything I write here
> > > > vanishes mysteriously" file system and swap with a swapfs
> > >
> > > What exactly is anonymous memory? I thought it is what you do when you
> > > want to malloc(), but you want to back that up by swap, not /dev/null.
> >
> > Anonymous memory is memory which is not backed by a filesystem or a
> > device. eg: malloc()ed memory, shmem, mmap(MAP_PRIVATE) on a file (which
> > will create anonymous memory as soon as the program which did the mmap
> > writes to the mapped memory (COW)), etc.
>
> So... how can alan propose to back anonymous memory with /dev/null?
> [see above] It should be backed by swap, no?

He's not.  Alan, if I understand him correctly, is advocating removing special
cases and making it look like all pages are backed by something.
The /dev/nullfs is just until swap is allocated for that page.

I don't agree with the exact details of what Alan envisions, but I do
agree with the basic idea...

Eric


* Re: broken VM in 2.4.10-pre9
  2001-09-26 18:22             ` Marcelo Tosatti
@ 2001-09-26 23:44               ` Pavel Machek
  2001-09-27 13:52                 ` Eric W. Biederman
  2001-10-01 11:37                 ` Marcelo Tosatti
  0 siblings, 2 replies; 38+ messages in thread
From: Pavel Machek @ 2001-09-26 23:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Alan Cox, Eric W. Biederman, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm

Hi!

> > > > So my suggestion was to look at getting anonymous pages backed by what
> > > > amounts to a shared memory segment.  In that vein.  By using an extent
> > > > based data structure we can get the cost down under the current 8 bits
> > > > per page that we have for the swap counts, and make allocating swap
> > > > pages faster.  And we want to cluster related swap pages anyway so
> > > > an extent based system is a natural fit.
> > >
> > > Much of this goes away if you get rid of both the swap and anonymous page
> > > special cases. Back anonymous pages with the "whoops everything I write here
> > > vanishes mysteriously" file system and swap with a swapfs
> >
> > What exactly is anonymous memory? I thought it is what you do when you
> > want to malloc(), but you want to back that up by swap, not /dev/null.
>
> Anonymous memory is memory which is not backed by a filesystem or a
> device. eg: malloc()ed memory, shmem, mmap(MAP_PRIVATE) on a file (which
> will create anonymous memory as soon as the program which did the mmap
> writes to the mapped memory (COW)), etc.

So... how can alan propose to back anonymous memory with /dev/null?
[see above] It should be backed by swap, no?
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org


* Re: broken VM in 2.4.10-pre9
  2001-09-24 22:50           ` Pavel Machek
@ 2001-09-26 18:22             ` Marcelo Tosatti
  2001-09-26 23:44               ` Pavel Machek
  0 siblings, 1 reply; 38+ messages in thread
From: Marcelo Tosatti @ 2001-09-26 18:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Eric W. Biederman, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm


On Tue, 25 Sep 2001, Pavel Machek wrote:

> Hi!
>
> > > So my suggestion was to look at getting anonymous pages backed by what
> > > amounts to a shared memory segment.  In that vein.  By using an extent
> > > based data structure we can get the cost down under the current 8 bits
> > > per page that we have for the swap counts, and make allocating swap
> > > pages faster.  And we want to cluster related swap pages anyway so
> > > an extent based system is a natural fit.
> >
> > Much of this goes away if you get rid of both the swap and anonymous page
> > special cases. Back anonymous pages with the "whoops everything I write here
> > vanishes mysteriously" file system and swap with a swapfs
>
> What exactly is anonymous memory? I thought it is what you do when you
> want to malloc(), but you want to back that up by swap, not /dev/null.

Anonymous memory is memory which is not backed by a filesystem or a
device. eg: malloc()ed memory, shmem, mmap(MAP_PRIVATE) on a file (which
will create anonymous memory as soon as the program which did the mmap
writes to the mapped memory (COW)), etc.
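
The MAP_PRIVATE case can be observed from user space. The following is a
minimal sketch in Python, assuming a Unix-like system where the `mmap` module
exposes `MAP_PRIVATE`; the function name `private_write_leaves_file_untouched`
is made up for this example.

```python
import mmap
import tempfile


def private_write_leaves_file_untouched():
    """Write through a MAP_PRIVATE mapping and compare memory with the file."""
    with tempfile.TemporaryFile() as f:
        f.write(b"hello world\n")
        f.flush()  # make sure the file has its final size before mapping it
        m = mmap.mmap(f.fileno(), 0,
                      flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        m[0:5] = b"HELLO"    # the write triggers copy-on-write
        in_memory = m[0:5]   # what this process now sees
        m.close()
        f.seek(0)
        on_disk = f.read(5)  # what the file still contains
    return in_memory, on_disk


in_memory, on_disk = private_write_leaves_file_untouched()
```

After the write the process sees its own private copy of the page while the
file contents are unchanged, which is why the written page no longer counts
as file-backed.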


* Re: broken VM in 2.4.10-pre9
  2001-09-22  7:09                   ` Daniel Phillips
@ 2001-09-25 11:04                     ` Mike Fedyk
  0 siblings, 0 replies; 38+ messages in thread
From: Mike Fedyk @ 2001-09-25 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-mm

On Sat, Sep 22, 2001 at 09:09:10AM +0200, Daniel Phillips wrote:
> On September 21, 2001 05:27 pm, Jan Harkes wrote:
> > On Fri, Sep 21, 2001 at 10:13:11AM +0200, Daniel Phillips wrote:
> > >   - small inactive list really means large active list (and vice versa)
> > >   - aging increments need to depend on the size of the active list
> > >   - "exponential" aging may be completely bogus
> > 
> > I don't think so, whenever there is sufficient memory pressure, the scan
> > of the active list is not only done by kswapd, but also by the page
> > allocations.
> > 
> > This does have the nice effect that with a large active list on a system
> > that has a working set that fits in memory, pages basically always age
> > up, and we get an automatic used-once/drop-behind behaviour for
> > streaming data because the age of these pages is relatively low.
> > 
> > As soon as the rate of new allocations increases to the point that
> > kswapd can't keep up, which happens if the number of cached used-once
> > pages is too small or the working set expands so that it doesn't fit in
> > memory, the memory shortage causes all pages to get aggressively
> > aged down, pushing out the less frequently used pages of the working set.
> >
> > Exponential down aging simply causes us to loop fewer times in
> > do_try_to_free_pages in such situations.
> 
> In such a situation that's a horribly inefficient way to accomplish this and 
> throws away a lot of valuable information.  Consider that we're doing nothing 
> but looping in the vm in this situation, so nobody gets a chance to touch 
> pages, so nothing gets aged up.  So we are really just deactivating all the 
> pages that lie below a given threshold.
> 
> Say that the threshold happens to be 16.  We loop through the active list 5 
> times and now we have not only deactivated the pages we needed but collapsed 
> all ages between 16 and 31 to the same value, and all ages between 32 and 63 
> to just two values, losing most of the relative weighting information.
> 
> Would it not make more sense to go through the active list once, deactivate 
> all pages with age less than some computed threshold, and subtract that 
> threshold from the rest?
> 

If I understand the thread between Rik and the guy from FreeBSD (sorry,
don't remember his name), then what they are doing is they have a computed
swap level that rises as needed, and doesn't modify the aging of any of the
pages.

So, if you have pages aged 5, 7, 15, 30 and 45, each loop through
do_try_to_free_pages will raise swap_thresh by whatever increment.

Looping through, you first get the pages at 5, 7, then 15 until you swap out
enough.  While this is happening, you let the normal referencing modify the
aging, not the act of swapping.
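
A rough sketch of that scheme in Python, under assumptions of my own: the
function name, the `needed` and `increment` parameters, and the "age at or
below the threshold" test are all invented for illustration, and I have not
checked this against what FreeBSD actually does.

```python
def evict_with_rising_threshold(ages, needed, increment):
    """Raise swap_thresh each pass; page ages themselves are never modified."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    swap_thresh = 0
    evicted = []
    remaining = list(ages)
    while len(evicted) < needed and remaining:
        swap_thresh += increment  # one pass through do_try_to_free_pages
        kept = []
        for age in remaining:
            if age <= swap_thresh and len(evicted) < needed:
                evicted.append(age)  # swapped out on this pass
            else:
                kept.append(age)     # age left exactly as it was
        remaining = kept
    return evicted, remaining


evicted, remaining = evict_with_rising_threshold([5, 7, 15, 30, 45],
                                                 needed=3, increment=8)
```

With an increment of 8 the first pass takes the pages at 5 and 7, the second
takes the page at 15, and the pages at 30 and 45 keep their ages untouched.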

I know this is quite simplistic, but it may help.  What do you guys think?

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:04         ` Alan Cox
                             ` (2 preceding siblings ...)
  2001-09-20 11:28           ` Daniel Phillips
@ 2001-09-24 22:50           ` Pavel Machek
  2001-09-26 18:22             ` Marcelo Tosatti
  3 siblings, 1 reply; 38+ messages in thread
From: Pavel Machek @ 2001-09-24 22:50 UTC (permalink / raw)
  To: Alan Cox, Eric W. Biederman
  Cc: Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Hi!

> > So my suggestion was to look at getting anonymous pages backed by what
> > amounts to a shared memory segment.  In that vein.  By using an extent
> > based data structure we can get the cost down under the current 8 bits
> > per page that we have for the swap counts, and make allocating swap
> > pages faster.  And we want to cluster related swap pages anyway so
> > an extent based system is a natural fit.
>
> Much of this goes away if you get rid of both the swap and anonymous page
> special cases. Back anonymous pages with the "whoops everything I write here
> vanishes mysteriously" file system and swap with a swapfs

What exactly is anonymous memory? I thought it is what you do when you
want to malloc(), but you want to back that up by swap, not /dev/null.

								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org


* Re: broken VM in 2.4.10-pre9
  2001-09-21 15:27                 ` Jan Harkes
@ 2001-09-22  7:09                   ` Daniel Phillips
  2001-09-25 11:04                     ` Mike Fedyk
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2001-09-22  7:09 UTC (permalink / raw)
  To: Jan Harkes
  Cc: Rik van Riel, Alan Cox, Eric W. Biederman, Rob Fuller,
	linux-kernel, linux-mm

On September 21, 2001 05:27 pm, Jan Harkes wrote:
> On Fri, Sep 21, 2001 at 10:13:11AM +0200, Daniel Phillips wrote:
> >   - small inactive list really means large active list (and vice versa)
> >   - aging increments need to depend on the size of the active list
> >   - "exponential" aging may be completely bogus
>
> I don't think so, whenever there is sufficient memory pressure, the scan
> of the active list is not only done by kswapd, but also by the page
> allocations.
>
> This does have the nice effect that with a large active list on a system
> that has a working set that fits in memory, pages basically always age
> up, and we get an automatic used-once/drop-behind behaviour for
> streaming data because the age of these pages is relatively low.
>
> As soon as the rate of new allocations increases to the point that
> kswapd can't keep up, which happens if the number of cached used-once
> pages is too small or the working set expands so that it doesn't fit in
> memory, the memory shortage causes all pages to get aggressively
> aged down, pushing out the less frequently used pages of the working set.
>
> Exponential down aging simply causes us to loop fewer times in
> do_try_to_free_pages in such situations.

In such a situation that's a horribly inefficient way to accomplish this and
throws away a lot of valuable information.  Consider that we're doing nothing
but looping in the vm in this situation, so nobody gets a chance to touch
pages, so nothing gets aged up.  So we are really just deactivating all the
pages that lie below a given threshold.

Say that the threshold happens to be 16.  We loop through the active list 5
times and now we have not only deactivated the pages we needed but collapsed
all ages between 16 and 31 to the same value, and all ages between 32 and 63
to just two values, losing most of the relative weighting information.

Would it not make more sense to go through the active list once, deactivate
all pages with age less than some computed threshold, and subtract that
threshold from the rest?
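
The loss of weighting information can be shown with a small Python sketch.
It assumes a simplified model in which each pass halves every age with
integer division, and it uses four halvings because that is what it takes for
ages below 16 to reach zero in this model; the pass count in the real scan
may differ. The helper names are invented for the example.

```python
def halve(ages, passes):
    """Exponential down-aging: each pass halves every age (assumed model)."""
    return [age >> passes for age in ages]


def subtract_threshold(ages, threshold):
    """One pass: drop pages below the threshold, subtract it from the rest."""
    return [age - threshold for age in ages if age >= threshold]


ages = list(range(64))
after_halving = halve(ages, 4)
after_subtract = subtract_threshold(ages, 16)

# How many distinct ages survive in each band under each scheme
distinct_16_31_halved = len({a >> 4 for a in range(16, 32)})
distinct_32_63_halved = len({a >> 4 for a in range(32, 64)})
distinct_16_31_subtracted = len({a - 16 for a in range(16, 32)})
```

Halving leaves one distinct value for ages 16 to 31 and two for ages 32 to
63, while the single subtracting pass keeps all 16 distinct values in the
16 to 31 band.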

--
Daniel


* Re: broken VM in 2.4.10-pre9
  2001-09-22  2:14             ` Alexander Viro
@ 2001-09-22  3:09               ` Rik van Riel
  0 siblings, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2001-09-22  3:09 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Eric W. Biederman, Alan Cox, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm

On Fri, 21 Sep 2001, Alexander Viro wrote:

> It means that you prefer the system dying under much lighter load.  At
> some point any box will get into a feedback loop,

> The question being, at which point will it happen and how graceful
> will the degradation be when we get near that point.

And ... what do we do when we reach that point ?

It's obvious that we need load control to make the machine
survive at that point; load control is a horrible measure
which will make interactivity very bad, but will cause the
box to survive where otherwise it would be thrashing.

Having a better paging system would mean having the 'thrashing
point' (where we need to kick in load control) much further
out and being able to keep the system behaving better under
heavier VM loads.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: broken VM in 2.4.10-pre9
  2001-09-21  8:23           ` Eric W. Biederman
  2001-09-21 12:01             ` Rik van Riel
@ 2001-09-22  2:14             ` Alexander Viro
  2001-09-22  3:09               ` Rik van Riel
  1 sibling, 1 reply; 38+ messages in thread
From: Alexander Viro @ 2001-09-22  2:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Rik van Riel, Alan Cox, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm


On 21 Sep 2001, Eric W. Biederman wrote:

> Swapping is an important case.  But 9 times out of 10 you are managing
> memory in caches, and throwing unused pages into swap.  You aren't busily
> paging the data back and forth.  But if I have to make a choice in
> what kind of situation I want to take a performance hit, paging
> approaching thrashing or a system whose working set size is well
> within RAM, I'd rather take the hit in the system that is paging.

It means that you prefer the system dying under much lighter load.  At some
point any box will get into a feedback loop, where slowdown from VM load
makes request handling slower, which makes the temporary allocations
needed to handle these requests stay around for longer periods,
which in turn contributes to VM load.  The question is at which point
this will happen and how graceful the degradation will be when we get
near that point.


* Re: broken VM in 2.4.10-pre9
  2001-09-21  8:13               ` Daniel Phillips
  2001-09-21 12:10                 ` Rik van Riel
@ 2001-09-21 15:27                 ` Jan Harkes
  2001-09-22  7:09                   ` Daniel Phillips
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Harkes @ 2001-09-21 15:27 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Alan Cox, Eric W. Biederman, Rob Fuller,
	linux-kernel, linux-mm

On Fri, Sep 21, 2001 at 10:13:11AM +0200, Daniel Phillips wrote:
>   - small inactive list really means large active list (and vice versa)
>   - aging increments need to depend on the size of the active list
>   - "exponential" aging may be completely bogus

I don't think so, whenever there is sufficient memory pressure, the scan
of the active list is not only done by kswapd, but also by the page
allocations.

This does have the nice effect that with a large active list on a system
that has a working set that fits in memory, pages basically always age
up, and we get an automatic used-once/drop-behind behaviour for
streaming data because the age of these pages is relatively low.

As soon as the rate of new allocations increases to the point that
kswapd can't keep up, which happens if the number of cached used-once
pages is too small or the working set expands so that it doesn't fit in
memory, the memory shortage causes all pages to get aggressively
aged down, pushing out the less frequently used pages of the working set.

Exponential down aging simply causes us to loop fewer times in
do_try_to_free_pages in such situations.

Jan


* Re: broken VM in 2.4.10-pre9
  2001-09-21  8:13               ` Daniel Phillips
@ 2001-09-21 12:10                 ` Rik van Riel
  2001-09-21 15:27                 ` Jan Harkes
  1 sibling, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2001-09-21 12:10 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

On Fri, 21 Sep 2001, Daniel Phillips wrote:

> Have you tried making the down increment larger and the up increment
> smaller when the active list is larger?

This would make the page age of pages referenced in the page
tables smaller, not larger. And we already know that decreasing
the page age of heavily referenced pages isn't good.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: broken VM in 2.4.10-pre9
  2001-09-21  8:23           ` Eric W. Biederman
@ 2001-09-21 12:01             ` Rik van Riel
  2001-09-22  2:14             ` Alexander Viro
  1 sibling, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2001-09-21 12:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

On 21 Sep 2001, Eric W. Biederman wrote:

> Swapping is an important case.  But 9 times out of 10 you are managing
> memory in caches, and throwing unused pages into swap.  You aren't
> busily paging the data back and forth.  But if I have to make a choice
> in what kind of situation I want to take a performance hit, paging
> approaching thrashing or a system whose working set size is well
> within RAM, I'd rather take the hit in the system that is paging.

> Besides I also like to run a lot of shell scripts, which again stress
> the fork()/exec()/exit() path.
>
> So no I don't think keeping those paths fast is silly.

Absolutely agreed.

Ben and I have already been thinking a bit about memory
objects, so we have both reverse mappings AND we can skip
copying the page tables at fork() time (needing to clear
less at the subsequent exec(), too) ...

Of course this means I'll throw away my pte-based reverse
mapping code and will look at an object-based reverse mapping
scheme like Ben made for 2.1 and DaveM made for 2.3 ;)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: broken VM in 2.4.10-pre9
  2001-09-19 23:00         ` Rik van Riel
@ 2001-09-21  8:23           ` Eric W. Biederman
  2001-09-21 12:01             ` Rik van Riel
  2001-09-22  2:14             ` Alexander Viro
  0 siblings, 2 replies; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-21  8:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 19 Sep 2001, Eric W. Biederman wrote:
>
> > That added to the fact that last time someone ran the numbers linux
> > was considerably faster than the BSD for mm type operations when not
> > swapping.  And this is the common case.
>
> Optimising the VM for not swapping sounds kind of like
> optimising your system for doing empty fork()/exec()/exit()
> loops ;)

Swapping is an important case.  But 9 times out of 10 you are managing
memory in caches, and throwing unused pages into swap.  You aren't busily
paging the data back and forth.  But if I have to make a choice in
what kind of situation I want to take a performance hit, paging
approaching thrashing or a system whose working set size is well
within RAM, I'd rather take the hit in the system that is paging.

Further, fast IPC + fork()/exec()/exit() that programmers can count on
leads to more robust programs, because different pieces of the program
can live in different processes.  One of the reasons for the stability
of unix is that it has always had a firewall between its processes so
one bad pointer will not bring down the entire system.

Besides I also like to run a lot of shell scripts, which again stress
the fork()/exec()/exit() path.

So no I don't think keeping those paths fast is silly.

I also think that being able to get good memory usage information is
important.  I know that reverse maps make that job easier.  But just
because they make an important case easier to get right, I don't think
reverse maps are a shoo-in.

Eric





* Re: broken VM in 2.4.10-pre9
  2001-09-20 12:06             ` Rik van Riel
@ 2001-09-21  8:13               ` Daniel Phillips
  2001-09-21 12:10                 ` Rik van Riel
  2001-09-21 15:27                 ` Jan Harkes
  0 siblings, 2 replies; 38+ messages in thread
From: Daniel Phillips @ 2001-09-21  8:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

> That still doesn't mean we can't _approximate_ aging in
> another way. With linear page aging (3 up, 1 down) the
> page ages of pages referenced only in the page tables
> will still go up, albeit a tad slower than expected.
>
> It's exponential aging which makes the page age go into
> the other direction, with linear aging things seem to
> work again.
>
> I've done some experiments recently and found that (with
> reverse mappings) exponential aging is faster when we have
> a small inactive list and linear aging is faster when we
> have a large inactive list.

Have you tried making the down increment larger and the up increment smaller
when the active list is larger?  This has a natural interpretation: when the
active list is large the scanning period is longer.  During this longer scan
period an active page *should* be more likely to have its ref bit set, so it
gets a smaller boost if it is.  If not we should penalize it more heavily.

There are three points here:

  - small inactive list really means large active list (and vice versa)
  - aging increments need to depend on the size of the active list
  - "exponential" aging may be completely bogus
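
One way to read the second point, as a minimal sketch: scale the up and
down increments by how much of memory sits on the active list.  The
function name, the scaling rule, and the base values 3 and 1 (taken from
the "3 up, 1 down" figures quoted above) are all illustrative
assumptions, not code from any posted patch.

```python
def aging_increments(active, inactive, base_up=3, base_down=1):
    """Return (up, down) age steps for the given list sizes.

    Hypothetical rule: with equal list sizes the base steps apply; a larger
    active list (longer scan period) shrinks the boost and grows the penalty.
    """
    total = active + inactive
    if total == 0:
        return base_up, base_down
    # Boost shrinks as the inactive share of memory shrinks.
    up = max(1, (2 * base_up * inactive) // total)
    # Penalty grows with the active share (rounded up, never below 1).
    down = max(1, (2 * base_down * active + total - 1) // total)
    return up, down


balanced = aging_increments(active=50, inactive=50)
large_active = aging_increments(active=75, inactive=25)
small_active = aging_increments(active=25, inactive=75)
print(balanced, large_active, small_active)
```

Any monotonic rule would serve the same purpose; the point is only that
the increments become a function of list size rather than constants.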

> This means we need linear page aging with a large inactive
> list in order to let the page ages move into the right
> direction when we run a system without reverse mapping,
> the patch for that was sent to Alan yesterday.

So, the question is, does my suggestion produce essentially the same
beneficial effect?  And by the way, what are your test cases?  I'd like to
see if I can reproduce your results here.

--
Daniel


* Re: broken VM in 2.4.10-pre9
  2001-09-20 12:57             ` Alan Cox
@ 2001-09-20 13:40               ` Daniel Phillips
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2001-09-20 13:40 UTC (permalink / raw)
  To: Alan Cox; +Cc: Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

On September 20, 2001 02:57 pm, Alan Cox wrote:
> > On September 20, 2001 12:04 am, Alan Cox wrote:
> > > Reverse mappings make linear aging easier to do but are not critical (we
> > > can walk all physical pages via the page map array).
> >
> > But you can't pick up the referenced bit that way, so no up aging, only
> > down.
>
> #1 If you really wanted to you could update a referenced bit in the page
> struct in the fault handling path.

Right, we probably should do that.  But consider that any time this happens a
reverse map would have eliminated the fault because we wouldn't need to unmap
the page until we're actually going to free it.

> #2 If a page is referenced multiple times by different processes, is the
> behaviour of multiple upward aging actually wrong?

With rmap it's easy to do it either way: either treat the ref bits as if
they're all or'd together or, perhaps more sensibly, age up by an amount that
depends on the number of ref bits set, but not as much as UP_AGE * refs.
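
The two policies can be sketched as follows.  This is only an
illustration: the value UP_AGE = 3 and the logarithmic bonus are
assumptions chosen to satisfy "more than one reference counts for
something, but less than UP_AGE * refs", not a proposal from the thread.

```python
UP_AGE = 3  # assumed per-scan boost for a referenced page


def age_up_ored(refs, up_age=UP_AGE):
    """Treat all ref bits as OR'd together: any reference gives one boost."""
    return up_age if refs > 0 else 0


def age_up_sublinear(refs, up_age=UP_AGE):
    """Boost grows with the number of set ref bits, but only logarithmically."""
    if refs <= 0:
        return 0
    # (refs - 1).bit_length() is 0 for one reference, 1 for two, 2 for 3..4, ...
    return up_age + (refs - 1).bit_length()


for refs in (1, 2, 4, 8):
    print(refs, age_up_ored(refs), age_up_sublinear(refs), UP_AGE * refs)
```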

--
Daniel


* Re: broken VM in 2.4.10-pre9
  2001-09-19 21:55         ` David S. Miller
@ 2001-09-20 13:02           ` Rik van Riel
  0 siblings, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2001-09-20 13:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: ebiederm, alan, phillips, rfuller, linux-kernel, linux-mm

On Wed, 19 Sep 2001, David S. Miller wrote:

> My own personal feeling, after having tried to implement a much
> lighter weight scheme involving "anon areas", is that reverse maps or
> something similar should be looked at as a last-ditch effort.
>
> We are tons faster than anyone else in fork/exec/exit precisely
> because we keep track of so little state for anonymous pages.

Thinking about this some more, it would seem that the
"perfect fork()" would be one where you DON'T copy the
page tables, but only set the parent's page tables to
read-only and point the VMAs of the child at some kind
of memory objects.

For example, for file-backed VMAs we might already skip
the page table copying right now.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: broken VM in 2.4.10-pre9
  2001-09-20 11:28           ` Daniel Phillips
  2001-09-20 12:06             ` Rik van Riel
@ 2001-09-20 12:57             ` Alan Cox
  2001-09-20 13:40               ` Daniel Phillips
  1 sibling, 1 reply; 38+ messages in thread
From: Alan Cox @ 2001-09-20 12:57 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

> On September 20, 2001 12:04 am, Alan Cox wrote:
> > Reverse mappings make linear aging easier to do but are not critical (we
> > can walk all physical pages via the page map array).
>
> But you can't pick up the referenced bit that way, so no up aging, only
> down.

#1 If you really wanted to you could update a referenced bit in the page
struct in the fault handling path.

#2 If a page is referenced multiple times by different processes, is the
behaviour of multiple upward aging actually wrong?

Alan


* Re: broken VM in 2.4.10-pre9
  2001-09-20 11:28           ` Daniel Phillips
@ 2001-09-20 12:06             ` Rik van Riel
  2001-09-21  8:13               ` Daniel Phillips
  2001-09-20 12:57             ` Alan Cox
  1 sibling, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2001-09-20 12:06 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

On Thu, 20 Sep 2001, Daniel Phillips wrote:
> On September 20, 2001 12:04 am, Alan Cox wrote:
> > Reverse mappings make linear aging easier to do but are not critical (we
> > can walk all physical pages via the page map array).
>
> But you can't pick up the referenced bit that way, so no up aging,
> only down.

That still doesn't mean we can't _approximate_ aging in
another way. With linear page aging (3 up, 1 down) the
page ages of pages referenced only in the page tables
will still go up, albeit a tad slower than expected.

It's exponential aging which makes the page age go into
the other direction, with linear aging things seem to
work again.

I've done some experiments recently and found that (with
reverse mappings) exponential aging is faster when we have
a small inactive list and linear aging is faster when we
have a large inactive list.

This means we need linear page aging with a large inactive
list in order to let the page ages move into the right
direction when we run a system without reverse mapping,
the patch for that was sent to Alan yesterday.
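
A toy model of the difference, as a sketch under stated assumptions:
"3 up, 1 down" is from the text above, while the halving rule for
exponential aging, the cap of 64, and the one-in-three visibility of the
referenced bit are assumptions made for illustration.

```python
# Sketch only: constants and decay rules are assumptions, not kernel code.
PAGE_AGE_MAX = 64  # assumed upper bound on a page's age


def age_linear(age, referenced, up=3, down=1):
    """Linear aging: add `up` when referenced, subtract `down` otherwise."""
    if referenced:
        return min(age + up, PAGE_AGE_MAX)
    return max(age - down, 0)


def age_exponential(age, referenced, up=3):
    """Exponential aging: add `up` when referenced, halve otherwise."""
    if referenced:
        return min(age + up, PAGE_AGE_MAX)
    return age // 2


def simulate(step, seen, scans, start=0):
    """Apply `step` for `scans` passes, cycling through the `seen` pattern."""
    age = start
    for i in range(scans):
        age = step(age, seen[i % len(seen)])
    return age


# The scanner notices the page's reference on only one pass in three.
SEEN = [True, False, False]
linear_age = simulate(age_linear, SEEN, 30)
exponential_age = simulate(age_exponential, SEEN, 30)
print(linear_age, exponential_age)
```

Under these assumptions the linearly aged page gains one point per
three scans, while the exponentially aged page falls back to zero every
cycle even though it is in constant use.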

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:04         ` Alan Cox
  2001-09-19 22:26           ` Eric W. Biederman
  2001-09-19 23:05           ` Rik van Riel
@ 2001-09-20 11:28           ` Daniel Phillips
  2001-09-20 12:06             ` Rik van Riel
  2001-09-20 12:57             ` Alan Cox
  2001-09-24 22:50           ` Pavel Machek
  3 siblings, 2 replies; 38+ messages in thread
From: Daniel Phillips @ 2001-09-20 11:28 UTC (permalink / raw)
  To: Alan Cox, Eric W. Biederman; +Cc: Rob Fuller, linux-kernel, linux-mm

On September 20, 2001 12:04 am, Alan Cox wrote:
> Reverse mappings make linear aging easier to do but are not critical (we
> can walk all physical pages via the page map array).

But you can't pick up the referenced bit that way, so no up aging, only
down.

--
Daniel


* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:04         ` Alan Cox
  2001-09-19 22:26           ` Eric W. Biederman
@ 2001-09-19 23:05           ` Rik van Riel
  2001-09-20 11:28           ` Daniel Phillips
  2001-09-24 22:50           ` Pavel Machek
  3 siblings, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2001-09-19 23:05 UTC (permalink / raw)
  To: Alan Cox
  Cc: Eric W. Biederman, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

On Wed, 19 Sep 2001, Alan Cox wrote:

> "Linux VM works wonderfully when nobody is using it"

"This OS is optimised for lmbench"


cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: broken VM in 2.4.10-pre9
  2001-09-19 21:03       ` Eric W. Biederman
  2001-09-19 22:04         ` Alan Cox
@ 2001-09-19 23:00         ` Rik van Riel
  2001-09-21  8:23           ` Eric W. Biederman
  1 sibling, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2001-09-19 23:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

On 19 Sep 2001, Eric W. Biederman wrote:

> That added to the fact that last time someone ran the numbers linux
> was considerably faster than the BSD for mm type operations when not
> swapping.  And this is the common case.

Optimising the VM for not swapping sounds kind of like
optimising your system for doing empty fork()/exec()/exit()
loops ;)

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:04         ` Alan Cox
@ 2001-09-19 22:26           ` Eric W. Biederman
  2001-09-19 23:05           ` Rik van Riel
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-19 22:26 UTC (permalink / raw)
  To: Alan Cox; +Cc: Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> Much of this goes away if you get rid of both the swap and anonymous page
> special cases. Back anonymous pages with the "whoops everything I write here
> vanishes mysteriously" file system and swap with a swapfs

Essentially.  Though that is just the strategy; it doesn't cut to the heart of the
problems that need to be addressed.  The trickiest part is to allocate persistent
IDs to the pages that don't require us to fragment the VMAs.

> Reverse mappings make linear aging easier to do but are not critical (we
> can walk all physical pages via the page map array).

Agreed.

What I find interesting about the 2.4.x VM is that most of the large
problems people have seen were not stupid design mistakes in the VM
but small interaction glitches between various pieces of code.

Eric


* Re: broken VM in 2.4.10-pre9
  2001-09-19 21:03       ` Eric W. Biederman
@ 2001-09-19 22:04         ` Alan Cox
  2001-09-19 22:26           ` Eric W. Biederman
                             ` (3 more replies)
  2001-09-19 23:00         ` Rik van Riel
  1 sibling, 4 replies; 38+ messages in thread
From: Alan Cox @ 2001-09-19 22:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

> That added to the fact that last time someone ran the numbers linux
> was considerably faster than the BSD for mm type operations when not
> swapping.  And this is the common case.

"Linux VM works wonderfully when nobody is using it"

Which is rather like a scheduler that works well for one task and then by
three is making bad decisions.

> But I have not seen the argument that not having reverse maps makes it
> undoable.  In fact previous versions of linux seem to provide the proof
> that you can get at least reasonable swapping under load without
> reverse page tables.

The last decent Linux VM behaviour was about 2.1.100 or so - which was
without reverse maps. It's been downhill since then. So yes you may be
right.

> So my suggestion was to look at getting anonymous pages backed by what
> amounts to a shared memory segment.  In that vein, by using an extent
> based data structure we can get the cost down under the current 8 bits
> per page that we have for the swap counts, and make allocating swap
> pages faster.  And we want to cluster related swap pages anyway, so
> an extent based system is a natural fit.

Much of this goes away if you get rid of both the swap and anonymous page
special cases. Back anonymous pages with the "whoops everything I write here
vanishes mysteriously" file system and swap with a swapfs.

Reverse mappings make linear aging easier to do but are not critical (we
can walk all physical pages via the page map array).

Alan


* Re: broken VM in 2.4.10-pre9
  2001-09-19 21:37       ` Eric W. Biederman
@ 2001-09-19 21:55         ` David S. Miller
  2001-09-20 13:02           ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: David S. Miller @ 2001-09-19 21:55 UTC (permalink / raw)
  To: ebiederm; +Cc: alan, phillips, rfuller, linux-kernel, linux-mm

   That I think is a significant cost.

My own personal feeling, after having tried to implement a much
lighter weight scheme involving "anon areas", is that reverse maps or
something similar should be looked at as a last-ditch effort.

We are tons faster than anyone else in fork/exec/exit precisely
because we keep track of so little state for anonymous pages.

Later,
David S. Miller
davem@redhat.com


* Re: broken VM in 2.4.10-pre9
  2001-09-19 19:45     ` Alan Cox
  2001-09-19 21:03       ` Eric W. Biederman
@ 2001-09-19 21:37       ` Eric W. Biederman
  2001-09-19 21:55         ` David S. Miller
  1 sibling, 1 reply; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-19 21:37 UTC (permalink / raw)
  To: Alan Cox; +Cc: Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> > On September 17, 2001 06:03 pm, Eric W. Biederman wrote:
> > > In linux we have avoided reverse maps (unlike the BSD's) which tends
> > > to make the common case fast at the expense of making it more
> > > difficult to handle times when the VM system is under extreme load and
> > > we are swapping etc.
> >
> > What do you suppose is the cost of the reverse map?  I get the impression you
> > think it's more expensive than it is.
>
> We can keep the typical page table cost lower than now (including reverse
> maps) just by doing some common sense small cleanups to get the page struct
> down to 48 bytes on x86

While there is a size cost, I suspect you will notice reverse maps
a lot more in operations like fork, where having them triples the amount
of memory that you need to copy.  So you should see the time it takes
to do a fork double or more.

That I think is a significant cost.

Eric


* Re: broken VM in 2.4.10-pre9
  2001-09-19 19:45     ` Alan Cox
@ 2001-09-19 21:03       ` Eric W. Biederman
  2001-09-19 22:04         ` Alan Cox
  2001-09-19 23:00         ` Rik van Riel
  2001-09-19 21:37       ` Eric W. Biederman
  1 sibling, 2 replies; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-19 21:03 UTC (permalink / raw)
  To: Alan Cox; +Cc: Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> > On September 17, 2001 06:03 pm, Eric W. Biederman wrote:
> > > In linux we have avoided reverse maps (unlike the BSD's) which tends
> > > to make the common case fast at the expense of making it more
> > > difficult to handle times when the VM system is under extreme load and
> > > we are swapping etc.
> >
> > What do you suppose is the cost of the reverse map?  I get the impression you
> > think it's more expensive than it is.
>
> We can keep the typical page table cost lower than now (including reverse
> maps) just by doing some common sense small cleanups to get the page struct
> down to 48 bytes on x86

I have to admit the first time I looked at reverse maps our struct page
was much lighter weight than now (64 bytes x86 UP), and our cost per
page was noticeably fewer bytes than the BSDs:

  average_mem_per_page = sizeof(struct page) + sizeof(pte_t)
                         + sizeof(reverse_pte_t) * average_user_per_page

But struct page has grown pretty significantly since then, and could
use a cleanup.
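
Plugging numbers into that formula gives a feel for the sizes involved.
This is a worked sketch, not a measurement: the 64-byte and 48-byte
struct page sizes come from this thread, the 4-byte pte_t is the
non-PAE x86 size, and the 8-byte reverse_pte_t (two pointers) and one
user per page are assumptions.

```python
def average_mem_per_page(sizeof_page, sizeof_pte, sizeof_reverse_pte,
                         average_user_per_page):
    """Per-page bookkeeping cost in bytes, following the formula above."""
    return sizeof_page + sizeof_pte + sizeof_reverse_pte * average_user_per_page


PTE = 4          # bytes, x86 without PAE
REVERSE_PTE = 8  # bytes, ASSUMED: one pte pointer plus one next pointer

no_rmap = average_mem_per_page(64, PTE, 0, 1)              # today's struct page
with_rmap = average_mem_per_page(64, PTE, REVERSE_PTE, 1)  # rmap added as-is
slim_rmap = average_mem_per_page(48, PTE, REVERSE_PTE, 1)  # rmap + 48-byte page
print(no_rmap, with_rmap, slim_rmap)
```

With these assumed sizes, a 48-byte struct page plus a reverse map entry
costs less per page than the current 64-byte struct page without one.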

So I figure it is worth going through and computing the costs of
reverse page tables, and not dismissing them out of hand.  But the
fact that the linux VM could get good performance in most
circumstances without reverse page tables has always enchanted me.

That added to the fact that last time someone ran the numbers linux
was considerably faster than the BSD for mm type operations when not
swapping.  And this is the common case.

I admit reverse page tables make it easier under a high load to get
good paging performance, as the algorithms are more straightforward.
But I have not seen the argument that not having reverse maps makes it
undoable.  In fact previous versions of linux seem to provide the proof
that you can get at least reasonable swapping under load without
reverse page tables.

There is also the cache thrashing case.  While scanning page table
entries it is probably impossible to prevent cache thrashing, but
reverse page tables look like they make it worse.

With respect to the current VM the primary complaint I have heard is
that anonymous pages are not in the page cache so cannot be aged.  At
least that was the complaint that started this thread.  For adding
pages to the page cache we currently have conflicting tensions.  Do we
want it in the page cache to age better or do we not want to allocate
the swap space yet?

So my suggestion was to look at getting anonymous pages backed by what
amounts to a shared memory segment.  In that vein, by using an extent
based data structure we can get the cost down under the current 8 bits
per page that we have for the swap counts, and make allocating swap
pages faster.  And we want to cluster related swap pages anyway, so
an extent based system is a natural fit.

If we lose the requirement that swapped out pages need to be in the
page tables, it becomes a trivial issue to drop page tables with all of
their pages swapped out.  Plus there are a million other special cases
we can remove from the current VM.

So right now I can see a bigger benefit from anonymous pages with a
``backing store'' than I can from reverse maps.

Eric




* Re: broken VM in 2.4.10-pre9
  2001-09-19  9:45   ` Daniel Phillips
@ 2001-09-19 19:45     ` Alan Cox
  2001-09-19 21:03       ` Eric W. Biederman
  2001-09-19 21:37       ` Eric W. Biederman
  0 siblings, 2 replies; 38+ messages in thread
From: Alan Cox @ 2001-09-19 19:45 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

> On September 17, 2001 06:03 pm, Eric W. Biederman wrote:
> > In linux we have avoided reverse maps (unlike the BSD's) which tends
> > to make the common case fast at the expense of making it more
> > difficult to handle times when the VM system is under extreme load and
> > we are swapping etc.
>
> What do you suppose is the cost of the reverse map?  I get the impression you
> think it's more expensive than it is.

We can keep the typical page table cost lower than now (including reverse
maps) just by doing some common sense small cleanups to get the page struct
down to 48 bytes on x86


* Re: broken VM in 2.4.10-pre9
  2001-09-17 16:03 ` Eric W. Biederman
@ 2001-09-19  9:45   ` Daniel Phillips
  2001-09-19 19:45     ` Alan Cox
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2001-09-19  9:45 UTC (permalink / raw)
  To: Eric W. Biederman, Rob Fuller; +Cc: linux-kernel, linux-mm

On September 17, 2001 06:03 pm, Eric W. Biederman wrote:
> In linux we have avoided reverse maps (unlike the BSD's) which tends
> to make the common case fast at the expense of making it more
> difficult to handle times when the VM system is under extreme load and
> we are swapping etc.

What do you suppose is the cost of the reverse map?  I get the impression you 
think it's more expensive than it is.

--
Daniel

* Re: broken VM in 2.4.10-pre9
  2001-09-17 15:40 Rob Fuller
@ 2001-09-17 16:03 ` Eric W. Biederman
  2001-09-19  9:45   ` Daniel Phillips
  0 siblings, 1 reply; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-17 16:03 UTC (permalink / raw)
  To: Rob Fuller; +Cc: linux-kernel, linux-mm

"Rob Fuller" <rfuller@nsisoftware.com> writes:

> One argument for reverse mappings is distributed shared memory or
> distributed file systems and their interaction with memory mapped
> files.  For example, a distributed file system may need to invalidate a specific
> page of a file that may be mapped multiple times on a node.

To reduce the time for an invalidate is indeed a good argument for
reverse maps.  However this is generally the uncommon case, and it is
fine to leave these kinds of things on the slow path.  From struct page
we currently go to struct address_space to lists of struct vm_area,
which works but is just a little slower (but generally cheaper) than
having a reverse map.
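
That lookup path can be modelled roughly as below.  It is a simplified
sketch: the class and field names only loosely echo the kernel's, the
flat list of areas and the 4 KiB page size are assumptions, and real
code would still have to walk each area's page tables to find the pte.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes


@dataclass
class VMArea:
    vm_start: int  # first virtual address of the mapping
    vm_pgoff: int  # file page index mapped at vm_start
    npages: int    # length of the mapping in pages


@dataclass
class AddressSpace:
    vmas: list = field(default_factory=list)  # every area mapping this file


def mappings_of(mapping, index):
    """Return each virtual address at which file page `index` is mapped.

    Walks every area attached to the file, which is the slow path
    described above; no per-page reverse entries are kept.
    """
    hits = []
    for vma in mapping.vmas:
        if vma.vm_pgoff <= index < vma.vm_pgoff + vma.npages:
            hits.append(vma.vm_start + (index - vma.vm_pgoff) * PAGE_SIZE)
    return hits


demo = AddressSpace([VMArea(0x10000, 0, 4),
                     VMArea(0x80000, 2, 4),
                     VMArea(0xA0000, 10, 2)])
print([hex(addr) for addr in mappings_of(demo, 3)])
```

The cost of an invalidate here scales with the number of areas mapping
the file rather than with the number of mappings of the one page.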

Since Rik was not seeing the invalidate or the unmap case as the
bottleneck, reverse mappings are not needed, simply something
with a similar effect on the VM.

In linux we have avoided reverse maps (unlike the BSD's) which tends
to make the common case fast at the expense of making it more
difficult to handle times when the VM system is under extreme load and
we are swapping etc.

Eric


* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:12   ` Rik van Riel
@ 2001-09-17 15:45     ` Eric W. Biederman
  0 siblings, 0 replies; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-17 15:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 17 Sep 2001, Eric W. Biederman wrote:
> 
> > There is an alternative approach to have better aging information.
> 
> [snip incomplete description of data structure]
> 
> What you didn't explain is how your idea is related to
> aging.

Sorry I thought you had been staring at the problem long enough to
see.  In any case the problem with the current code is that you can't
put all pages in the swap cache immediately because you don't want to
allocate the swap space just yet.  And without being in the swap cache
aging isn't especially effective.

By using something like a shared memory segment behind every anonymous
page, you can put the page in the swap cache before you allocate swap
for it (because it has a persistent identity).  Further, since you no
longer need counts for every swap page, you can deallocate swap space
from pages simply by walking through the ``indirect pages'' and
removing the reference to swap space.

> > > For 2.5 I'm making a VM subsystem with reverse mappings, the
> > > first iterations are giving very sweet performance so I will
> > > continue with this project regardless of what other kernel
> > > hackers might say ;)
> >
> > Do you have any arguments for the reverse mappings or just for some of
> > the other side effects that go along with them?
> 
> Mainly for the side effects, but until somebody comes
> up with another idea to achieve all the side effects I'm
> not giving up on reverse mappings. If you can achieve
> all the good stuff in another way, show it.

I think I can; I haven't had time to implement it.  Given the way Alan
and some of the others were talking I thought my idea had long ago been
thought of and put on the plate for 2.5.  If it really is a new idea
under the sun I'll look at implementing it as soon as I have a hole
in my schedule.

Eric

* RE: broken VM in 2.4.10-pre9
@ 2001-09-17 15:40 Rob Fuller
  2001-09-17 16:03 ` Eric W. Biederman
  0 siblings, 1 reply; 38+ messages in thread
From: Rob Fuller @ 2001-09-17 15:40 UTC (permalink / raw)
  To: Rik van Riel, Eric W. Biederman; +Cc: linux-kernel, linux-mm

One argument for reverse mappings is distributed shared memory or
distributed file systems and their interaction with memory mapped files.
For example, a distributed file system may need to invalidate a specific
page of a file that may be mapped multiple times on a node.

This may be a naive argument given my limited knowledge of Linux memory
management internals.  If so, I will refrain from posting this sort of
thing in the future.  Let me know.

> -----Original Message-----
> From: Rik van Riel [mailto:riel@conectiva.com.br]
> Sent: Monday, September 17, 2001 7:13 AM
> To: Eric W. Biederman
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: broken VM in 2.4.10-pre9
> 
> 
> On 17 Sep 2001, Eric W. Biederman wrote:

<snip>

> > Do you have any arguments for the reverse mappings or just for some of
> > the other side effects that go along with them?
> 
> Mainly for the side effects, but until somebody comes
> up with another idea to achieve all the side effects I'm
> not giving up on reverse mappings. If you can achieve
> all the good stuff in another way, show it.

* Re: broken VM in 2.4.10-pre9
  2001-09-17  8:06 ` Eric W. Biederman
@ 2001-09-17 12:12   ` Rik van Riel
  2001-09-17 15:45     ` Eric W. Biederman
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2001-09-17 12:12 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, linux-mm

On 17 Sep 2001, Eric W. Biederman wrote:

> There is an alternative approach to have better aging information.

[snip incomplete description of data structure]

What you didn't explain is how your idea is related to
aging.

> > For 2.5 I'm making a VM subsystem with reverse mappings, the
> > first iterations are giving very sweet performance so I will
> > continue with this project regardless of what other kernel
> > hackers might say ;)
>
> Do you have any arguments for the reverse mappings or just for some of
> the other side effects that go along with them?

Mainly for the side effects, but until somebody comes
up with another idea to achieve all the side effects I'm
not giving up on reverse mappings. If you can achieve
all the good stuff in another way, show it.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


* Re: broken VM in 2.4.10-pre9
       [not found] <Pine.LNX.4.33L.0109161330000.9536-100000@imladris.rielhome.conectiva>
@ 2001-09-17  8:06 ` Eric W. Biederman
  2001-09-17 12:12   ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Eric W. Biederman @ 2001-09-17  8:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 16 Sep 2001, Michael Rothwell wrote:
> 
> > Is there a way to tell the VM to prune its cache? Or a way to limit
> > the amount of cache it uses?
> 
> Not yet, I'll make a quick hack for this when I get back next
> week. It's pretty obvious now that the 2.4 kernel cannot get
> enough information to select the right pages to evict from
> memory.

Hmm.  Perhaps, or perhaps it is using the information poorly.
There is an alternative approach to getting better aging information.

An address_space can be allocated per mm_struct, and all of the
anonymous pages can be allocated to that address_space.  The
address_space can then have an array or, better, a tree of extents
that list which indexes correspond to which swap pages, with some
pages not being backed.

Getting the allocation of indices correct so that merging will work
is a little trickier than now, as is the case of a private writeable
mapping of a file.  But in a lot of other ways the logic becomes
simpler.
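
[Editorial illustration, not part of the original message.]  The
extent idea above can be sketched in userspace C as follows.  This is
only a minimal sketch under stated assumptions: every name here
(anon_space, swap_extent, anon_lookup, anon_append, NO_SWAP,
MAX_EXTENTS) is hypothetical and not kernel API, a sorted array stands
in for the tree, and mappings are assumed to be added in increasing
index order.

```c
#include <assert.h>
#include <stddef.h>

/* All names are hypothetical; this is not kernel code. */
#define MAX_EXTENTS 16
#define NO_SWAP ((unsigned long)-1)   /* "this index is not backed by swap" */

/* One run of consecutive page indexes backed by consecutive swap slots. */
struct swap_extent {
    unsigned long first_index;  /* first page index in the anonymous space */
    unsigned long nr_pages;     /* how many consecutive indexes are covered */
    unsigned long first_slot;   /* swap slot that backs first_index */
};

/* Stand-in for a per-mm address_space holding the extent list. */
struct anon_space {
    struct swap_extent ext[MAX_EXTENTS];  /* sorted, non-overlapping */
    size_t nr;                            /* extents in use */
};

/* Binary search: return the swap slot for index, or NO_SWAP if unbacked. */
unsigned long anon_lookup(const struct anon_space *as, unsigned long index)
{
    size_t lo = 0, hi = as->nr;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct swap_extent *e = &as->ext[mid];

        if (index < e->first_index)
            hi = mid;
        else if (index >= e->first_index + e->nr_pages)
            lo = mid + 1;
        else
            return e->first_slot + (index - e->first_index);
    }
    return NO_SWAP;
}

/*
 * Record that index is backed by slot.  Indexes must arrive in
 * increasing order.  If both the index and the slot continue the last
 * extent, the extent is grown (merged) instead of adding a new one.
 * Returns 0 on success, -1 if the fixed-size array is full.
 */
int anon_append(struct anon_space *as, unsigned long index, unsigned long slot)
{
    if (as->nr > 0) {
        struct swap_extent *last = &as->ext[as->nr - 1];

        assert(index >= last->first_index + last->nr_pages);
        if (index == last->first_index + last->nr_pages &&
            slot == last->first_slot + last->nr_pages) {
            last->nr_pages++;
            return 0;
        }
    }
    if (as->nr == MAX_EXTENTS)
        return -1;
    as->ext[as->nr++] = (struct swap_extent){ index, 1, slot };
    return 0;
}
```

In this sketch, merging only happens when index allocation and swap
slot allocation stay contiguous together, which is one way to read the
remark that getting the allocation of indices correct is the tricky
part.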
 
> For 2.5 I'm making a VM subsystem with reverse mappings, the
> first iterations are giving very sweet performance so I will
> continue with this project regardless of what other kernel
> hackers might say ;)

Do you have any arguments for the reverse mappings or just for some of
the other side effects that go along with them?

Eric


end of thread, other threads:[~2001-10-01 11:37 UTC | newest]

Thread overview: 38+ messages
2001-09-19 22:15 broken VM in 2.4.10-pre9 Rob Fuller
2001-09-19 22:21 ` David S. Miller
2001-09-19 22:30 ` Alan Cox
2001-09-19 22:48 ` Eric W. Biederman
2001-09-19 22:51 ` Bryan O'Sullivan
  -- strict thread matches above, loose matches on Subject: below --
2001-09-17 15:40 Rob Fuller
2001-09-17 16:03 ` Eric W. Biederman
2001-09-19  9:45   ` Daniel Phillips
2001-09-19 19:45     ` Alan Cox
2001-09-19 21:03       ` Eric W. Biederman
2001-09-19 22:04         ` Alan Cox
2001-09-19 22:26           ` Eric W. Biederman
2001-09-19 23:05           ` Rik van Riel
2001-09-20 11:28           ` Daniel Phillips
2001-09-20 12:06             ` Rik van Riel
2001-09-21  8:13               ` Daniel Phillips
2001-09-21 12:10                 ` Rik van Riel
2001-09-21 15:27                 ` Jan Harkes
2001-09-22  7:09                   ` Daniel Phillips
2001-09-25 11:04                     ` Mike Fedyk
2001-09-20 12:57             ` Alan Cox
2001-09-20 13:40               ` Daniel Phillips
2001-09-24 22:50           ` Pavel Machek
2001-09-26 18:22             ` Marcelo Tosatti
2001-09-26 23:44               ` Pavel Machek
2001-09-27 13:52                 ` Eric W. Biederman
2001-10-01 11:37                 ` Marcelo Tosatti
2001-09-19 23:00         ` Rik van Riel
2001-09-21  8:23           ` Eric W. Biederman
2001-09-21 12:01             ` Rik van Riel
2001-09-22  2:14             ` Alexander Viro
2001-09-22  3:09               ` Rik van Riel
2001-09-19 21:37       ` Eric W. Biederman
2001-09-19 21:55         ` David S. Miller
2001-09-20 13:02           ` Rik van Riel
     [not found] <Pine.LNX.4.33L.0109161330000.9536-100000@imladris.rielhome.conectiva>
2001-09-17  8:06 ` Eric W. Biederman
2001-09-17 12:12   ` Rik van Riel
2001-09-17 15:45     ` Eric W. Biederman
