Re: MADV_SPACEAVAIL and MADV_FREE in pre2-3

linux-mm.kvack.org archive mirror

* Re: MADV_SPACEAVAIL and MADV_FREE in pre2-3
       [not found] <20000320135939.A3390@pcep-jamie.cern.ch>
@ 2000-03-20 19:09 ` Chuck Lever
  2000-03-21  1:20   ` madvise (MADV_FREE) Jamie Lokier
                     ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Chuck Lever @ 2000-03-20 19:09 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-mm

jamie-

i've moved this discussion to linux-mm where we were just discussing the
madvise() implementation.

On Mon, 20 Mar 2000, Jamie Lokier wrote:
> Chuck Lever wrote:
> > > Besides, MADV_FREE would be quite useful.  MADV_DONTNEED doesn't do the
> > > right thing for free(3) and similar things.

ok, i don't understand why you think this.  and besides, free(3) doesn't
shrink the heap currently, i believe.  this would work if free(3) used
sbrk() to shrink the heap in an intelligent fashion, freeing kernel VM
resources along the way.  if you want something to help free(3), i would
favor this design instead.

> No idea.  Didn't you see my message about the collected meanings of
> different MADV_ flags on different systems?

yes, i saw it, but perhaps didn't understand it completely.

> In particular, using the name MADV_DONTNEED is a really bad idea.  It
> means completely different things on different OSes.  For example your
> meaning of MADV_DONTNEED is different to BSD's: a program that assumes
> the BSD behaviour may well crash with your implementation and will
> almost certainly give invalid results if it doesn't crash.

i'm more concerned about portability from operating systems like Solaris,
because there are many more server applications there than on *BSD that
have been designed to use these interfaces.  i'm not saying the *BSD way
is wrong, but i think it would be a more useful compromise to make *BSD
functionality available via some other interface (like MADV_ZERO).

> [Aside: is there the possibility to have mincore return the "!accessed"
> and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned
> bytes?  I can imagine a bunch of garbage collection algorithms that
> could make good use of those bits.  Currently some GC systems mprotect()
> regions and unprotect them on SEGV -- simply reading the !dirty status
> would obviously be much simpler and faster.]

you could add that; the question is how to do it while not breaking
applications that do this:

if (!byte) {
   /* page not present */
}

rather than checking the LSB specifically.  i think using "dirty" instead
of "!dirty" would help.  the "accessed" bit is only used by the
shrink_mmap logic to "time out" a page as memory gets short; i'm not sure
that's a semantic that is useful to a user-level garbage collector?  and
it probably isn't very portable.

[ jamie's earlier summary included below for context, with commentary ]

> 1. A hint to the VM system: I've finished using this data.  If it's
>    modified, you can write it back right away.  If not, you can discard
>    it.  FreeBSD's MADV_DONTNEED does this, but DU's doesn't.
> 
> FreeBSD:
> >  MADV_DONTNEED    Allows the VM system to decrease the in-memory priority
> >                   of pages in the specified range.  Additionally future
> >                   references to this address range will incur a page
> >                   fault.
> 
>    To avoid ambiguity, perhaps we could call this one MADV_DONE?
> 
>    In BSD compatibility mode, Glibc would define MADV_DONTNEED to be
>    MADV_DONE.  In standard mode it would not define MADV_DONTNEED at all.

my preference is for the DU semantic of tossing dirty data instead of
flushing onto backing store, simply because that's what so many
applications expect DONTNEED to do.

as far as i can tell, linux's msync(MS_INVALIDATE) behaves like freeBSD's
MADV_DONTNEED.

> 2. Zeroing a range in a private map.  DU's MADV_DONTNEED does this --
>    that's my reading of the man page.
> 
> Digital Unix: (?yes)
> >   MADV_DONTNEED   Do not need these pages
> >                   The system will free any whole pages in the specified
> >                   region.  All modifications will be lost and any swapped
> >                   out pages will be discarded.  Subsequent access to the
> >                   region will result in a zero-fill-on-demand fault as
> >                   though it is being accessed for the first time.
> >                   Reserved swap space is not affected by this call.
> 
>    For Linux, simply read /dev/zero into the selected range.  The kernel
>    already optimises this case for anonymous mappings.
> 
>    If doing it in general turns out to be too hard to implement, I
>    propose MADV_ZERO should have this effect: exactly like reading
>    /dev/zero into the range, but always efficient.

linux's MADV_DONTNEED currently doesn't clear the MADV_DONTNEED area.  but
it would be easy to add, perhaps as a separate MADV_ZERO as you describe
below.

> 3. Zeroing a range in a shared map.
> 
>    I have no idea if DU's MADV_DONTNEED has this effect, or whether it
>    only has this effect on shared anonymous mappings.
> 
>    In any case, reading /dev/zero into the range will always have the
>    desired effect, and Stephen's work will eventually make this
>    efficient on Linux.
> 
>    Again, if the kiobuf work doesn't have the desired effect, I propose
>    MADV_ZERO should be exactly like reading /dev/zero into the range,
>    and efficiently if the underlying mapped object can do so
>    efficiently.

MADV_ZERO makes sense to me as an efficient way to zero a range of
addresses in a mapping.  but i think it's useful as a *separate* function,
not as combined with, say, MADV_DONTNEED.

> 4. Deferred freeing of pages.  FreeBSD's MADV_FREE does this, according
>    to the posted manual snippet.  I like this very much -- it is perfect
>    for a wide variety of memory allocators.
> 
> FreeBSD:
> >  MADV_FREE        Gives the VM system the freedom to free pages, and tells
> >                   the system that information in the specified page range
> >                   is no longer important.  This is an efficient way of al-
> >                   lowing malloc(3) to free pages anywhere in the address
> >                   space, while keeping the address space valid.  The next
> >                   time that the page is referenced, the page might be de-
> >                   mand zeroed, or might contain the data that was there
> >                   before the MADV_FREE call.  References made to that ad-
> >                   dress space range will not make the VM system page the
> >                   information back in from backing store until the page is
> >                   modified again.
> 
>    I like this so much I started coding it a long time ago, as an
>    mdiscard syscall.  But then I got onto something else.
> 
>    The principle here is very simple: MADV_FREE marks all the pages in
>    the region as "discardable", and clears the accessed and dirty bits
>    of those pages.
> 
>    Later when the kernel needs to free some memory, it is permitted to
>    free "discardable" pages immediately provided they are still not
>    accessed or dirty.  When vmscan is clearing the accessed and dirty
>    bits on pages, if they were set it must clear the "discardable" bit.
> 
>    This allows malloc() and other user space allocators to free pages
>    back to the system.  Unlike DU's MADV_DONTNEED, or mmapping
>    /dev/zero, if the system does not need the page there is no
>    inefficient zero-copy.  If there was, malloc() would be better off
>    not bothering to return the pages.

unless i've completely misunderstood what you are proposing, this is what
MADV_DONTNEED does today, except it doesn't schedule the "freed" pages for
disposal ahead of other pages in the system.  but that should be easy
enough to add once the semantics are nailed down and the bugs have been
eliminated.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* madvise (MADV_FREE)
  2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever
@ 2000-03-21  1:20   ` Jamie Lokier
  2000-03-21  2:24     ` William J. Earl
  2000-03-22 16:24     ` Chuck Lever
  2000-03-21  1:29   ` MADV_DONTNEED Jamie Lokier
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 55+ messages in thread
From: Jamie Lokier @ 2000-03-21  1:20 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

Hi Chuck

About MADV_FREE
---------------

> >    The principle here is very simple: MADV_FREE marks all the pages in
> >    the region as "discardable", and clears the accessed and dirty bits
> >    of those pages.
> > 
> >    Later when the kernel needs to free some memory, it is permitted to
> >    free "discardable" pages immediately provided they are still not
> >    accessed or dirty.  When vmscan is clearing the accessed and dirty
> >    bits on pages, if they were set it must clear the "discardable" bit.
> > 
> >    This allows malloc() and other user space allocators to free pages
> >    back to the system.  Unlike DU's MADV_DONTNEED, or mmapping
> >    /dev/zero, if the system does not need the page there is no
> >    inefficient zero-copy.  If there was, malloc() would be better off
> >    not bothering to return the pages.
> 
> unless i've completely misunderstood what you are proposing, this is what
> MADV_DONTNEED does today,

No, your MADV_DONTNEED _always_ discards the data in those pages.  That
makes it too inefficient for application memory allocators, because they
will often want to reuse some of the pages soon after.  You don't want
redundant page zeroing, and you don't want to give up memory which is
still nice and warm in the CPU's cache.  Unless the kernel has a better
use for it than you.

MADV_FREE on the other hand simply permits the kernel to reclaim those
pages, if it is under memory pressure.

If there is no pressure, the pages are reused by the application
unchanged.  In this way different subsystems competing for memory get to
share it out -- essentially the fairness mechanisms in the kernel are
extending to application page management.  And the application hardly
knows a thing about it.
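
To make the "discardable" bookkeeping described above concrete, here is a
small user-space simulation of it.  It is only a sketch: struct sim_page
and the sim_* functions are hypothetical names invented for illustration,
not kernel data structures, and write-back of dirty pages is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state; illustrative only, not a kernel structure. */
struct sim_page {
    bool present;      /* the page currently has memory behind it */
    bool accessed;     /* set whenever the page is touched */
    bool dirty;        /* set whenever the page is written */
    bool discardable;  /* set by the MADV_FREE-like call */
};

/* MADV_FREE-like call: mark pages discardable, clear accessed and dirty. */
void sim_madv_free(struct sim_page *pg, size_t n)
{
    assert(pg != NULL);
    for (size_t i = 0; i < n; i++) {
        pg[i].discardable = true;
        pg[i].accessed = false;
        pg[i].dirty = false;
    }
}

/* The application writes to a page, possibly one it freed earlier. */
void sim_touch_write(struct sim_page *p)
{
    assert(p != NULL);
    if (!p->present) {
        /* The page was reclaimed: model a demand-zero fault. */
        p->present = true;
        p->discardable = false;
    }
    p->accessed = true;
    p->dirty = true;
}

/*
 * Reclaim pass under memory pressure.  A discardable page that is still
 * neither accessed nor dirty is dropped at once.  A page that was touched
 * after the MADV_FREE-like call loses its discardable mark and is kept.
 * Returns the number of pages dropped.
 */
size_t sim_reclaim(struct sim_page *pg, size_t n)
{
    size_t freed = 0;

    assert(pg != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!pg[i].present)
            continue;
        if (pg[i].accessed || pg[i].dirty) {
            pg[i].discardable = false;  /* the data matters again */
            continue;
        }
        if (pg[i].discardable) {
            pg[i].present = false;
            pg[i].discardable = false;
            freed++;
        }
    }
    return freed;
}
```

If two pages are marked with sim_madv_free() and one of them is written
again before sim_reclaim() runs, only the untouched one is dropped; the
rewritten page keeps its contents.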

Here's why MADV_FREE works, and the other things don't:

A typical memory allocator creates holes in its heap, which the kernel
has to swap out if it needs memory.  I guess about 1/4 of all data in
swap is this kind of junk (but it's just a guess).

But it's quite inefficient for an allocator to unconditionally give
pages back to the kernel.  The cost-benefit is "cost of giving page to
kernel" vs. "cost of maybe paging out".  The cost of giving up
pages is significant: each one implies a COW fault, clear_page
when you reuse the page, and loss of cache-warm memory.

You assume a page is not likely to swap, because there's a reasonable
chance the application will reallocate it before that happens.  So on
balance, giving pages unconditionally to the kernel is a loss.

--> No sane free(3) would call MADV_DONTNEED or msync(MS_INVALIDATE).

A better application allocator would base decisions about when to return
pages to the kernel on the likelihood of swapping and measured cost of
swapping vs. retaining pages.  Of course that's very difficult and
system specific.  And really only the kernel has access to all the
information on memory pressure.

So the best arrangement is to let the kernel make page reclamation
decisions.  And if a page is not reclaimed before it is reused, let the
application reuse the page unchanged and cache-warm.

MADV_FREE is the mechanism for doing that.  And it's a very nice, simple
one to use.  Paging decisions stay in the kernel where they belong.
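
For illustration, this is roughly what the allocator side could look
like.  It is a sketch under assumptions: it relies on a madvise() flag
named MADV_FREE with the semantics described in this message (Linux
gained such a flag much later, in kernel 4.5), and hint_pages_free() and
free_and_reuse_demo() are hypothetical helper names.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical allocator helper: tell the kernel that [addr, addr+len)
 * holds no useful data, while keeping the address range mapped for reuse.
 * Returns 0 on success, -1 if the hint is unavailable or was rejected
 * (the pages then simply keep their contents).
 */
int hint_pages_free(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
#ifdef MADV_FREE
    return madvise(addr, len, MADV_FREE);
#else
    return -1;
#endif
}

/*
 * Map one anonymous page, fill it, hint it free, then reuse it.
 * Returns 0 if the page was reusable and its first byte was either the
 * old value or zero, the only two outcomes the described semantics allow.
 */
int free_and_reuse_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);           /* the allocation is in use */
    (void)hint_pages_free(p, len);  /* free(3) would do this for a hole */

    unsigned char seen = p[0];      /* old data or a fresh zero page */
    int ok = (seen == 0xAA || seen == 0x00);

    p[0] = 0x55;                    /* reuse: no new mmap() is needed */
    ok = ok && p[0] == 0x55;

    munmap(p, len);
    return ok ? 0 : -1;
}
```

The point of the design is that the application never has to guess
whether memory is tight: it issues the hint and carries on, and only
the kernel decides whether the page is actually taken away.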
Applications run fast if they have enough memory.  Everything is happy.

> ... except it doesn't schedule the "freed" pages for
> disposal ahead of other pages in the system.  but that should be easy
> enough to add once the semantics are nailed down and the bugs have been
> eliminated.

It's not clear you'd want to do that.  There is a cost for every "freed"
page disposed of, so you don't want to dispose of them ahead of other
pages.

> ok, i don't understand why you think this.  and besides, free(3) doesn't
> shrink the heap currently, i believe.  this would work if free(3) used
> sbrk() to shrink the heap in an intelligent fashion, freeing kernel VM
> resources along the way.  if you want something to help free(3), i would
> favor this design instead.

free(3) already uses sbrk() to shrink the heap at the end.  It's not
usable for the typical 1/3 of memory which becomes holes in the heap.

Yes the idea is to modify free(3) to permit the kernel to reclaim memory
that is free in the application.  However, none of sbrk() _or_
MADV_DONTNEED _or_ MADV_ZERO _or_ mmap(/dev/zero) have the desired
effect.

It has to be a win for the application to call this function -- and
it's a loss to zero pages as soon as you free them.  But it's relatively
cheap to just mark the pages as "reclaimable" without losing them.

enjoy,
-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* MADV_DONTNEED
  2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever
  2000-03-21  1:20   ` madvise (MADV_FREE) Jamie Lokier
@ 2000-03-21  1:29   ` Jamie Lokier
  2000-03-22 17:04     ` MADV_DONTNEED Chuck Lever
  2000-03-21  1:47   ` Extensions to mincore Jamie Lokier
  2000-03-21  1:50   ` MADV flags as mmap options Jamie Lokier
  3 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-21  1:29 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

Hi Chuck

About MADV_DONTNEED
-------------------

> > In particular, using the name MADV_DONTNEED is a really bad idea.  It
> > means completely different things on different OSes.  For example your
> > meaning of MADV_DONTNEED is different to BSD's: a program that assumes
> > the BSD behaviour may well crash with your implementation and will
> > almost certainly give invalid results if it doesn't crash.
> 
> i'm more concerned about portability from operating systems like Solaris,
> because there are many more server applications there than on *BSD that
> have been designed to use these interfaces.
...
> my preference is for the DU semantic of tossing dirty data instead of
> flushing onto backing store, simply because that's what so many
> applications expect DONTNEED to do.

That's interesting.  When I saw MADV_DONTNEED, I immediately assumed it
was the natural counterpoint to MADV_WILLNEED.  Useful even for
sequential accesses, to say "my streaming window has moved beyond this
point".  Do you agree that a counterpoint to MADV_WILLNEED is useful?

The names are so similar, I consider using MADV_DONTNEED to mean "trash
this memory" quite misleading.  (If there was no MADV_WILLNEED I
wouldn't mind).

> i'm not saying the *BSD way is wrong, but i think it would be a more
> useful compromise to make *BSD functionality available via some other
> interface (like MADV_ZERO).

You got it the wrong way around.  MADV_ZERO is more like what your
implementation of MADV_DONTNEED does.  The BSD behaviour is nothing like
MADV_ZERO.  BSD simply means "increment the paging priority" -- the
page contents are unchanged.

BSD's behaviour is the obvious counterpoint to MADV_WILLNEED afaict.

> as far as i can tell, linux's msync(MS_INVALIDATE) behaves like freeBSD's
> MADV_DONTNEED.

Doesn't look like that.

1. MS_INVALIDATE only works on file mappings -- BSD's MADV_DONTNEED is
   defined (if you believe the documentation) for any mapping.

2. The msync() manual page doesn't agree with you, but I'm not sure
   about the implementation.  The manual says:

       MS_INVALIDATE asks to invalidate  other  mappings  of  the
       same file (so that they can be updated with the fresh values
       just written).

   The implementation seems to invalidate _this_ mapping.
   Either way, they are different from BSD's MADV_DONTNEED.

3. Your MADV_DONTNEED does different things to msync(MS_INVALIDATE).

Actually I like what MADV_DONTNEED does, but I would like it to have a
different name to avoid potentially dangerous ambiguity with BSD's
meaning.  If Linux MADV_DONTNEED were just a hint it would be fine, but
it actively trashes memory.

By the way, Linux MADV_DONTNEED does some of the things
msync(MS_INVALIDATE) does but not others (in the implementation --
ignore the man page).

Can you explain how the two things differ?  I.e., why does MS_INVALIDATE
fiddle with swap cache pages.  Does this indicate a bug in your
MADV_DONTNEED implementation?

> MADV_ZERO makes sense to me as an efficient way to zero a range of
> addresses in a mapping.  but i think it's useful as a *separate* function,
> not as combined with, say, MADV_DONTNEED.

Agreed.  I mention DONTNEED only because some OS's documentation of
DONTNEED appears to be equivalent to MADV_ZERO.  And of course, on a
mapping of /dev/zero they are equivalent.

To be honest, the MADV_DONTNEED behaviour on private mappings is
probably much more useful than zeroing a range anyway.  You've always
got read(/dev/zero) for the latter.
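
As a concrete version of the read(/dev/zero) approach, a minimal sketch
follows.  zero_range() is a hypothetical helper name; whether the kernel
turns the read into a page remap rather than a copy depends on the
mapping and is not something this code controls.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Zero [buf, buf+len) by reading from /dev/zero.
 * Returns 0 on success, -1 on error.
 */
int zero_range(void *buf, size_t len)
{
    unsigned char *p = buf;
    int fd;

    assert(buf != NULL);
    fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry the read */
            close(fd);
            return -1;
        }
        if (n == 0)
            break;              /* not expected from /dev/zero */
        p += n;
        len -= (size_t)n;
    }

    close(fd);
    return len == 0 ? 0 : -1;
}
```

Unlike the MADV_DONTNEED behaviour under discussion, this works on any
writable memory, page-aligned or not.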

enjoy,
-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Extensions to mincore
  2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever
  2000-03-21  1:20   ` madvise (MADV_FREE) Jamie Lokier
  2000-03-21  1:29   ` MADV_DONTNEED Jamie Lokier
@ 2000-03-21  1:47   ` Jamie Lokier
  2000-03-21  9:11     ` Eric W. Biederman
  2000-03-21  1:50   ` MADV flags as mmap options Jamie Lokier
  3 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-21  1:47 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

> > [Aside: is there the possibility to have mincore return the "!accessed"
> > and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned
> > bytes?  I can imagine a bunch of garbage collection algorithms that
> > could make good use of those bits.  Currently some GC systems mprotect()
> > regions and unprotect them on SEGV -- simply reading the !dirty status
> > would obviously be much simpler and faster.]
> 
> you could add that; the question is how to do it while not breaking
> applications that do this:
> 
> if (!byte) {
>    /* page not present */
> }
> 
> rather than checking the LSB specifically.

The comment says:

    The status is returned in a vector of bytes.  The least significant
    bit of each byte is 1 if the referenced page is in memory, otherwise
    it is zero.

Solaris (SunOS 5.6) extends this with:

     The settings of other bits in each character are undefined and may
     contain other information in future implementations.

So I think you're quite safe extending the information.
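
A caller that follows the documented contract tests only the least
significant bit, so it keeps working if other bits gain a meaning later.
The sketch below shows that check; page_is_resident() and
touched_page_is_resident() are hypothetical helper names, and the
mincore() prototype is the Linux one.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if the page at addr is in core, 0 if not, -1 on error.
 * Only bit 0 of the status byte is examined; other bits are ignored.
 */
int page_is_resident(void *addr)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char vec = 0;

    assert(((uintptr_t)addr % page) == 0);  /* must be page aligned */
    if (mincore(addr, page, &vec) != 0)
        return -1;
    return vec & 1;
}

/* Map one anonymous page, touch it, and report what mincore() says. */
int touched_page_is_resident(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int r;

    if (p == MAP_FAILED)
        return -1;
    p[0] = 1;                   /* fault the page in */
    r = page_is_resident(p);
    munmap(p, page);
    return r;
}
```

Code written as "if (!byte)" gives the same answer today, but would
misreport a non-resident page as resident as soon as any other bit in
the byte were set.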

> i think using "dirty" instead of "!dirty" would help.

In a GC system you're looking to skip pages which are "definitely
clean".  "Definitely dirty" isn't very interesting, however "maybe
dirty" is.

Given that the default value from mincore is 0 (say for an older
kernel), it should mean "maybe dirty".  Hence !dirty.

> the "accessed" bit is only used by the shrink_mmap logic to "time out"
> a page as memory gets short; i'm not sure that's a semantic that is
> useful to a user-level garbage collector?  and it probably isn't very
> portable.

For a garbage collector that can move objects, it has uses in suggesting
how to efficiently repack objects, to reduce the resident set size of
the process.

There are also a number of user-space paging systems (e.g. one was once
proposed for the special relocated .exe mappings in Wine), which would
benefit from this information the same way as the kernel does.

You could indicate that these values are "exact" by another bit which is
always set if you are able to provide dirty and accessed bits.  Then
the polarity doesn't really matter.

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* MADV flags as mmap options
  2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever
                     ` (2 preceding siblings ...)
  2000-03-21  1:47   ` Extensions to mincore Jamie Lokier
@ 2000-03-21  1:50   ` Jamie Lokier
  3 siblings, 0 replies; 55+ messages in thread
From: Jamie Lokier @ 2000-03-21  1:50 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

While we're here :-)

It seems to me that a lot of the time, madvise() will be called
immediately after mmap() on the same region.

How about making the MADV_ flags distinct from the MAP_ flags, and
arranging that you may pass MADV_ flags to mmap().  If it sees any, it
does the mapping and follows it by the corresponding madvise_vma call.

(Only really useful for MADV_RANDOM and MADV_SEQUENTIAL).
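
Until something like that exists, the same effect is available from user
space with a thin wrapper.  This is only a sketch of the two-call
sequence being described; mmap_advised() is a hypothetical name, not an
existing libc or kernel interface.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical wrapper: create a mapping and immediately apply one
 * madvise() hint to all of it.  Returns the mapping, or MAP_FAILED with
 * errno set if either step fails.
 */
void *mmap_advised(size_t len, int prot, int flags, int fd, off_t off,
                   int advice)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, prot, flags, fd, off);
    if (p == MAP_FAILED)
        return MAP_FAILED;

    if (madvise(p, len, advice) != 0) {
        int saved = errno;      /* keep the madvise() error for the caller */
        munmap(p, len);
        errno = saved;
        return MAP_FAILED;
    }
    return p;
}
```

A caller that intends to stream through a file once would pass
MADV_SEQUENTIAL as the last argument.  Doing it inside mmap() itself
would save one system call and one VMA lookup.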

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-21  1:20   ` madvise (MADV_FREE) Jamie Lokier
@ 2000-03-21  2:24     ` William J. Earl
  2000-03-21 14:08       ` Jamie Lokier
  2000-03-22 16:24     ` Chuck Lever
  1 sibling, 1 reply; 55+ messages in thread
From: William J. Earl @ 2000-03-21  2:24 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm

Jamie Lokier writes:
...
 > You assume a page is not likely to swap, because there's a reasonable
 > chance the application will reallocate it before that happens.  So on
 > balance, giving pages unconditionally to the kernel is a loss.
 > 
 > --> No sane free(3) would call MADV_DONTNEED or msync(MS_INVALIDATE).
 > 
 > A better application allocator would base decisions about when to return
 > pages to the kernel on the likelihood of swapping and measured cost of
 > swapping vs. retaining pages.  Of course that's very difficult and
 > system specific.  And really only the kernel has access to all the
 > information on memory pressure.
...

     I have been asked by some application people to have free() use
MADV_DONTNEED or the equivalent in selected cases, specifically when
the memory allocated is large, in order to free up the physical and
virtual (swap space) memory for other uses.  If the application uses
very large chunks of memory, giving it back entirely is a win.  The
application could be recoded to do its own mmap() of /dev/zero and
munmap(), but would prefer that this behavior be automatic.  Of course,
MADV_DONTNEED does not apply in the case of mmap()/munmap() of /dev/zero,
but it is not implausible to give up virtual memory.  Note that
I am not claiming one should do anything of the sort for small
allocations.

     If you have, say, 256 MB of memory and 256 MB of swap, and you
use 384 MB of memory in your application, you cannot even fork()
without giving up some of it.  Many serious applications at least
reserve large amounts of memory (even if they do not touch all of
it on every run).  

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21  1:47   ` Extensions to mincore Jamie Lokier
@ 2000-03-21  9:11     ` Eric W. Biederman
  2000-03-21  9:40       ` lars brinkhoff
  2000-03-21 11:34       ` Stephen C. Tweedie
  0 siblings, 2 replies; 55+ messages in thread
From: Eric W. Biederman @ 2000-03-21  9:11 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm

Jamie Lokier <jamie.lokier@cern.ch> writes:

> > > [Aside: is there the possibility to have mincore return the "!accessed"
> > > and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned
> > > bytes?  I can imagine a bunch of garbage collection algorithms that
> > > could make good use of those bits.  Currently some GC systems mprotect()
> > > regions and unprotect them on SEGV -- simply reading the !dirty status
> > > would obviously be much simpler and faster.]

No it wouldn't.  

Dirty kernel wise means the page needs to be swapped out. Clean kernel
wise means the page is in the swap cache, and hasn't been written
since it was swapped in.

Dirty GC wise means the page has changed since the last GC pass over it.

It is very easy to conceive of a case where a dirty GC'd page is swapped
out, and then swapped in before someone got to looking at it.  So
kernel Clean/Dirty has no connection with GC Clean/Dirty.

Please, please don't mess with this for a 2.4 timeframe.

Eric

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21  9:11     ` Eric W. Biederman
@ 2000-03-21  9:40       ` lars brinkhoff
  2000-03-21 11:34       ` Stephen C. Tweedie
  1 sibling, 0 replies; 55+ messages in thread
From: lars brinkhoff @ 2000-03-21  9:40 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Jamie Lokier, Chuck Lever, linux-mm

"Eric W. Biederman" wrote:
> Jamie Lokier <jamie.lokier@cern.ch> writes:
> > > > [Aside: is there the possibility to have mincore return the "!accessed"
> > > > and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned
> > > > bytes?  I can imagine a bunch of garbage collection algorithms that
> > > > could make good use of those bits.  Currently some GC systems mprotect()
> > > > regions and unprotect them on SEGV -- simply reading the !dirty status
> > > > would obviously be much simpler and faster.]
> 
> Dirty kernel wise means the page needs to be swapped out. Clean kernel
> wise means the page is in the swap cache, and hasn't been written
> since it was swapped in.
> 
> Dirty GC wise means the page has changed since the last GC pass over it.
> 
> It is very easy to conceive of a case where a dirty GC'd page is swapped
> out, and then swapped in before someone got to looking at it.  So
> kernel Clean/Dirty has no connection with GC Clean/Dirty.
> 
> Please, please don't mess with this for a 2.4 timeframe.

For user-space paging, it would be great to know the kernel sense of the
clean/dirty status of pages.  Perhaps something to be considered for 2.5.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21  9:11     ` Eric W. Biederman
  2000-03-21  9:40       ` lars brinkhoff
@ 2000-03-21 11:34       ` Stephen C. Tweedie
  2000-03-21 15:15         ` Jamie Lokier
  1 sibling, 1 reply; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-21 11:34 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Jamie Lokier, Chuck Lever, linux-mm

Hi,

On Tue, Mar 21, 2000 at 03:11:16AM -0600, Eric W. Biederman wrote:
> Jamie Lokier <jamie.lokier@cern.ch> writes:
> 
> > > > [Aside: is there the possibility to have mincore return the "!accessed"
> > > > and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned
> > > > bytes?  I can imagine a bunch of garbage collection algorithms that
> > > > could make good use of those bits.  Currently some GC systems mprotect()
> > > > regions and unprotect them on SEGV -- simply reading the !dirty status
> > > > would obviously be much simpler and faster.]
> 
> Dirty kernel wise means the page needs to be swapped out. Clean kernel
> wise means the page is in the swap cache, and hasn't been written
> since it was swapped in.

Worse than that, returning dirty status bits in mincore() just wouldn't 
work for threads.  mincore() is a valid optimisation when you just treat
it as a hint: if a page gets swapped out between calling mincore() and 
using the page, nothing breaks, you just get an extra page fault.  

The same is not true for the sort of garbage collection or distributed
memory mechanisms which use mprotect().  If you find that a page is clean
via mincore() and discard the data based on that, there is nothing to 
stop another thread from dirtying the data after the mincore() and losing
its modification.  mprotect() has the advantage of holding page table
locks so it can do an atomic read-modify-write on the page table entries.
Without that locking, you just can't reliably use dirty/accessed
information.
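
For reference, the mprotect() scheme being compared against looks
roughly like this in user space.  It is a single-threaded sketch that
tracks one page; start_tracking() and tracking_demo() are hypothetical
names, and calling mprotect() from a signal handler is common practice
on Linux rather than something POSIX promises to be safe.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *tracked;          /* one tracked page, for brevity */
static size_t tracked_len;
static volatile sig_atomic_t page_dirty;

/* Write fault on the tracked page: record it, then allow the write. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    unsigned char *addr = si->si_addr;

    (void)sig;
    (void)ctx;
    if (tracked && addr >= tracked && addr < tracked + tracked_len) {
        page_dirty = 1;
        mprotect(tracked, tracked_len, PROT_READ | PROT_WRITE);
        return;                 /* the faulting store is retried */
    }
    signal(SIGSEGV, SIG_DFL);   /* not ours: the retry will be fatal */
}

/* Install the handler and write-protect a freshly populated page. */
int start_tracking(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    tracked_len = (size_t)sysconf(_SC_PAGESIZE);
    assert(tracked_len > 0);
    tracked = mmap(NULL, tracked_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tracked == MAP_FAILED) {
        tracked = NULL;
        return -1;
    }
    tracked[0] = 1;             /* populate before protecting */
    page_dirty = 0;
    return mprotect(tracked, tracked_len, PROT_READ);
}

/* Returns 0 if exactly the first write after protection was recorded. */
int tracking_demo(void)
{
    volatile unsigned char *p;
    int before, after;

    if (start_tracking() != 0)
        return -1;
    p = tracked;
    before = page_dirty;
    p[0] = 42;                  /* faults once; the handler marks it dirty */
    after = page_dirty;
    return (before == 0 && after == 1 && p[0] == 42) ? 0 : -1;
}
```

The write cannot slip past unnoticed here because the page stays
read-only until the handler has recorded the fault, which is the
property a plain read of dirty bits through mincore() would lack.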

--Stephen

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-21  2:24     ` William J. Earl
@ 2000-03-21 14:08       ` Jamie Lokier
  0 siblings, 0 replies; 55+ messages in thread
From: Jamie Lokier @ 2000-03-21 14:08 UTC (permalink / raw)
  To: William J. Earl; +Cc: Chuck Lever, linux-mm

William J. Earl wrote:
>      I have been asked by some application people to have free() use
> MADV_DONTNEED or the equivalent in selected cases, specifically when
> the memory allocated is large, in order to free up the physical and
> virtual (swap space) memory for other uses.  If the application uses
> very large chunks of memory, giving it back entirely is a win.  The
> application could be recoded to do its own mmap() of /dev/zero and
> munmap(), but would prefer that this behavior be automatic.  Of course,
> MADV_DONTNEED does not apply in the case of mmap()/munmap() of /dev/zero,
> but it is not implausible to give up virtual memory.  Note that
> I am not claiming one should do anything of the sort for small
> allocations.

Take a look at Glibc's malloc/free, which is the only one we care about
for Linux.  Glibc's malloc uses mmap() of /dev/zero for large
allocations automatically.  You can change the threshold if you like.
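
As a sketch of changing that threshold: glibc exposes it through
mallopt() with the M_MMAP_THRESHOLD parameter (128 kB by default, as far
as I know).  large_alloc_demo() is a hypothetical name, and the #ifdef
is there because other C libraries may not define the parameter.

```c
#include <assert.h>
#include <malloc.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Lower the mmap threshold, then make one allocation at or above it.
 * With glibc such a request is served by its own mmap() and handed back
 * to the kernel by munmap() as soon as it is freed.
 * Returns 0 if the allocation succeeded, -1 otherwise.
 */
int large_alloc_demo(size_t threshold, size_t request)
{
    char *p;

    assert(request >= threshold);
#ifdef M_MMAP_THRESHOLD
    (void)mallopt(M_MMAP_THRESHOLD, (int)threshold);  /* glibc-specific */
#endif
    p = malloc(request);
    if (p == NULL)
        return -1;

    memset(p, 0, request);      /* use the memory */
    free(p);                    /* an mmap-backed chunk is unmapped here */
    return 0;
}
```

For example, large_alloc_demo(64 * 1024, 256 * 1024) asks for every
request of 64 kB or more to get its own mapping.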

However, assuming this was not the case, even your application would
benefit more from MADV_FREE than MADV_DONTNEED.  MADV_DONTNEED forces a
non-trivial minimum recycling cost, whereas MADV_FREE allows the cost to
be balanced between the kernel and the application, according to the
current paging situation.

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21 11:34       ` Stephen C. Tweedie
@ 2000-03-21 15:15         ` Jamie Lokier
  2000-03-21 15:41           ` Stephen C. Tweedie
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-21 15:15 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Eric W. Biederman, Chuck Lever, linux-mm

Eric W. Biederman wrote:
> > > > [Aside: is there the possibility to have mincore return the
> > > > "!accessed" and "!dirty" bits of each page, perhaps as bits 1
> > > > and 2 of the returned bytes?  I can imagine a bunch of garbage
> > > > collection algorithms that could make good use of those bits.
> > > > Currently some GC systems mprotect() regions and unprotect them
> > > > on SEGV -- simply reading the !dirty status would obviously be
> > > > much simpler and faster.]
> 
> No it wouldn't.  

Yes it would.

> Dirty kernel wise means the page needs to be swapped out. Clean kernel
> wise means the page is in the swap cache, and hasn't been written
> since it was swapped in.
> 
> Dirty GC wise means the page has changed since the last GC pass over it.

Of course, I thought that was obvious :-)

You're right, that for GC the "!dirty" bit has to mean "since the last
time we called mincore".

To get the correct behaviour without maintaining extra state in the
kernel (apart from a bit or two per struct page), you'd say that mincore
returns "!dirty since the last time _anyone_ called mincore on this
page", and you'd disallow it for shared mappings.

It works for threads too.

All threads sharing a page have to synchronise their mincore calls for
that page, but that situation is no different to the SEGV method: all
threads have to synchronise with the information collected from that,
too.

Stephen C. Tweedie wrote:
> Worse than that, returning dirty status bits in mincore() just wouldn't 
> work for threads.  mincore() is a valid optimisation when you just treat
> it as a hint: if a page gets swapped out between calling mincore() and 
> using the page, nothing breaks, you just get an extra page fault.  

[Aside: I regard this as a bug.  mincore() should have an option to set
the accessed bit on each page that is in core, to avoid the "just
missed" condition.  If it sets the accessed bit, then under most
circumstances the just missed condition will never happen.  If it does
not (it doesn't now), the just missed condition will always happen
sometimes under the slightest non-zero paging load.  The difference for
an application that does "call mincore; if not in core, spawn thread to
pull in page" under low system load will be between no stalls and
occasional stalls.  Thus mincore() is missing a flag parameter IMO]
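
For reference, a small sketch of the existing call used as a hint, assuming
Linux and a page-aligned address.  Only bit 0 of each returned byte is
defined (page resident); the accessed/dirty bits and the flag parameter
discussed in this thread are proposals and do not appear here.  The helper
name resident_pages is made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the pages of [addr, addr+len) that mincore() reports as resident.
 * addr must be page-aligned.  Returns -1 on error. */
long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t n = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(n);
    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += vec[i] & 1;             /* only bit 0 is defined */
    free(vec);
    return count;
}
```

The answer can be stale by the time the caller looks at it, which is the
"just missed" window described above.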

> The same is not true for the sort of garbage collection or distributed
> memory mechanisms which use mprotect().  If you find that a page is clean
> via mincore() and discard the data based on that, there is nothing to 
> stop another thread from dirtying the data after the mincore() and losing
> its modification.

In general, you have to be very careful about what you allow other
threads to modify during GC.  For a full collection, some kind of
synchronisation point with everyone is usually required.

(Disclaimer: I am not a GC expert so if you know of GC mechanisms that
use mprotect and don't require threads to be synchronised, please speak up!)

1. Stop all the other threads, copy the state of their roots
   (i.e. processor registers, individual stack roots), call mprotect(),
   restart the threads, and let SEGVs mprotect() pages back to writable
   status while putting them on a list.  Watch out for concurrent SEGVs
   on the same page!

   Disadvantage: lots of SEGV handling, SEGV code is processor specific
   (until siginfo is reliable), lots of individual page mprotect calls,
   lots of vmas, page fault slowdown even for non-GC-using threads due
   to all the tiny vmas.

1a. Using mincore(): call mincore() instead of mprotect() in method 1.
    Threads are stopped so it just works :-)

    Advantage: everything runs faster and the code is more portable
    (among Linux systems).

2. Method 1 has a large mprotect() call.  Quite apart from the slowness
   of all that mprotect/SEGV processing, the single large mprotect may
   take a while during which all threads are blocked, and it also
   prevents any threads not involved in GC from faulting.  (As you say,
   it grabs the page table lock).

   You can call mprotect() first to protect the GC arena, with threads
   still running.  At this point, you're _not_ using it to collect dirty
   page information.  When mprotect() returns, you synchronise all
   threads to gather local GC roots, and then start collecting dirty
   page info via SEGVs.  If a thread gets a SEGV before the
   synchronisation point, it is blocked until the synchronisation
   point.  In this way, threads not writing to the arena don't get
   stopped for long even if mprotect() itself takes a long time.

2a. Method 2 using mincore().  Now you do do mprotect() at the beginning
    -- remember it is not for collecting dirty page info here, but for
    blocking threads writing to the arena while permitting others to
    continue.

    After synchronisation, call mincore() and then mprotect() to make
    the entire arena writable.  Then restart all blocked threads.  Any
    SEGVs from the start of the first mprotect() to the end of the
    second one block the faulting thread prior to synchronisation; any
    that block are restarted afterwards.

Obviously there are plenty of other ways to arrange this, with multiple
arenas etc.  But I hope you can see that mincore() can be used reliably
without requiring the overhead of individual-page mprotect and SEGVs.
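
To make the cost of the SEGV approach concrete, here is a single-threaded
sketch of the dirty-page tracking part of method 1, assuming Linux with
siginfo available.  All names (tracker_init, tracker_arm, ARENA_PAGES) are
invented for the example, and the thread stopping and root scanning are
left out.  Note that every first write to a page costs one signal and one
mprotect() call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define ARENA_PAGES 8

static unsigned char *arena;                      /* page-aligned GC arena */
static size_t page_size;
static volatile sig_atomic_t dirty[ARENA_PAGES];  /* set by the handler */

/* On a write fault inside the arena: record the page, then make it
 * writable again so the faulting store is retried and succeeds.
 * mprotect() is not on the POSIX async-signal-safe list; calling it
 * here is common practice on Linux but is an assumption of this sketch. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)arena;
    if (addr < base || addr >= base + ARENA_PAGES * page_size)
        _exit(1);                                 /* a genuine crash */
    size_t idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(arena + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Map the arena and install the handler.  Returns 0 on success. */
int tracker_init(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    page_size = (size_t)ps;
    arena = mmap(NULL, ARENA_PAGES * page_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED)
        return -1;
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Start a tracking interval: forget old dirty information and
 * write-protect the whole arena with one large mprotect(). */
int tracker_arm(void)
{
    for (size_t i = 0; i < ARENA_PAGES; i++)
        dirty[i] = 0;
    return mprotect(arena, ARENA_PAGES * page_size, PROT_READ);
}
```

After tracker_arm(), the dirty[] array plays the role of the list built by
the SEGV handler; a mincore() that returned dirty bits would replace the
handler and the per-page mprotect() calls entirely.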

> mprotect() has the advantage of holding page table locks so it can do
> an atomic read-modify-write on the page table entries.  Without that
> locking, you just can't reliably use dirty/accessed information.

mprotect() has the major disadvantage of creating a million tiny vmas
when you are using it to track dirty pages.  And as far as I can see,
mprotect/SEGV gives no advantage over the dirty bit method: in both
cases, you always need synchronisation points between threads to share the
dirty page information.

mprotect has another disadvantage: it holds the page table lock.  Great
for atomic operations; terrible when you do a large mprotect and you
_don't_ want to stop concurrent threads (that are not using the GC
arena) from page faulting their stuff.

Interestingly, neither GC synchronisation method I described depends on
mprotect() being atomic w.r.t. the whole protection change, and method 2
would actually benefit from concurrent page faults being allowed during
the mprotect().

The atomicity you mention is important.  Consider this implementation:

  1. Only private mappings allowed.
  2. A page is considered dirty "since the last mincore call" if the pte
     dirty bit is set, or if a struct page flag PageMincoreDirty is set.

To read this, you must atomically read and clear the pte's dirty bit.
(Not difficult on x86 or any UP system; I'm not sure about other SMP systems).

mincore() calls are assumed to be protected w.r.t. each other.

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21 15:15         ` Jamie Lokier
@ 2000-03-21 15:41           ` Stephen C. Tweedie
  2000-03-21 15:55             ` Jamie Lokier
  0 siblings, 1 reply; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-21 15:41 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Eric W. Biederman, Chuck Lever, linux-mm

On Tue, Mar 21, 2000 at 04:15:07PM +0100, Jamie Lokier wrote:
> > Dirty GC wise means the page has changed since the last GC pass over it.
> 
> Of course, I thought that was obvious :-)
> 
> You're right, that for GC the "!dirty" bit has to mean "since the last
> time we called mincore".

And that information is not maintained anywhere.  In fact, it basically
_can't_ be maintained, since the hardware only maintains one bit and
we already use that dirty bit.  The only way round this is to use
mprotect-style munging.

> All threads sharing a page have to synchronise their mincore calls for
> that page, but that situation is no different to the SEGV method: all
> threads have to synchronise with the information collected from that,
> too.

It's not about synchronising between mincore calls, it's about 
synchronising mincore calls on one CPU with direct memory references
modifying page tables on another CPU.

--Stephen

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21 15:41           ` Stephen C. Tweedie
@ 2000-03-21 15:55             ` Jamie Lokier
  2000-03-21 16:08               ` Stephen C. Tweedie
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-21 15:55 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Stephen C. Tweedie, Eric W. Biederman, Chuck Lever, linux-mm

Stephen C. Tweedie wrote:
> > You're right, that for GC the "!dirty" bit has to mean "since the last
> > time we called mincore".
> 
> And that information is not maintained anywhere.  In fact, it basically
> _can't_ be maintained, since the hardware only maintains one bit and
> we already use that dirty bit.  The only way round this is to use
> mprotect-style munging.

Didn't you read a few paragraphs down, where I explain how to implement
this?  You've got struct page.  It is enough for private mappings, and
we don't need this feature for shared mappings.

> > All threads sharing a page have to synchronise their mincore calls for
> > that page, but that situation is no different to the SEGV method: all
> > threads have to synchronise with the information collected from that,
> > too.
> 
> It's not about synchronising between mincore calls, it's about 
> synchronising mincore calls on one CPU with direct memory references
> modifying page tables on another CPU.

Note, for both GC synchronisation methods I described, the mincore()
call does not happen concurrently with other processors updating the
page flags.  In the first case all threads accessing the GC arena are
blocked, and in the second the entire area is write-protected during the
mincore() call.

So the synchronisation you say isn't possible isn't a required feature.
(I know it's quite easy on x86, but probably not some other CPUs).

It would be enough to say "the mincore accessed/dirty bits are not
guaranteed to be accurate if pages are accessed by concurrent threads
during the mincore call".

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21 15:55             ` Jamie Lokier
@ 2000-03-21 16:08               ` Stephen C. Tweedie
  2000-03-21 16:48                 ` Jamie Lokier
  0 siblings, 1 reply; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-21 16:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Eric W. Biederman, Chuck Lever, linux-mm

Hi,

On Tue, Mar 21, 2000 at 04:55:32PM +0100, Jamie Lokier wrote:
> 
> Didn't you read a few paragraphs down, where I explain how to implement
> this?  You've got struct page.  It is enough for private mappings, and
> we don't need this feature for shared mappings.

Umm, yes, but just saying "we'll solve synchronisation problems by 
stopping all the other threads" hardly seems like a "solution" to me:
more of a workaround of the problem!  mprotect() does work correctly
without stopping other threads.

> It would be enough to say "the mincore accessed/dirty bits are not
> guaranteed to be accurate if pages are accessed by concurrent threads
> during the mincore call".

Exactly why you need mprotect, which _does_ make the necessary 
guarantees.

Oh, and suggesting that we can obtain the dirty bit by assuming all
mappings are private doesn't work either.  Private mappings *need* a 
per-pte (NOT per-page, but per-pte) dirty bit to distinguish between 
pages shared with the underlying mapped object, and pages which have
been modified by the local process.

--Stephen


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21 16:08               ` Stephen C. Tweedie
@ 2000-03-21 16:48                 ` Jamie Lokier
  2000-03-22  7:36                   ` Eric W. Biederman
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-21 16:48 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Eric W. Biederman, Chuck Lever, linux-mm

Stephen C. Tweedie wrote:
> > Didn't you read a few paragraphs down, where I explain how to implement
> > this?  You've got struct page.  It is enough for private mappings, and
> > we don't need this feature for shared mappings.
> 
> Umm, yes, but just saying "we'll solve synchronisation problems by 
> stopping all the other threads" hardly seems like a "solution" to me:
> more of a workaround of the problem!  mprotect() does work correctly
> without stopping other threads.

It is a limitation on mincore (at present).

But I haven't thought of a GC implementation that will work without
synchronising the threads anyway.  So the limitation may not be a
problem for GC, and only GC would use this feature.

That said, the synchronisation issue is really separate from the dirty
page issue.  They're orthogonal.  There's no reason why mincore should
not have an option to synchronise with other processors, in just the
same way that mprotect does.

User space SEGV processing is horrible, per-page mprotect()
write-enabling is slow and a resource hog, and the mprotect works on
vmas instead of pages unfortunately so you get zillions of vmas.
zillions of vmas isn't good.  Try cat /proc/self/maps when you have
25000 entries :-)

Oops, I also forgot to mention that each per-page mprotect to
write-enable the page on SEGV causes horrendous SMP behaviour too.

> > It would be enough to say "the mincore accessed/dirty bits are not
> > guaranteed to be accurate if pages are accessed by concurrent threads
> > during the mincore call".
> 
> Exactly why you need mprotect, which _does_ make the necessary 
> guarantees.

It does so with utterly sucking performance too.  And not because of the
synchronisation -- but because you need 2500 separate mprotect calls and
to handle 2500 SEGV signals to detect that 10MB of pages have been
dirtied between GC runs.

mincore() can gather that info in one relatively fast system call.

It does have synchronisation issues -- on _some_ architectures.  But
they can be either documented (where they may not be a problem for GC),
or explicit synchronisation can be added for architectures that need it.

> Oh, and suggesting that we can obtain the dirty bit by assuming all
> mappings are private doesn't work either.  Private mappings *need* a 
> per-pte (NOT per-page, but per-pte) dirty bit to distinguish between 
> pages shared with the underlying mapped object, and pages which have
> been modified by the local process.

For private mappings, any page pointing to the underlying mapped object
is by definition clean.  That's easy enough to check.

Any other page has either a struct page or a swap entry that's local to
its pte.  So the mincore-dirty flag can be stored in the struct page or
the swap entry.

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Extensions to mincore
  2000-03-21 16:48                 ` Jamie Lokier
@ 2000-03-22  7:36                   ` Eric W. Biederman
  0 siblings, 0 replies; 55+ messages in thread
From: Eric W. Biederman @ 2000-03-22  7:36 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Stephen C. Tweedie, Chuck Lever, linux-mm

Jamie Lokier <jamie.lokier@cern.ch> writes:

> Stephen C. Tweedie wrote:
> > > Didn't you read a few paragraphs down, where I explain how to implement
> > > this?  You've got struct page.  It is enough for private mappings, and
> > > we don't need this feature for shared mappings.
> > 
> > Umm, yes, but just saying "we'll solve synchronisation problems by 
> > stopping all the other threads" hardly seems like a "solution" to me:
> > more of a workaround of the problem!  mprotect() does work correctly
> > without stopping other threads.
> 
> It is a limitation on mincore (at present).
> 
> But I haven't thought of a GC implementation that will work without
> synchronising the threads anyway.  So the limitation may not be a
> problem for GC, and only GC would use this feature.

Nope.  In dosemu we do the mprotect style of munging with mappings
as well.  This allows us to detect which parts of a virtual frame
buffer have been changed pretty cheaply.  I think it is actually
implemented with mmap & munmap though.  Same story....

Doing mprotect tricks in a GC algorithm is actually a pretty
stupid way to go.  Upon occasion it might be the only solution
where you can't get in and modify the code the GC algorithm
is cooperating with.  But it still won't work great.

And it is only the slower GC algorithms, those that need backwards
compatibility with languages like C, that use them.

Anyway as you have mentioned to make this work you have to add
additional state from what is already kept, and it isn't
clear exactly what would make efficient use of this state.

I won't argue that in the long run this is a bad idea.  But in
the short run, for the upcoming 2.4, I see no clear win.

For a GC that works with a SMP threaded heap you should never 
need to do that crap anyway.  You have the cost of the write lock
per object or group of objects anyway.  And it shouldn't be hard
to instrument the lock acquiring paths to mark the object dirty as
well.
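
A rough sketch of that idea, assuming POSIX threads and one lock per
object.  The struct and function names are invented for the example; a
real collector would more likely keep the mark in a side table or card
map.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct gc_object {
    pthread_mutex_t lock;
    bool dirty;              /* set by writers, cleared by the collector */
    int value;               /* placeholder payload */
};

int gc_object_init(struct gc_object *o)
{
    o->dirty = false;
    o->value = 0;
    return pthread_mutex_init(&o->lock, NULL);
}

/* Mutators take the write lock; marking the object dirty rides along
 * on the lock acquisition, so no page protection tricks are needed. */
void gc_write_lock(struct gc_object *o)
{
    int rc = pthread_mutex_lock(&o->lock);
    assert(rc == 0);
    (void)rc;
    o->dirty = true;
}

void gc_write_unlock(struct gc_object *o)
{
    pthread_mutex_unlock(&o->lock);
}

/* Collector: fetch and reset the dirty mark under the same lock. */
bool gc_test_and_clear_dirty(struct gc_object *o)
{
    pthread_mutex_lock(&o->lock);
    bool was_dirty = o->dirty;
    o->dirty = false;
    pthread_mutex_unlock(&o->lock);
    return was_dirty;
}
```

This only works where the mutator code can be changed to go through
gc_write_lock(), which is exactly the case the mprotect tricks exist to
avoid.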

> User space SEGV processing is horrible, per-page mprotect()
> write-enabling is slow and a resource hog, and the mprotect works on
> vmas instead of pages unfortunately so you get zillions of vmas.
> zillions of vmas isn't good.  Try cat /proc/self/maps when you have
> 25000 entries :-)

That's at least 97 meg of RAM being managed, and given that
we combine adjacent vmas with the same permissions probably a lot 
more.  While not unthinkable I suspect that is a pretty unlikely case.
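
The arithmetic behind these figures, assuming 4 KB pages (the usual x86
page size) and at least one page per vma.  Both helper names are made up
for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum bytes covered by `vmas` mappings of at least one page each. */
size_t min_bytes_for_vmas(size_t vmas, size_t page_size)
{
    assert(page_size > 0);
    return vmas * page_size;
}

/* Number of whole pages in `bytes`. */
size_t pages_in(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return bytes / page_size;
}
```

25000 vmas of one 4096-byte page each come to 102,400,000 bytes, or about
97.7 MB, which is the "97 meg" above.  Likewise 10 MB of dirtied pages is
2560 pages, the source of the "2500 separate mprotect calls" figure.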

> Oops, I also forgot to mention that each per-page mprotect to
> write-enable the page on SEGV causes horrendous SMP behaviour too.

> > > It would be enough the say "the mincore accessed/dirty bits are not
> > > guaranteed to be accurate if pages are accessed by concurrent threads
> > > during the mincore call".
> > 
> > Exactly why you need mprotect, which _does_ make the necessary 
> > guarantees.
> 
> It does so with utterly sucking performance too.  And not because of the
> synchronisation -- but because you need 2500 separate mprotect calls and
> to handle 2500 SEGV signals to detect that 10MB of pages have been
> dirtied between GC runs.
> 
> mincore() can gather that info in one relatively fast system call.

mincore has to use exactly the same implementation except it
might be able to get lucky, and not need to juggle vmas.

In which case it probably makes more sense to figure out how
to store the page writeable flag in the page table of a swapped
out page so mprotect does not need to break vmas....

All GC's that use mprotect & co will have sucky performance, period.
They are definitely compromise solutions.

> It does have synchronisation issues -- on _some_ architectures.  But
> they can be either documented (where they may not be a problem for GC),
> or explicit synchronisation can be added for architectures that need it.
> 
> > Oh, and suggesting that we can obtain the dirty bit by assuming all
> > mappings are private doesn't work either.  Private mappings *need* a 
> > per-pte (NOT per-page, but per-pte) dirty bit to distinguish between 
> > pages shared with the underlying mapped object, and pages which have
> > been modified by the local process.
> 
> For private mappings, any page pointing to the underlying mapped object
> is by definition clean.  That's easy enough to check.
> 
> Any other page has either a struct page or a swap entry that's local to
> its pte.  So the mincore-dirty flag can be stored in the struct page or
> the swap entry.

Again if you must please look at optimising mprotect.  If we can find
3 bits in a pte of a swapped out page we don't need to split the
vma's.   Nor do we need to change existing applications.

Plus the shared case is handled as well.  At the cost of a slightly
higher miss penalty for a page.  That sounds like a much more
reasonable thing to do than what you are proposing now.

Please feel free to tell me I'm an idiot but I think I just stumbled
upon a pretty decent idea.

Eric

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-21  1:20   ` madvise (MADV_FREE) Jamie Lokier
  2000-03-21  2:24     ` William J. Earl
@ 2000-03-22 16:24     ` Chuck Lever
  2000-03-22 18:05       ` Jamie Lokier
  2000-03-22 18:15       ` madvise (MADV_FREE) Christoph Rohland
  1 sibling, 2 replies; 55+ messages in thread
From: Chuck Lever @ 2000-03-22 16:24 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-mm

hi jamie-

ok, i think i'm getting a more clear picture of what you are thinking.

On Tue, 21 Mar 2000, Jamie Lokier wrote:
> > >    The principle here is very simple: MADV_FREE marks all the pages in
> > >    the region as "discardable", and clears the accessed and dirty bits
> > >    of those pages.
> > > 
> > >    Later when the kernel needs to free some memory, it is permitted to
> > >    free "discardable" pages immediately provided they are still not
> > >    accessed or dirty.  When vmscan is clearing the accessed and dirty
> > >    bits on pages, if they were set it must clear the "discardable" bit.
> > > 
> > >    This allows malloc() and other user space allocators to free pages
> > >    back to the system.  Unlike DU's MADV_DONTNEED, or mmapping
> > >    /dev/zero, if the system does not need the page there is no
> > >    inefficient zero-copy.  If there was, malloc() would be better off
> > >    not bothering to return the pages.
> > 
> > unless i've completely misunderstood what you are proposing, this is what
> > MADV_DONTNEED does today,
> 
> No, your MADV_DONTNEED _always_ discards the data in those pages.  That
> makes it too inefficient for application memory allocators, because they
> will often want to reuse some of the pages soon after.  You don't want
> redundant page zeroing, and you don't want to give up memory which is
> still nice and warm in the CPU's cache.  Unless the kernel has a better
> use for it than you.
> 
> MADV_FREE on the other hand simply permits the kernel to reclaim those
> pages, if it is under memory pressure.
> 
> If there is no pressure, the pages are reused by the application
> unchanged.  In this way different subsystems competing for memory get to
> share it out -- essentially the fairness mechanisms in the kernel are
> extending to application page management.  And the application hardly
> knows a thing about it.

ok, so you're asking for a lite(TM) version of DONTNEED that provides the
following hint to the kernel: "i may be finished with this page, but i may
also want to reuse it immediately."

memory allocation studies i've read show that dynamically allocated memory
objects are often re-used immediately after they are freed.  even if the
memory is being freed just before a process exits, it will be recycled
immediately by the kernel, so why use MADV_FREE if you are about to
munmap() it anyway?  finally, as you point out, the heap is generally too
fragmented to return page-sized chunks of it to the kernel, especially if
you consider that glibc uses *multiple* subheaps to reduce lock contention
in multithreaded applications.  it seems to me that normal page aging will
adequately identify these pages and flush them out.

if the application needs to recycle areas of a virtual address space
immediately, why should the kernel be involved at all?  i think even doing
an MADV_FREE during arbitrary free() operations would be more overhead
than you really want. in other words, i don't think free() as it exists
today harms performance in the ways you describe.

thus, either the application keeps the memory, or it is really completely
finished with it -- MADV_DONTNEED.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: MADV_DONTNEED
  2000-03-21  1:29   ` MADV_DONTNEED Jamie Lokier
@ 2000-03-22 17:04     ` Chuck Lever
  2000-03-22 17:10       ` MADV_DONTNEED Stephen C. Tweedie
  2000-03-22 17:43       ` MADV_DONTNEED Jamie Lokier
  0 siblings, 2 replies; 55+ messages in thread
From: Chuck Lever @ 2000-03-22 17:04 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-mm

hi jamie-

On Tue, 21 Mar 2000, Jamie Lokier wrote:
> > > In particular, using the name MADV_DONTNEED is a really bad idea.  It
> > > means completely different things on different OSes.  For example your
> > > meaning of MADV_DONTNEED is different to BSD's: a program that assumes
> > > the BSD behaviour may well crash with your implementation and will
> > > almost certainly give invalid results if it doesn't crash.
> > 
> > i'm more concerned about portability from operating systems like Solaris,
> > because there are many more server applications there than on *BSD that
> > have been designed to use these interfaces.
> ...
> > my preference is for the DU semantic of tossing dirty data instead of
> > flushing onto backing store, simply because that's what so many
> > applications expect DONTNEED to do.
> 
> That's interesting.  When I saw MADV_DONTNEED, I immediately assumed it
> was the natural counterpoint to MADV_WILLNEED.

yes, i did too.  but i realized later that "will" is *not* the opposite of
"dont".

> Useful even for
> sequential accesses, to say "my streaming window has moved beyond this
> point".  Do you agree that a counterpoint to MADV_WILLNEED is useful?

if you look at the implementation of nopage_sequential_readahead, you'll
see that it doesn't use MADV_DONTNEED, but the internal implementation of
msync(MS_INVALIDATE).  i'm not completely confident in this
implementation, but my intent was to release behind, not discard data.
so, yes, a counterpoint to WILLNEED is a good idea.  perhaps that *was*
the original intent of MADV_DONTNEED, but i don't see any documentation
that ties WILLNEED and DONTNEED together, semantically.

> > i'm not saying the *BSD way is wrong, but i think it would be a more
> > useful compromise to make *BSD functionality available via some other
> > interface (like MADV_ZERO).
> 
> You got it the wrong way around.  MADV_ZERO is more like what your
> implementation of MADV_DONTNEED does.  The BSD behaviour is nothing like
> MADV_ZERO.  BSD simply means "increment the paging priority" -- the
> page contents are unchanged.
> 
> BSD's behaviour is the obvious counterpoint to MADV_WILLNEED afaict.

it is, but it's not the behavior that most applications expect.  i'd like
to have something like this, but it should probably be named MADV_FREE, or
how about MADV_WONTNEED ? :)

so we agree that both behaviors might be useful to expose to an
application.  the only question is what to name them.

function 1 (could be MADV_DISCARD; currently MADV_DONTNEED):
  discard pages.  if they are referenced again, the process causes page
  faults to read original data (zero page for anonymous maps).

function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)):
  release pages, syncing dirty data.  if they are referenced again, the
  process causes page faults to read in latest data.

function 3 (could be MADV_ZERO):
  discard pages.  if they are referenced again, the process sees C-O-W 
  zeroed pages.

function 4 (for comparison; currently munmap):
  release pages, syncing dirty data.  if they are referenced again, the
  process causes invalid memory access faults.
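
A minimal sketch of function 1 on an anonymous mapping, assuming a current
Linux kernel, where MADV_DONTNEED on private anonymous memory gives
zero-fill-on-demand pages on the next access.  The implementation under
discussion in this thread may have differed in detail, and the function
name is invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a private anonymous page reads back as zero after
 * MADV_DONTNEED, 0 if the old contents survived, -1 on error. */
int dontneed_zeroes_anonymous_page(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t len = (size_t)ps;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 0x5a;                          /* dirty the page */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }
    /* The next access faults in a fresh zero page. */
    int zeroed = (*(volatile unsigned char *)p == 0);
    munmap(p, len);
    return zeroed;
}
```

The mapping itself stays valid throughout, which is what separates this
from function 4 (munmap).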

i'm interested to hear what big database folks have to say about this.

> By the way, Linux MADV_DONTNEED does some of the things
> msync(MS_INVALIDATE) does but not others (in the implementation --
> ignore the man page).
> 
> Can you explain how the two things differ?  I.e., why does MS_INVALIDATE
> fiddle with swap cache pages.  Does this indicate a bug in your
> MADV_DONTNEED implementation?

for MADV_DONTNEED, i re-used code.  i'm not convinced that it's correct,
though, as i stated when i submitted the patch.  it may abandon swap cache
pages, and there may be some undefined interaction between file truncation
and MADV_DONTNEED.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: MADV_DONTNEED
  2000-03-22 17:04     ` MADV_DONTNEED Chuck Lever
@ 2000-03-22 17:10       ` Stephen C. Tweedie
  2000-03-22 17:32         ` MADV_DONTNEED Jamie Lokier
  2000-03-22 17:33         ` MADV_DONTNEED Jamie Lokier
  2000-03-22 17:43       ` MADV_DONTNEED Jamie Lokier
  1 sibling, 2 replies; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-22 17:10 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jamie Lokier, linux-mm, Stephen C. Tweedie

Hi,

On Wed, Mar 22, 2000 at 12:04:58PM -0500, Chuck Lever wrote:
> 
> so we agree that both behaviors might be useful to expose to an
> application.  the only question is what to name them.
> 
> function 1 (could be MADV_DISCARD; currently MADV_DONTNEED):
>   discard pages.  if they are referenced again, the process causes page
>   faults to read original data (zero page for anonymous maps).
> 
> function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)):
>   release pages, syncing dirty data.  if they are referenced again, the
>   process causes page faults to read in latest data.
> 
> function 3 (could be MADV_ZERO):
>   discard pages.  if they are referenced again, the process sees C-O-W 
>   zeroed pages.
> 
> function 4 (for comparison; currently munmap):
>   release pages, syncing dirty data.  if they are referenced again, the
>   process causes invalid memory access faults.
> 
> i'm interested to hear what big database folks have to say about this.

The requests I've seen from database vendors are specifically for
function 1 above.  I'd expect that they could live with function 3 
too, though --- perhaps the main reason they asked for 1 is that 
this is what they are used to working with on some other systems 
(I don't know offhand of anybody who implements 3: it seems an odd
thing to want to do for shared pages, and is equivalent to 1 for 
private mappings.)

--Stephen

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: MADV_DONTNEED
  2000-03-22 17:10       ` MADV_DONTNEED Stephen C. Tweedie
@ 2000-03-22 17:32         ` Jamie Lokier
  2000-03-22 17:33         ` MADV_DONTNEED Jamie Lokier
  1 sibling, 0 replies; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 17:32 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-mm

Stephen C. Tweedie wrote:
> The requests I've seen from database vendors are specifically for
> function 1 above.  I'd expect that they could live with function 3 
> too, though --- perhaps the main reason they asked for 1 is that 
> this is what they are used to working with on some other systems 
> (I don't know offhand of anybody who implements 3: it seems an odd
> thing to want to do for shared pages, and is equivalent to 1 for 
> private mappings.)

For private file mappings, 1 and 3 are different.  1 reverts pages to
the underlying object.  3 is equivalent to writing zeros over the page.

It's only for /dev/zero mappings that they are the same.
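
To make the distinction concrete, here is a minimal sketch, assuming a Linux
system where MADV_DONTNEED behaves like "function 1".  It dirties one page of a
MAP_PRIVATE file mapping, discards it, and reports what the next access sees.
The function name and the scratch path under /tmp are made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Modify one page of a MAP_PRIVATE file mapping, discard it with
 * MADV_DONTNEED, and report the byte seen on the next access.
 * Returns that byte, or -1 if any setup step fails. */
int private_file_byte_after_dontneed(void)
{
    char path[] = "/tmp/madv-demo-XXXXXX";   /* hypothetical scratch file */
    long pagesz = sysconf(_SC_PAGESIZE);
    assert(pagesz > 0);

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file lives until close */

    char *fill = malloc((size_t)pagesz);
    if (fill == NULL) {
        close(fd);
        return -1;
    }
    memset(fill, 'A', (size_t)pagesz);       /* file contents: all 'A' */
    ssize_t written = write(fd, fill, (size_t)pagesz);
    free(fill);
    if (written != (ssize_t)pagesz) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 'B';                              /* private C-O-W modification */
    int rc = madvise(p, (size_t)pagesz, MADV_DONTNEED);
    int seen = (rc == 0) ? (unsigned char)p[0] : -1;
    munmap(p, (size_t)pagesz);
    return seen;
}
```

Under function 1 the result is the file's byte ('A'); under function 3 it
would be 0.  Either way the private 'B' is gone.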

Probably nobody implements 3, but some documentation suggests
otherwise.  Digital Unix:

   MADV_DONTNEED   Do not need these pages
                   The system will free any whole pages in the specified
                   region.  All modifications will be lost and any swapped
                   out pages will be discarded.  Subsequent access to the
                   region will result in a zero-fill-on-demand fault
                                           ~~~~~~~~~~~~~~~~~~~
                   as though it is being accessed for the first time.
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                   Reserved swap space is not affected by this call.

Clearly for non-anonymous mappings, the two underlined phrases
contradict one another.  Does MADV_DONTNEED on DU zero pages in private
file mappings, or does it revert to the original file pages?

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: MADV_DONTNEED
  2000-03-22 17:10       ` MADV_DONTNEED Stephen C. Tweedie
  2000-03-22 17:32         ` MADV_DONTNEED Jamie Lokier
@ 2000-03-22 17:33         ` Jamie Lokier
  2000-03-22 17:37           ` MADV_DONTNEED Stephen C. Tweedie
  1 sibling, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 17:33 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-mm

Stephen C. Tweedie wrote:
> > function 3 (could be MADV_ZERO):
> >   discard pages.  if they are referenced again, the process sees C-O-W 
> >   zeroed pages.

Fwiw, I don't think MADV_ZERO is particularly useful.
You can just read /dev/zero over that memory range.
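
A minimal sketch of that alternative, seen from userspace only.  The helper
name is made up, and whether the kernel remaps zero pages or simply copies
zeros is an implementation detail this sketch does not depend on.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Zero `len` bytes at `addr` by reading /dev/zero over them.
 * Returns 0 on success, -1 on error. */
int zero_range_via_dev_zero(void *addr, size_t len)
{
    assert(addr != NULL);

    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return -1;

    char *p = addr;
    while (len > 0) {
        ssize_t n = read(fd, p, len);        /* may return a short count */
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    close(fd);
    return 0;
}
```

Unlike an madvise() flag, this works on any writable buffer and needs no page
alignment.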

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: MADV_DONTNEED
  2000-03-22 17:33         ` MADV_DONTNEED Jamie Lokier
@ 2000-03-22 17:37           ` Stephen C. Tweedie
  0 siblings, 0 replies; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-22 17:37 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm

Hi,

On Wed, Mar 22, 2000 at 06:33:07PM +0100, Jamie Lokier wrote:
> 
> Fwiw, I don't think MADV_ZERO is particularly useful.
> You can just read /dev/zero over that memory range.

Exactly.

--Stephen

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: MADV_DONTNEED
  2000-03-22 17:04     ` MADV_DONTNEED Chuck Lever
  2000-03-22 17:10       ` MADV_DONTNEED Stephen C. Tweedie
@ 2000-03-22 17:43       ` Jamie Lokier
  2000-03-22 21:54         ` MADV_DONTNEED Chuck Lever
  1 sibling, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 17:43 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

Chuck Lever wrote:
> > That's interesting.  When I saw MADV_DONTNEED, I immediately assumed it
> > was the natural counterpoint to MADV_WILLNEED.
> 
> yes, i did too.  but i realized later that "will" is *not* the opposite of
> "dont".

Agreed.

> if you look at the implementation of nopage_sequential_readahead, you'll
> see that it doesn't use MADV_DONTNEED, but the internal implementation of
> msync(MS_INVALIDATE).  i'm not completely confident in this
> implementation, but my intent was to release behind, not discard data.

If I knew what msync(MS_INVALIDATE) did I could think about this! :-)
But the msync documentation is unhelpful and possibly misleading.

> it is, but it's not the behavior that most applications expect.  i'd like
> to have something like this, but it should probably be named MADV_FREE, or
> how about MADV_WONTNEED ? :)

I like the name MADV_WONTNEED.  Thanks for thinking of it :-)

With that, even keeping the name MADV_DONTNEED is ok because there is a
distinction.  (But I'd prefer to rename MADV_DONTNEED to MADV_DISCARD,
to catch potential misuses).

> function 1 (could be MADV_DISCARD; currently MADV_DONTNEED):
>   discard pages.  if they are referenced again, the process causes page
>   faults to read original data (zero page for anonymous maps).

I like the name MADV_DISCARD too. :-)

> function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)):
>   release pages, syncing dirty data.  if they are referenced again, the
>   process causes page faults to read in latest data.

Oh, I see, this is what msync(MS_INVALIDATE) does :-)

> function 4 (for comparison; currently munmap):
>   release pages, syncing dirty data.  if they are referenced again, the
>   process causes invalid memory access faults.

> for MADV_DONTNEED, i re-used code.

From where?

> i'm not convinced that it's correct, though, as i stated when i
> submitted the patch.  it may abandon swap cache pages, and there may
> be some undefined interaction between file truncation and
> MADV_DONTNEED.

Oh dear -- because it's in pre2.4 already :-)
Better work out what it's supposed to do and fix it :-)

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-22 16:24     ` Chuck Lever
@ 2000-03-22 18:05       ` Jamie Lokier
  2000-03-22 21:39         ` Chuck Lever
  2000-03-22 18:15       ` madvise (MADV_FREE) Christoph Rohland
  1 sibling, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 18:05 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

Hi Chuck,

Think of this scenario:

   Allocate 20 x 20k blocks for images.
   Process images.
   Free 20 x 20k blocks (-> 100 page sized holes)
   Wait for user input.
   ...
   Allocate 20 x 20k blocks for images.
   Process images.
   Free 20 x 20k blocks.

Now, if the rest of your system (not just this app) is busy paging, the
best thing the app can do at "wait" is call MADV_DONTNEED.  But if the
rest of your system is not paging at all, the best thing the app can do
is _not_ call MADV_DONTNEED.

You see?  It doesn't matter whether you're going to reuse the pages soon.

The decision to use MADV_DONTNEED or not depends on overall system
behaviour, which the application doesn't know about.

Chuck Lever wrote:
> ok, so you're asking for a lite(TM) version of DONTNEED that provides the
> following hint to the kernel: "i may be finished with this page, but i may
> also want to reuse it immediately."

It does *not* mean "i may have finished with this page".
For free() it looks that way, but that is a special case.

It means "if you decide to swap this page out, you can skip the I/O".

The page age remains the same.  (You have MADV_WONTNEED if you want to
change the page age as well).
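
As a rough illustration of that contract, the sketch below assumes a kernel
and libc that expose a flag named MADV_FREE with these lazy semantics (later
Linux kernels do; nothing in this thread had it yet).  The function name and
return codes are invented for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Fill an anonymous private region, mark it lazily freeable, then look at
 * what a later read observes.
 * Returns 1 if the old data is still there, 0 if a zero page came back,
 * 2 if MADV_FREE is unavailable, -1 on any other error. */
int lazy_free_probe(size_t len)
{
    assert(len > 0);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    memset(p, 0x5a, len);

    int result = 2;                  /* "unsupported" until the call works */
#ifdef MADV_FREE
    if (madvise(p, len, MADV_FREE) == 0) {
        unsigned char b = p[0];      /* old data or zeros, kernel's choice */
        result = (b == 0x5a) ? 1 : (b == 0 ? 0 : -1);
    } else if (errno != EINVAL) {
        result = -1;
    }
#endif
    munmap(p, len);
    return result;
}
```

Which of 0 or 1 comes back depends on memory pressure at the time, which is
the whole point: the application does not have to know.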

We let applications decide for themselves when it's best used.  It's for
long-lived holes after memory allocation, and cached objects such as
Netscape's in-memory image and document cache.

> memory allocation studies i've read show that dynamically allocated memory
> objects are often re-used immediately after they are freed.

True for programs which are continuously allocating and freeing memory.
Not true for interactive programs waiting for the user (for example).
See the scenario I wrote at the start of this message.

> even if the memory is being freed just before a process exits, it will
> be recycled immediately by the kernel, so why use MADV_FREE if you are
> about to munmap() it anyway?

You wouldn't use it in that situation.

I am thinking of long lived processes that aren't actively allocating
and have holes in their heap.  For example Emacs, Netscape etc.

My motivation for MADV_FREE is the observation that the optimal
behaviour for programs like Emacs and Netscape is to allocate and use
lots of memory (without changing it much) if there is no swapping, but
to release memory aggressively if there is swapping.

> finally, as you point out, the heap is generally too fragmented to
> return page-sized chunks of it to the kernel, especially if you
> consider that glibc uses *multiple* subheaps to reduce lock contention
> in multithreaded applications.

Multiple subheaps help to produce page sized holes.  Larger allocations
(but not large enough to use mmap), when freed, leave page sized holes.
The holes aren't blocked because tiny allocations go on different
subheaps.

> it seems to me that normal page aging will adequately identify these
> pages and flush them out.

Exactly!  In fact page ageing is required for MADV_FREE to have any
effect.

The only effect of MADV_FREE is to eliminate the write to swap, after
page ageing has decided to flush a page.  It doesn't change the page
reclamation policy.

> if the application needs to recycle areas of a virtual address space
> immediately, why should the kernel be involved at all?

It is for long lived applications that have holes in their heap, who
aren't actively recycling.  Some memory allocators don't know if they
are about to be recycled, but some do.  It depends on the application.

> i think even doing an MADV_FREE during arbitrary free() operations
> would be more overhead than you really want. in other words, i don't
> think free() as it exists today harms performance in the ways you
> describe.

You're right, you wouldn't call MADV_FREE on every free().  Just when
you have a set of pages to free, every so often.  There are lots of
systems which can do that -- even a timer signal will do with a generic
malloc.

See for example GCC's ggc-page allocator -- every so often it decides to
free a set of pages.  And any GC system.  And any system which caches
objects in memory, for example Netscape.

> thus, either the application keeps the memory, or it is really completely
> finished with it -- MADV_DONTNEED.

MADV_FREE is, speaking generally, not for either of those situations.

It's for when the application has memory that it's _willing_ to give up,
at some cost to application performance.  For example cached objects
that can be recalculated or reread over the network.

Memory allocators are a special case of this.  Not just malloc/free, but
also garbage collecting systems.

At the moment, the kernel has a number of subsystems, and when memory is
required, it asks each subsystem to release some memory.  MADV_FREE is a
way for the kernel to include applications in memory balancing
decisions.

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-22 16:24     ` Chuck Lever
  2000-03-22 18:05       ` Jamie Lokier
@ 2000-03-22 18:15       ` Christoph Rohland
  2000-03-22 18:30         ` Jamie Lokier
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Rohland @ 2000-03-22 18:15 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jamie Lokier, linux-mm

Hi Chuck

Chuck Lever <cel@monkey.org> writes:

> ok, so you're asking for a lite(TM) version of DONTNEED that
> provides the following hint to the kernel: "i may be finished with
> this page, but i may also want to reuse it immediately."

I would say "... reuse this address space immediately and you can give
me _any_ data the next time". "Any data" means probably either the old
or a zero page.

That's the optimal strategy for the memory management modules of SAP
R/3.

> function 1 (could be MADV_DISCARD; currently MADV_DONTNEED):
>   discard pages.  if they are referenced again, the process causes page
>   faults to read original data (zero page for anonymous maps).

That would be also good.

> i'm interested to hear what big database folks have to say about this.

R/3 is not a database but probably the biggest database client. Often
much bigger than the database itself.

Greetings
		Christoph

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-22 18:15       ` madvise (MADV_FREE) Christoph Rohland
@ 2000-03-22 18:30         ` Jamie Lokier
  2000-03-23 16:56           ` Christoph Rohland
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 18:30 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: Chuck Lever, linux-mm

Christoph Rohland wrote:
> > ok, so you're asking for a lite(TM) version of DONTNEED that
> > provides the following hint to the kernel: "i may be finished with
> > this page, but i may also want to reuse it immediately."
> 
> I would say "... reuse this address space immediately and you can give
> me _any_ data the next time". "Any data" means probably either the old
> or a zero page.

For maximum performance that's right.  But Linux normally has to provide
some minimal security, so an application should only see its own data or
zeros, not an arbitrary page.

Zeroing has another advantage: you can efficiently detect it.  So you
can use it for cached memory objects too in a number of cases, not just
free memory.  (A bit from mincore would also allow detection, but not
nearly as efficiently).
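
A sketch of how a cached object could use zero detection, assuming MADV_FREE
exists with the semantics discussed in this thread and that a write to the
page cancels a pending lazy free (as later Linux kernels document).  All
names here (cache_slot, CACHE_MAGIC, the cache_slot_* helpers) are made up.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_MAGIC 0xC0FFEE01u      /* any nonzero value works */

/* One page-sized cache slot.  `magic` is nonzero only while the payload
 * is intact; a page reclaimed by the kernel comes back as all zeros. */
struct cache_slot {
    uint32_t in_use;
    uint32_t magic;
    char payload[];
};

struct cache_slot *cache_slot_new(void)
{
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE),
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

void cache_slot_fill(struct cache_slot *s, const char *data)
{
    size_t room = (size_t)sysconf(_SC_PAGESIZE) - sizeof(*s);
    assert(room > 1);
    strncpy(s->payload, data, room - 1);
    s->payload[room - 1] = '\0';
    s->magic = CACHE_MAGIC;
}

/* Best-effort hint: the kernel may drop this page instead of swapping it. */
void cache_slot_release(struct cache_slot *s)
{
    s->in_use = 0;
#ifdef MADV_FREE
    (void)madvise(s, (size_t)sysconf(_SC_PAGESIZE), MADV_FREE);
#endif
}

/* Returns 1 if the cached payload survived, 0 if it must be rebuilt. */
int cache_slot_reacquire(struct cache_slot *s)
{
    volatile struct cache_slot *v = s;
    v->in_use = 1;                   /* write first, so the page is kept */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* GCC/Clang builtin */
    return v->magic == CACHE_MAGIC;
}
```

The check is one load per page, with no system call.  An object spanning
several pages would need one such marker per page.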

> That's the optimal strategy for the memory management modules of SAP R/3.

Excellent!  A hard core recommendation :-)

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-22 18:05       ` Jamie Lokier
@ 2000-03-22 21:39         ` Chuck Lever
  2000-03-22 22:31           ` Jamie Lokier
  2000-03-22 22:33           ` Stephen C. Tweedie
  0 siblings, 2 replies; 55+ messages in thread
From: Chuck Lever @ 2000-03-22 21:39 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-mm

On Wed, 22 Mar 2000, Jamie Lokier wrote:
> Think of this scenario:
> 
>    Allocate 20 x 20k blocks for images.
>    Process images.
>    Free 20 x 20k blocks (-> 100 page sized holes)
>    Wait for user input.
>    ...
>    Allocate 20 x 20k blocks for images.
>    Process images.
>    Free 20 x 20k blocks.
> 
> Now, if the rest of your system (not just this app) is busy paging, the
> best thing the app can do at "wait" is call MADV_DONTNEED.  But if the
> rest of your system is not paging at all, the best thing the app can do
> is _not_ call MADV_DONTNEED.
> 
> You see?  It doesn't matter whether you're going to reuse the pages soon.
> 
> The decision to use MADV_DONTNEED or not depends on overall system
> behaviour, which the application doesn't know about.
> 
> > ok, so you're asking for a lite(TM) version of DONTNEED that provides the
> > following hint to the kernel: "i may be finished with this page, but i may
> > also want to reuse it immediately."
> 
> It does *not* mean "i may have finished with this page".
> For free() it looks that way, but that is a special case.
> 
> It means "if you decide to swap this page out, you can skip the I/O".
> 
> The page age remains the same.  (You have MADV_WONTNEED if you want to
> change the page age as well).
> 
> We let applications decide for themselves when it's best used.  It's for
> long-lived holes after memory allocation, and cached objects such as
> Netscape's in-memory image and document cache.

we have several generic applications we are interested in optimizing:

1.  memory allocators can indicate pages that are not in use

2.  applications that need to cache large files or big pieces of data that
can be regenerated relatively cheaply

3.  applications that need to buffer data to control precisely its
movement to and from permanent storage.

now, for 1:

several studies i've read indicate that the average size of a dynamically
allocated object is in the range of 40 bytes.  if an application is
screwing with much bigger objects, it should probably manage the objects
differently (use mmap explicitly, tweak malloc, or something like that).

in fact, i'd say it is safe in general to lower DEFAULT_MMAP_THRESHOLD to
the system page size.  that way you'd get closer to the behavior you're
after, and you'd also win a much bigger effective heap size when
allocating large objects, because you can only allocate up to 960M of a
process's address space with sbrk().

on Linux with glibc, you can use mallopt to do this. something like:

	mallopt(M_MMAP_THRESHOLD, getpagesize());
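
Spelled out as a complete helper, assuming glibc (mallopt and
M_MMAP_THRESHOLD are glibc extensions; the function name is made up):

```c
#include <assert.h>
#include <malloc.h>
#include <unistd.h>

/* Ask glibc's malloc to serve every request of a page or more from its
 * own anonymous mmap() instead of the sbrk() heap.
 * Returns 1 on success and 0 on error, following mallopt's convention. */
int lower_mmap_threshold_to_page_size(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    assert(pagesz > 0);
    return mallopt(M_MMAP_THRESHOLD, (int)pagesz);
}
```

Call it once, early, before the allocations it is meant to affect.  For
comparison, the glibc documentation gives the default threshold as 128K.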

for 2:

note carefully that my implementation of MADV_DONTNEED doesn't evict data
from memory.  it simply tears down page mappings.  this will result in a
minor fault if the application immediately reaccesses the address, or a
major fault if the application accesses the address after the page
contents have finally been evicted from physical memory.

to say this another way, the page mapping binds a virtual address to a
page in the page cache. MADV_DONTNEED simply removes that binding.  
normal page aging will discover the unbound pages in the page cache and
remove them.  so really, MADV_DONTNEED is actually disconnected from the
mechanism of swapping or discarding the page's data.

there are probably nicer ways to do this, but there it is.

i think this is exactly what you want for cached files.  the application
can say "DONTNEED" this data, and the system is free to reclaim it as
necessary.  if the application accesses it again later, it will get the
old data back.  just be sure that if you change data in the file, you
explicitly sync it back to disk.

for 3:

this area of memory is probably going to be mapped from /dev/zero, and
pinned.  it's a nice way to get a clear page if you just re-read /dev/zero
into that page.

> > it seems to me that normal page aging will adequately identify these
> > pages and flush them out.
> 
> Exactly!  In fact page ageing is required for MADV_FREE to have any
> effect.
> 
> The only effect of MADV_FREE is to eliminate the write to swap, after
> page ageing has decided to flush a page.  It doesn't change the page
> reclamation policy.

ok, here is where i'm confused.  i don't think MADV_DONTNEED and MADV_FREE
are different -- they both work this way.

> > i think even doing an MADV_FREE during arbitrary free() operations
> > would be more overhead than you really want. in other words, i don't
> > think free() as it exists today harms performance in the ways you
> > describe.
> 
> You're right, you wouldn't call MADV_FREE on every free().  Just when
> you have a set of pages to free, every so often.  There are lots of
> systems which can do that -- even a timer signal will do with a generic
> malloc.

nah, i still say a better way to handle this case is to lower malloc's
"use an anon map instead of the heap" threshold to 4K or 8K.  right now
it's 32K by default.  

> At the moment, the kernel has a number of subsystems, and when memory is
> required, it asks each subsystem to release some memory.  MADV_FREE is a
> way for the kernel to include applications in memory balancing
> decisions.

like adding another separate call in do_try_to_free_pages that trolls
applications for free-able pages; except with MADV_FREE and MADV_DONTNEED,
you're causing shrink_mmap to do this for you automatically.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: MADV_DONTNEED
  2000-03-22 17:43       ` MADV_DONTNEED Jamie Lokier
@ 2000-03-22 21:54         ` Chuck Lever
  2000-03-22 22:41           ` MADV_DONTNEED Jamie Lokier
  0 siblings, 1 reply; 55+ messages in thread
From: Chuck Lever @ 2000-03-22 21:54 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-mm

On Wed, 22 Mar 2000, Jamie Lokier wrote:
> > if you look at the implementation of nopage_sequential_readahead, you'll
> > see that it doesn't use MADV_DONTNEED, but the internal implementation of
> > msync(MS_INVALIDATE).  i'm not completely confident in this
> > implementation, but my intent was to release behind, not discard data.
> 
> If I knew what msync(MS_INVALIDATE) did I could think about this! :-)
> But the msync documentation is unhelpful and possibly misleading.

well, the doc's accurate, as far as i can tell.  but my use of it is a
side-effect of the behavior described in the man page.

> > function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)):
> >   release pages, syncing dirty data.  if they are referenced again, the
> >   process causes page faults to read in latest data.
> 
> Oh, I see, this is what msync(MS_INVALIDATE) does :-)

more or less.  it removes the mappings, but also schedules writes for any
dirty pages it finds.
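
For reference, this is roughly what the call looks like from userspace: a
sketch that dirties a MAP_SHARED page, calls msync() with MS_INVALIDATE, and
reads the byte back through the file descriptor.  It only shows that the
dirty data reaches the file; whether the mapping is torn down afterwards is
kernel-specific and not visible here.  The function name and scratch path
are made up.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the first byte of the file after a shared-mapping write and
 * msync(MS_SYNC | MS_INVALIDATE), or -1 on error. */
int shared_byte_after_msync(void)
{
    char path[] = "/tmp/msync-demo-XXXXXX";  /* hypothetical scratch file */
    long pagesz = sysconf(_SC_PAGESIZE);
    assert(pagesz > 0);

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, pagesz) != 0) {        /* one zero-filled page */
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    p[0] = 'Z';                              /* dirty the shared page */
    int seen = -1;
    if (msync(p, (size_t)pagesz, MS_SYNC | MS_INVALIDATE) == 0) {
        char c;
        if (pread(fd, &c, 1, 0) == 1)        /* read back via the file */
            seen = (unsigned char)c;
    }
    munmap(p, (size_t)pagesz);
    close(fd);
    return seen;
}
```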

> > function 4 (for comparison; currently munmap):
> >   release pages, syncing dirty data.  if they are referenced again, the
> >   process causes invalid memory access faults.
> 
> > for MADV_DONTNEED, i re-used code.
> 
> From where?

you can find logic that invokes zap_page_range throughout the mm code, but
especially in do_munmap.  if my implementation is broken in this regard,
then i'd bet do_munmap is broken too.

> > i'm not convinced that it's correct, though, as i stated when i
> > submitted the patch.  it may abandon swap cache pages, and there may
> > be some undefined interaction between file truncation and
> > MADV_DONTNEED.
> 
> Oh dear -- because it's in pre2.4 already :-)
> Better work out what it's supposed to do and fix it :-)

it's not too serious, i hope, since madvise is not used by any existing
Linux apps.  this area of the kernel has been changing so much in the past
6-9 months that it's been difficult to know what is the blessed way to get
my implementation to work.

it now works in the simple cases.  i'm waiting to hear about real world
usage.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-22 21:39         ` Chuck Lever
@ 2000-03-22 22:31           ` Jamie Lokier
  2000-03-22 22:44             ` Stephen C. Tweedie
  2000-03-23 18:53             ` Chuck Lever
  2000-03-22 22:33           ` Stephen C. Tweedie
  1 sibling, 2 replies; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 22:31 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

> > > it seems to me that normal page aging will adequately identify these
> > > pages and flush them out.
> > 
> > Exactly!  In fact page ageing is required for MADV_FREE to have any
> > effect.
> > 
> > The only effect of MADV_FREE is to eliminate the write to swap, after
> > page ageing has decided to flush a page.  It doesn't change the page
> > reclamation policy.
> 
> ok, here is where i'm confused.  i don't think MADV_DONTNEED and MADV_FREE
> are different -- they both work this way.

No they don't.  MADV_DONTNEED always discards private modifications.
(BTW I think it should be flushing the swap cache while it's at it).

MADV_FREE only discards private modifications when there is paging
pressure to do so.  The decisions to do so are deferred, for
architectures that support this.  (Includes x86).

Chuck Lever wrote:
> 1.  memory allocators can indicate pages that are not in use
> 
> now, for 1:
> 
> several studies i've read indicate that the average size of a dynamically
> allocated object is in the range of 40 bytes.  if an application is
> screwing with much bigger objects, it should probably manage the objects
> differently (use mmap explicitly, tweak malloc, or something like that).

The average object size is skewed towards small numbers because there
are usually many more small objects, allocated at a higher rate.  It
only takes a few larger objects to lead to holes, but they don't count
in the "average size" statistic because the time spent in the memory
allocator for larger objects isn't significant.

MADV_FREE isn't to optimise the time spent in a memory allocator.  It's
to optimise overall system performance.

And that is for a subset of applications.  Yes, by all means tweak
malloc.  Tweak it to call MADV_FREE :-)

> in fact, i'd say it is safe in general to lower DEFAULT_MMAP_THRESHOLD to
> the system page size.  that way you'd get closer to the behavior you're
> after, and you'd also win a much bigger effective heap size when
> allocating large objects, because you can only allocate up to 960M of a
> process's address space with sbrk().

A fine way to make performance suck.

Application heap fragmentation now appears as vma fragmentation -> that
means expect to see hundreds or more vmas.  Lost memory due to rounding
to a page size is now also unusable.

Even if you manage to save memory, performance sucks.  A system call for
every medium size allocation and deallocation?  You gotta be kidding.
And now even normal page faults take longer because of the extra vmas.

You've just optimised for the minimum RAM, maximum paging case.

> > You're right, you wouldn't call MADV_FREE on every free().  Just when
> > you have a set of pages to free, every so often.  There are lots of
> > systems which can do that -- even a timer signal will do with a generic
> > malloc.
> 
> nah, i still say a better way to handle this case is to lower malloc's
> "use an anon map instead of the heap" threshold to 4K or 8K.  right now
> it's 32K by default.  

Try it.  I expect the malloc author chose a high threshold after
extensive measurements -- that malloc implementation is the result of a
series of implementations and studies.  Do you know that Glibc's malloc
also limits the total number of mmaps?  I believe that's because
performance plummets when you have too many vmas.

And even if we didn't use vmas or system calls, even if mmap were a
straightforward function call to ultra-fast code, explicitly returning
the memory to the kernel implies a significant overhead -- you're
forcing unnecessary clear_page() calls.

> 2.  applications that need to cache large files or big pieces of data that
> can be regenerated relatively cheaply
> 
> note carefully that my implementation of MADV_DONTNEED doesn't evict data
> from memory.  it simply tears down page mappings.  this will result in a
> minor fault if the application immediately reaccesses the address, or a
> major fault if the application accesses the address after the page
> contents have finally been evicted from physical memory.
> 
> to say this another way, the page mapping binds a virtual address to a
> page in the page cache. MADV_DONTNEED simply removes that binding.  
> normal page aging will discover the unbound pages in the page cache and
> remove them.  so really, MADV_DONTNEED is actually disconnected from the
> mechanism of swapping or discarding the page's data.

Let's see... zap_page_range.  That looks like the private modification
is discarded.

That's not what MADV_FREE does.  MADV_FREE does _not_ discard private
modifications unless they're reclaimed due to memory pressure.  And that
decision is magically deferred.

And that's what you want for caching calculated structures in an
application.  They are private mappings which will be zeroed _if_ (and
only if) the kernel decides there is pressure to use the memory
elsewhere.

> i think this is exactly what you want for cached files.

For reading a file, yes.  For a locally generated structure, such as a
parsed file, no.  BTW, I am sure that Netscape's "memory cache" is the
latter -- because they have "disk cache" for the former.

> the application can say "DONTNEED" this data, and the system is free
> to reclaim it as necessary.  if the application accesses it again
> later, it will get the old data back.  just be sure that if you change
> data in the file, you explicitly sync it back to disk.

You say "the system is free to reclaim it".  MADV_DONTNEED _forces_ the
system to reclaim the data, if it is not in swap cache at the time.

For a locally calculated structure in an anonymous mapping, you don't
get the data back.  (Yes, this means "cached files".  Sorry if I made it
sound like mapped files).

> 3.  applications that need to buffer data to control precisely its
> movement to and from permanent storage.
>
> for 3:
> 
> this area of memory is probably going to be mapped from /dev/zero, and
> pinned.  it's a nice way to get a clear page if you just re-read /dev/zero
> into that page.

Um.  I don't see how that response has anything to do with 3 :-)

> > At the moment, the kernel has a number of subsystems, and when memory is
> > required, it asks each subsystem to release some memory.  MADV_FREE is a
> > way for the kernel to include applications in memory balancing
> > decisions.
> 
> like adding another separate call in do_try_to_free_pages that trolls
> applications for free-able pages; except with MADV_FREE and MADV_DONTNEED,
> you're causing shrink_mmap to do this for you automatically.

It should be added to vmscan and/or shrink_mmap.  The rough outline is:
MADV_FREE clears the pte accessed bit and marks the page as freeable.
Later, on finding one of these pages during the normal scans, just dump
the page if it is still not accessed.  If it has been accessed, it's no
longer freeable.

There are some interactions with the swap cache and vmscan algorithm I
have glossed over...

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-22 21:39         ` Chuck Lever
  2000-03-22 22:31           ` Jamie Lokier
@ 2000-03-22 22:33           ` Stephen C. Tweedie
  2000-03-22 22:45             ` Jamie Lokier
  1 sibling, 1 reply; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-22 22:33 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jamie Lokier, linux-mm, Stephen C. Tweedie

Hi,

On Wed, Mar 22, 2000 at 04:39:12PM -0500, Chuck Lever wrote:
> 
> in fact, i'd say it is safe in general to lower DEFAULT_MMAP_THRESHOLD to
> the system page size.  that way you'd get closer to the behavior you're
> after, and you'd also win a much bigger effective heap size when
> allocating large objects, because you can only allocate up to 960M of a
> process's address space with sbrk().

You can use MADV_DONTNEED to reclaim demand-zero pages below sbrk()
even without using memory map in the first place, and I understand that
recent versions of glibc will resort to extending the heap with mmap()
automatically once sbrk() reaches its limit.  So, I don't think that
decreasing DEFAULT_MMAP_THRESHOLD really gains that much.

> 
> to say this another way, the page mapping binds a virtual address to a
> page in the page cache. MADV_DONTNEED simply removes that binding.  
> normal page aging will discover the unbound pages in the page cache and
> remove them.  so really, MADV_DONTNEED is actually disconnected from the
> mechanism of swapping or discarding the page's data.

Not for anonymous pages, where the pte reference is the _only_ reference
to the page (except for swap-cached pages).  In this case, MADV_DONTNEED
will genuinely free the page.

> nah, i still say a better way to handle this case is to lower malloc's
> "use an anon map instead of the heap" threshold to 4K or 8K.  right now
> it's 32K by default.  

No, it's much cheaper to do a MADV_DONTNEED when freeing an anonymous
page: that way the pageout and subsequent demand-zero pagein all happen
entirely within the page tables, without having to perform lots of
operations on the vma tree of the process.
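
A minimal sketch of that sequence from user space, assuming Linux
behaviour for private anonymous memory (the helper name
dontneed_zero_fills is made up for illustration):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Dirty one anonymous page, drop it with MADV_DONTNEED, and check
 * that the next access sees a demand-zero page.  The mapping (vma)
 * itself is never touched.  Returns 1 if the page reads back as
 * zeros, 0 if it does not, -1 on setup failure.
 */
int dontneed_zero_fills(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;
    size_t len = (size_t)pagesz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);                    /* private modification */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {       /* faults in a zero page */
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }
    munmap(p, len);
    return all_zero;
}
```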

--Stephen


* Re: MADV_DONTNEED
  2000-03-22 21:54         ` MADV_DONTNEED Chuck Lever
@ 2000-03-22 22:41           ` Jamie Lokier
  2000-03-23 19:13             ` MADV_DONTNEED James Antill
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 22:41 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

Chuck Lever wrote:
> > If I knew what msync(MS_INVALIDATE) did I could think about this! :-)
> > But the msync documentation is unhelpful and possibly misleading.
> 
> well, the doc's accurate, as far as i can tell.  but my use of it is a
> side-effect of the behavior described in the man page.

	"MS_INVALIDATE asks to invalidate  other  mappings  of  the
       same file (so that they can be updated with the fresh val-
       ues just written)."

Oh I see.  It means the locally modified but in principle shared mapping
is copied back to the underlying object.  For a page aligned mapping
that shouldn't need to do anything.

Since the MS_INVALIDATE code doesn't modify other ptes, we must assume
the other mappings are all page aligned or they wouldn't see the
update.

So why does MS_INVALIDATE have any code? :-)

> > > function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)):
> > >   release pages, syncing dirty data.  if they are referenced again, the
> > >   process causes page faults to read in latest data.
> > 
> > Oh, I see, this is what msync(MS_INVALIDATE) does :-)
> 
> more or less.  it removes the mappings, but also schedules writes for any
> dirty pages it finds.

I think "schedules writes" is what MS_ASYNC and MS_SYNC do,
independently of MS_INVALIDATE.

> > > function 4 (for comparison; currently munmap):
> > >   release pages, syncing dirty data.  if they are referenced again, the
> > >   process causes invalid memory access faults.
> > 
> > > for MADV_DONTNEED, i re-used code.
> > 
> > From where?
> 
> you can find logic that invokes zap_page_range throughout the mm code, but
> especially in do_munmap.  if my implementation is broken in this regard,
> then i'd bet do_munmap is broken too.

do_munmap also calls vm_ops->unmap before the zap_page_range, which has
potentially important side effects for files...  Like actually writing
the data :-)

That's not what, say, MADV_DISCARD would do, but it's what "release
pages, syncing dirty data" should do.

> > > i'm not convinced that it's correct, though, as i stated when i
> > > submitted the patch.  it may abandon swap cache pages, and there may
> > > be some undefined interaction between file truncation and
> > > MADV_DONTNEED.
> > 
> > Oh dear -- because it's in pre2.4 already :-)
> > Better work out what it's supposed to do and fix it :-)
> 
> it's not too serious, i hope, since madvise is not used by any existing
> Linux apps.  this area of the kernel has been changing so much in the past
> 6-9 months that it's been difficult to know what is the blessed way to get
> my implementation to work.

Quite.  I'm not so concerned about the implementation at this stage as
getting agreement on the right semantics!

-- Jamie

* Re: madvise (MADV_FREE)
  2000-03-22 22:31           ` Jamie Lokier
@ 2000-03-22 22:44             ` Stephen C. Tweedie
  2000-03-23 18:53             ` Chuck Lever
  1 sibling, 0 replies; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-22 22:44 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm, Stephen C. Tweedie

Hi,

On Wed, Mar 22, 2000 at 11:31:47PM +0100, Jamie Lokier wrote:
> 
> No they don't.  MADV_DONTNEED always discards private modifications.
> (BTW I think it should be flushing the swap cache while it's at it).

If it is the last user of the page --- ie. if PG_SwapCache is set and
the refcount of the page is one --- then it will do so anyway, because
when I added that swap cache code I made sure that zap_page_range()
does a free_page_and_swap_cache() when freeing pages.

--Stephen


* Re: madvise (MADV_FREE)
  2000-03-22 22:33           ` Stephen C. Tweedie
@ 2000-03-22 22:45             ` Jamie Lokier
  2000-03-22 22:48               ` Stephen C. Tweedie
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 22:45 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-mm

Stephen C. Tweedie wrote:
> > to say this another way, the page mapping binds a virtual address to a
> > page in the page cache. MADV_DONTNEED simply removes that binding.  
> > normal page aging will discover the unbound pages in the page cache and
> > remove them.  so really, MADV_DONTNEED is actually disconnected from the
> > mechanism of swapping or discarding the page's data.
> 
> Not for anonymous pages, where the pte reference is the _only_ reference
> to the page (except for swap-cached pages).  In this case, MADV_DONTNEED
> will genuinely free the page.

Doesn't this also result in a swap-cache leak, or are orphan swap-cache
pages reclaimed eventually?

> > nah, i still say a better way to handle this case is to lower malloc's
> > "use an anon map instead of the heap" threshold to 4K or 8K.  right now
> > it's 32K by default.  
> 
> No, it's much cheaper to do a MADV_DONTNEED when freeing an anonymous
> page: that way the pageout and subsequent demand-zero pagein all happen
> entirely within the page tables, without having to perform lots of
> operations on the vma tree of the process.

And it's even cheaper to do MADV_FREE so you skip demand-zeroing if
memory pressure doesn't require that.

-- Jamie

* Re: madvise (MADV_FREE)
  2000-03-22 22:45             ` Jamie Lokier
@ 2000-03-22 22:48               ` Stephen C. Tweedie
  2000-03-22 22:55                 ` Q. about swap-cache orphans Jamie Lokier
  0 siblings, 1 reply; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-22 22:48 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm, Stephen C. Tweedie

Hi,

On Wed, Mar 22, 2000 at 11:45:31PM +0100, Jamie Lokier wrote:
> 
> Doesn't this also result in a swap-cache leak, or are orphan swap-cache
> pages reclaimed eventually?

The shrink_mmap() page cache reclaimer is able to pick up any orphaned 
swap cache pages.

> And it's even cheaper to do MADV_FREE so you skip demand-zeroing if
> memory pressure doesn't require that.

Right.

--Stephen

* Q. about swap-cache orphans
  2000-03-22 22:48               ` Stephen C. Tweedie
@ 2000-03-22 22:55                 ` Jamie Lokier
  2000-03-22 22:58                   ` Stephen C. Tweedie
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-22 22:55 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-mm

[This is just a question to help my understanding, not relevant to madvise]

Stephen C. Tweedie wrote:
> If it is the last user of the page --- ie. if PG_SwapCache is set and
> the refcount of the page is one --- then it will do so anyway, because
> when I added that swap cache code I made sure that zap_page_range()
> does a free_page_and_swap_cache() when freeing pages.

I.e., zap_page_range makes sure that MADV_DONTNEED won't leave orphan
swap-cache pages.

> > Doesn't this also result in a swap-cache leak, or are orphan swap-cache
> > pages reclaimed eventually?
> 
> The shrink_mmap() page cache reclaimer is able to pick up any orphaned 
> swap cache pages.

But there won't be any orphans, will there?
Or do they appear due to async. swapping situations?

thanks,
-- Jamie

* Re: Q. about swap-cache orphans
  2000-03-22 22:55                 ` Q. about swap-cache orphans Jamie Lokier
@ 2000-03-22 22:58                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-22 22:58 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm, Stephen C. Tweedie

Hi,

On Wed, Mar 22, 2000 at 11:55:45PM +0100, Jamie Lokier wrote:
> [This is just a question to help my understanding, not relevant to madvise]
> 
> Stephen C. Tweedie wrote:
> > If it is the last user of the page --- ie. if PG_SwapCache is set and
> > the refcount of the page is one --- then it will do so anyway, because
> > when I added that swap cache code I made sure that zap_page_range()
> > does a free_page_and_swap_cache() when freeing pages.
> 
> I.e., zap_page_range makes sure that MADV_DONTNEED won't leave orphan
> swap-cache pages.

Not quite, but very nearly.  There are a few minor places where the 
refcount on a page is bumped up temporarily, so zap_page_range is
theoretically able to be confused into thinking that there are extra
references, and that the swap cache should remain.  However, that is
still correct behaviour, because the shrink_mmap() code will seek and
destroy the remaining swap cache references if that happens.

> > The shrink_mmap() page cache reclaimer is able to pick up any orphaned 
> > swap cache pages.
> 
> But there won't be any orphans, will there?
> Or do they appear due to async. swapping situations?

Yes, but it's harmless.

--Stephen


* Re: madvise (MADV_FREE)
  2000-03-22 18:30         ` Jamie Lokier
@ 2000-03-23 16:56           ` Christoph Rohland
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Rohland @ 2000-03-23 16:56 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Rohland, Chuck Lever, linux-mm

Jamie Lokier <jamie.lokier@cern.ch> writes:
> Christoph Rohland wrote:
> > > ok, so you're asking for a lite(TM) version of DONTNEED that
> > > provides the following hint to the kernel: "i may be finished
> > > with this page, but i may also want to reuse it immediately."
> > 
> > I would say "... reuse this address space immediately and you can
> > give me _any_ data the next time". "Any data" means probably
> > either the old or a zero page.
> 
> For maximum performance that's right.  But Linux normally has to
> provide some minimal security, so an application should only see its
> own data or zeros, not an arbitrary page.

That was the reason for "...probably either the old or a zero page"

> Zeroing has another advantage: you can efficiently detect it.  So
> you can use it for cached memory objects too in a number of cases,
> not just free memory.  (A bit from mincore would also allow
> detection, but not nearly as efficiently).
> 
> > That's the optimal strategy for the memory management modules of
> > SAP R/3.
> 
> Excellent!  A hard core recommendation :-)

:-)

Greetings
		Christoph

* Re: madvise (MADV_FREE)
  2000-03-22 22:31           ` Jamie Lokier
  2000-03-22 22:44             ` Stephen C. Tweedie
@ 2000-03-23 18:53             ` Chuck Lever
  2000-03-24  0:00               ` /dev/recycle Jamie Lokier
  2000-03-24  0:21               ` madvise (MADV_FREE) Jamie Lokier
  1 sibling, 2 replies; 55+ messages in thread
From: Chuck Lever @ 2000-03-23 18:53 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-mm

On Wed, 22 Mar 2000, Jamie Lokier wrote:
> > > The only effect of MADV_FREE is to eliminate the write to swap, after
> > > page ageing has decided to flush a page.  It doesn't change the page
> > > reclamation policy.
> > 
> > ok, here is where i'm confused.  i don't think MADV_DONTNEED and MADV_FREE
> > are different -- they both work this way.
> 
> No they don't.  MADV_DONTNEED always discards private modifications.
> (BTW I think it should be flushing the swap cache while it's at it).
> 
> MADV_FREE only discards private modifications when there is paging
> pressure to do so.  The decisions to do so are deferred, for
> architectures that support this.  (Includes x86).

i still don't see a big difference.  the private modifications, in both
cases, won't be written to swap.  in both cases, the application cannot
rely on the contents of these pages after the madvise call.

for private mappings, pages are freed immediately by DONTNEED; FREE will
cause the pages to be freed later if the system is low on memory.  that's
six of one, half dozen of the other.  freeing later may mean the
application saves a little time now, but freeing immediately could mean
postponing a low memory scenario, and would allow the system to reuse a
page that is still in hardware caches.

> > nah, i still say a better way to handle this case is to lower malloc's
> > "use an anon map instead of the heap" threshold to 4K or 8K.  right now
> > it's 32K by default.  
> 
> Try it.  I expect the malloc author chose a high threshold after
> extensive measurements -- that malloc implementation is the result of a
> series of implementations and studies.  Do you know that Glibc's malloc
> also limits the total number of mmaps?  I believe that's because
> performance plummets when you have too many vmas.

the AVL tree structure helps this.  there is still a linear search in the
number of vmas to find unused areas in a virtual address space.  this
makes mmap significantly slower when there are a large number of vmas.
i'll bet some clever person on this list could create a data structure
that fixes this problem.

but you said before that the number of small dynamically allocated objects
dwarfs the number of large objects.  so either there is a problem here, or
there isn't! :)  can this be any worse than mprotect?

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


* Re: MADV_DONTNEED
  2000-03-22 22:41           ` MADV_DONTNEED Jamie Lokier
@ 2000-03-23 19:13             ` James Antill
  0 siblings, 0 replies; 55+ messages in thread
From: James Antill @ 2000-03-23 19:13 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm

> Chuck Lever wrote:
> > > If I knew what msync(MS_INVALIDATE) did I could think about this! :-)
> > > But the msync documentation is unhelpful and possibly misleading.
> > 
> > well, the doc's accurate, as far as i can tell.  but my use of it is a
> > side-effect of the behavior described in the man page.
> 
> 	"MS_INVALIDATE asks to invalidate  other  mappings  of  the
>        same file (so that they can be updated with the fresh val-
>        ues just written)."
> 
> Oh I see.  It means the locally modified but in principle shared mapping
> is copied back to the underlying object.  For a page aligned mapping
> that shouldn't need to do anything.
> 
> Since the MS_INVALIDATE code doesn't modify other ptes, we must assume
> the other mappings are all page aligned or they wouldn't see the
> update.
> 
> So why does MS_INVALIDATE have any code? :-)

 I've used this in Solaris when mmap()'ing over NFS.

 Ie. You'd msync(MS_SYNC) on the NFS writer, and msync(MS_INVALIDATE)
on the readers.
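
 A sketch of that writer/reader pattern, collapsed into one process with
two mappings of the same file.  The helper name msync_roundtrip and the
temporary file are made up for illustration, and a local file will not
show the NFS staleness that made MS_INVALIDATE necessary in the first
place:

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * "Writer" mapping stores data and calls msync(MS_SYNC); "reader"
 * mapping calls msync(MS_INVALIDATE) before looking at the data.
 * Returns 1 if the reader sees the new bytes, 0 if not, -1 on error.
 */
int msync_roundtrip(void)
{
    char path[] = "/tmp/msync-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file vanishes on close */

    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0 || ftruncate(fd, (off_t)pagesz) != 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)pagesz;

    char *w = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *r = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    int result = -1;

    if (w != MAP_FAILED && r != MAP_FAILED) {
        strcpy(w, "fresh");
        if (msync(w, len, MS_SYNC) == 0 &&        /* writer side */
            msync(r, len, MS_INVALIDATE) == 0)    /* reader side */
            result = (strcmp(r, "fresh") == 0);
    }
    if (w != MAP_FAILED)
        munmap(w, len);
    if (r != MAP_FAILED)
        munmap(r, len);
    close(fd);
    return result;
}
```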

 The Linux documentation I have is the same as Jamie's and says
_other_ mappings, but maybe that's just a typo (I'm pretty sure
INVALIDATE on solaris guaranteed that your mapping was invalid as
well).

-- 
James Antill -- james@and.org
"If we can't keep this sort of thing out of the kernel, we might as well
pack it up and go run Solaris." -- Larry McVoy.

* /dev/recycle
  2000-03-23 18:53             ` Chuck Lever
@ 2000-03-24  0:00               ` Jamie Lokier
  2000-03-24  9:14                 ` /dev/recycle Christoph Rohland
  2000-03-28  0:48                 ` /dev/recycle Chuck Lever
  2000-03-24  0:21               ` madvise (MADV_FREE) Jamie Lokier
  1 sibling, 2 replies; 55+ messages in thread
From: Jamie Lokier @ 2000-03-24  0:00 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

This discussion needs to split into two: one about memory allocators
responding to overall system memory pressure, and another about
applications caching recomputable objects, which also want to respond
to system memory pressure.

The issues are different and the requirements are different.
Perhaps trying to use the name MADV_FREE to cover them both is just
confusing.

For the record, I'm going to talk about memory allocators, and the
subject has changed to reflect that.

So hi Chuck!  I've thought of something maybe better than MADV_FREE for
memory allocators.  It's neat, it's simple, it's cute...  But first I'll
explain MADV_FREE a bit more.

Chuck Lever wrote:
> > MADV_FREE only discards private modifications when there is paging
> > pressure to do so.  The decisions to do so are deferred, for
> > architectures that support this.  (Includes x86).
> 
> i still don't see a big difference.  the private modifications, in both
> cases, won't be written to swap.  in both cases, the application cannot
> rely on the contents of these pages after the madvise call.

Correct.  The difference is that with MADV_FREE, clear_page() operations
are skipped when there's no memory pressure from the kernel.

> for private mappings, pages are freed immediately by DONTNEED; FREE will
> cause the pages to be freed later if the system is low on memory.  that's
> six of one, half dozen of the other.  freeing later may mean the
> application saves a little time now,

It may save time overall -- if the page is next reused by the
application before the kernel recycles it.  Note that nobody, neither
the application nor the kernel, knows in advance if this will be the
case.

> but freeing immediately could mean postponing a low memory scenario,
> and would allow the system to reuse a page that is still in hardware
> caches.

The system is free to reuse MADV_FREE pages immediately if it wishes --
the system doesn't lose here.  In fact if you're already low on memory
at the time madvise() is called, the kernel would reclaim as many pages
as it needs immediately, just as if you'd called MADV_DONTNEED for those
pages.  The remainder get marked reclaimable.
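
What an allocator would be allowed to assume can be sketched as below.
This assumes a kernel and libc headers that define MADV_FREE with the
semantics described here; nothing in this thread guarantees that, so the
code reports "unavailable" when the flag is missing or rejected.  The
helper name madv_free_old_or_zero is made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if, after MADV_FREE, every byte of the page is either its
 * old value or zero; 0 if anything else shows up; -1 if MADV_FREE is
 * unavailable or a call fails.
 */
int madv_free_old_or_zero(void)
{
#ifndef MADV_FREE
    return -1;                          /* headers lack the flag */
#else
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;
    size_t len = (size_t)pagesz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);
    if (madvise(p, len, MADV_FREE) != 0) {  /* e.g. kernel rejects it */
        munmap(p, len);
        return -1;
    }

    /* With no memory pressure the old bytes usually survive, so no
     * clear_page() was paid for; under pressure we would see zeros. */
    size_t old = 0, zero = 0;
    for (size_t i = 0; i < len; i++) {
        if (p[i] == 0xAA)
            old++;
        else if (p[i] == 0)
            zero++;
    }
    munmap(p, len);
    return (old + zero == len) ? 1 : 0;
#endif
}
```

Either outcome is acceptable to a malloc implementation, since it never
relies on the contents of memory it has already handed to free().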

Look at it from the point of view of an application writer.  Why would I
ever call MADV_DONTNEED for anything but large memory areas?  It
penalises my application on systems that aren't swapping.  (Though
MADV_FREE is also a penalty, but a smaller one.)

> but you said before that the number of small dynamically allocated objects
> dwarfs the number of large objects.  so either there is a problem here, or
> there isn't! :)

We're talking about free areas, not objects :-) Think of the kernel,
specifically only the memory managed by kmalloc/slab.  It handles lots
of small allocations, but nevertheless produces free pages which the
kernel can use when there's memory pressure.

But anyway...

Better than MADV_FREE: /dev/recycle
--------------------------------------------------

What about this whacky idea?

MAP_RECYCLE|MAP_ANON initially allocates pages like MAP_ANON.  Mapping
/dev/recycle is similar (but subtly different).

MADV_DONTNEED or munmap discard private modifications, but record this
process as the page owner.  If the process later accesses the page, a
page is allocated again but the MAP_RECYCLE means it may return a page
already marked as belonging to this process without clearing it.

That's better for app allocators than MADV_FREE: they're giving the
kernel more freedom with not much loss in performance.  And the kernel
likes this too -- no need for vmscan to release references, as the pages
are free already.

-- Jamie

* Re: madvise (MADV_FREE)
  2000-03-23 18:53             ` Chuck Lever
  2000-03-24  0:00               ` /dev/recycle Jamie Lokier
@ 2000-03-24  0:21               ` Jamie Lokier
  2000-03-24  7:21                 ` lars brinkhoff
  1 sibling, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-24  0:21 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

On the dirty bit
................

And then Chuck moved onto a different topic, mincore...
> can this be any worse than mprotect?

Do you really imagine an application having to handle 1000 SEGV signals,
and call mprotect() for one page per SEGV, and the kernel locking the mm
thereby causing soft fault contention for other threads, is fast? :-)

<ahem>, but enough sensationalism from me.  I went and looked at some
papers -- and found a rather annoying problem with mprotect, for general
purpose GCs[1]:

	"The resulting write faults were caught as UNIX signals and
	recorded.  Various Portable Common Runtime interfaces to SunOS
	system calls were modified so as to preclude unrecoverable
	faults in system calls."

Ouch!  You can't use the mprotect() method with read().  mincore would
be just fine.  So you can't make a conservative collector that works
with a third-party library unless you're willing to write wrappers for
all the system calls that touch user memory.  ioctl() for a hairy
example.
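
The read() problem is easy to reproduce.  A minimal sketch, assuming
Linux behaviour (the helper name read_into_protected_page is made up):
the kernel's copy into a write-protected buffer fails with EFAULT
rather than delivering a SEGV that a collector's handler could fix up.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if read() into a read-only page fails with EFAULT,
 * 0 if it behaves otherwise, -1 on setup failure.
 */
int read_into_protected_page(void)
{
    int fds[2];
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0 || pipe(fds) != 0)
        return -1;
    size_t len = (size_t)pagesz;

    /* Stands in for a heap page the collector has write-protected. */
    char *buf = mmap(NULL, len, PROT_READ,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int result = -1;

    if (buf != MAP_FAILED && write(fds[1], "x", 1) == 1) {
        errno = 0;
        ssize_t n = read(fds[0], buf, 1);   /* no signal is raised */
        result = (n == -1 && errno == EFAULT) ? 1 : 0;
    }
    if (buf != MAP_FAILED)
        munmap(buf, len);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```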

On the matter of timing, when that paper was written (1991), continued
from above:

	"The primary cost of this is that the first time a page in the
	heap is written after a garbage collection, a signal must be
	caught and a system call must be executed to unprotect the
	page.  The cost of this is variable, but in our environment
	appears to be somewhat less than half a millisecond per page
	written."

As a counterpoint, Boehm has this to say on the subject of getting dirty
bits from the OS[2].  See 3:

	"We keep track of modified pages using one of three distinct
	mechanisms:

	1. Through explicit mutator cooperation. Currently this requires the
	   use of GC_malloc_stubborn. 
	2. By write-protecting physical pages and catching write faults. This
           is implemented for many Unix-like systems and for win32. It is not
           possible in a few environments. 
	3. By retrieving dirty bit information from /proc. (Currently only
           Sun's Solaris supports this. Though this is considerably cleaner,
           performance may actually be better with mprotect and signals.)"

Well, I guess we will never know until it has been tried, but it looks
like it should be experimented with by someone writing a garbage
collector before it becomes a standard kernel feature.  I really don't
like the way mprotect breaks syscalls though, even if it performs well.
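
For reference, method 2 from that list looks roughly like the sketch
below.  It is an illustration under assumptions, not Boehm's code: the
names (NPAGES, on_segv, count_dirty_after_writes) are made up, it
handles one thread only, and it calls mprotect() from a signal handler,
which works on Linux but is not on the POSIX list of async-signal-safe
functions.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 8

static unsigned char *heap_base;
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];

/* First write to a protected page: record it, then unprotect it. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)heap_base;

    if (addr < base || addr >= base + NPAGES * page_size)
        _exit(1);                       /* a genuine fault, not ours */

    size_t idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(heap_base + idx * page_size, page_size,
             PROT_READ | PROT_WRITE);
}

/* Protect the heap, write to two pages, return the dirty count. */
int count_dirty_after_writes(void)
{
    struct sigaction sa, old;
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;
    page_size = (size_t)pagesz;

    heap_base = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap_base == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    for (int i = 0; i < NPAGES; i++)
        dirty[i] = 0;
    mprotect(heap_base, NPAGES * page_size, PROT_READ); /* "after GC" */

    volatile unsigned char *p = heap_base;
    p[1 * page_size] = 1;               /* one fault for page 1 */
    p[1 * page_size + 5] = 2;           /* same page: no second fault */
    p[6 * page_size] = 3;               /* one fault for page 6 */

    int n = 0;
    for (int i = 0; i < NPAGES; i++)
        n += dirty[i];

    sigaction(SIGSEGV, &old, NULL);
    munmap(heap_base, NPAGES * page_size);
    return n;
}
```

Each first write costs a signal delivery plus an mprotect() call, which
is the per-page cost the 1991 paper is measuring.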

On the accessed bit
...................

In [3], Boehm says:

	"Paging locality 

	A common concern about garbage collection, or any form of
	dynamic memory allocation, is its interaction with a virtual
	memory system.  Accesses to virtual memory should be such that
	the traffic between disk and memory is small, i.e. most access
	should be to pages that were already recently accessed.  On
	modern computers, where disks are so much slower than CPUs, many
	programs page very little, and most of their heaps reside in the
	working set. But even for those programs in which significant
	parts of the heap do not reside in the working set, there are a
	number of techniques which dramatically increase the locality of
	reference of a mark-and-sweep collector.  The fundamental
	problem is that all memory that may possibly contain pointers
	has to be examined during every full collection."

Boehm then goes on to summarise methods used to avoid this problem.  In
particular, generational collection.

This is something that mincore could perhaps help with.  Pages that
haven't been accessed since certain GC checkpoints can gather in a set
of pages that don't need to be scanned, or at least not scanned
particularly often.

Again, somebody working on a real GC implementation would be the right
person to experiment with extensions to mincore.

My summary from this is: no point adding mincore extensions until we
know what would be useful.  But do reserve the space in those bits 1-7.
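
For comparison, here is what mincore() offers as it stands: residency
only, reported in the low bit of each byte of the result vector.  A
minimal sketch, assuming Linux; the helper name
resident_after_touching_two is made up, and the accessed/dirty
extensions discussed above do not exist in it.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define MC_PAGES 4

/*
 * Map four anonymous pages, touch two of them, and return how many
 * pages mincore() reports as resident (-1 on error).
 */
int resident_after_touching_two(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;
    size_t len = MC_PAGES * (size_t)pagesz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 1;                           /* fault in page 0 */
    p[2 * (size_t)pagesz] = 1;          /* fault in page 2 */

    unsigned char vec[MC_PAGES];
    int n = -1;
    if (mincore(p, len, vec) == 0) {
        n = 0;
        for (int i = 0; i < MC_PAGES; i++)
            n += vec[i] & 1;            /* bits 1-7 carry no meaning */
    }
    munmap(p, len);
    return n;
}
```

The result is at least 2; it can be higher if the kernel chose to back
the untouched pages as well.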

enjoy,
-- Jamie

[1] http://reality.sgi.com/boehm/papers/pldi91.ps.Z
[2] http://reality.sgi.com/boehm/gcdescr.html
[3] http://reality.sgi.com/boehm/issues.html

* Re: madvise (MADV_FREE)
  2000-03-24  0:21               ` madvise (MADV_FREE) Jamie Lokier
@ 2000-03-24  7:21                 ` lars brinkhoff
  2000-03-24 17:42                   ` Jeff Dike
  0 siblings, 1 reply; 55+ messages in thread
From: lars brinkhoff @ 2000-03-24  7:21 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm, jdike

Jamie Lokier wrote:
> Well, I guess we will never know until it has been tried, but it looks
> like it should be experimented with by someone writing a garbage
> collector before it becomes a standard kernel feature.  I really don't
> like the way mprotect breaks syscalls though, even if it performs well.

And please remember that not only garbage collectors can benefit from dirty
and accessed bits.  There are a number of applications doing paging in user
space.  For example, the Brown Simulator
(http://www.cs.brown.edu/software/brownsim/)
and a386 (http://a386.nocrew.org/) both provide virtual CPUs with MMUs which
can run operating system kernels.  Per-page accessed and dirty information
from the hosting kernel would ease the implementation of a simulated MMU.

Perhaps also the user-mode Linux kernel would benefit, but I'm not sure.
Jeff?

* Re: /dev/recycle
  2000-03-24  0:00               ` /dev/recycle Jamie Lokier
@ 2000-03-24  9:14                 ` Christoph Rohland
  2000-03-24 13:10                   ` /dev/recycle Jamie Lokier
  2000-03-28  0:48                 ` /dev/recycle Chuck Lever
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Rohland @ 2000-03-24  9:14 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Chuck Lever, linux-mm

Jamie Lokier <lk@tantalophile.demon.co.uk> writes:

> Better than MADV_FREE: /dev/recycle
> --------------------------------------------------
> 
> What about this whacky idea?
> 
> MAP_RECYCLE|MAP_ANON initially allocates pages like MAP_ANON.  Mapping
> /dev/recycle is similar (but subtly different).
> 
> MADV_DONTNEED or munmap discard private modifications, but record this
> process as the page owner.  If the process later accesses the page, a
> page is allocated again but the MAP_RECYCLE means it may return a page
> already marked as belonging to this process without clearing it.
> 
> That's better for app allocators than MADV_FREE: they're giving the
> kernel more freedom with not much loss in performance.  And the kernel
> likes this too -- no need for vmscan to release references, as the pages
> are free already.

This would only work for /dev/zero like mappings. I need it for shm
mappings.

Greetings
		Christoph

* Re: /dev/recycle
  2000-03-24  9:14                 ` /dev/recycle Christoph Rohland
@ 2000-03-24 13:10                   ` Jamie Lokier
  2000-03-24 13:54                     ` /dev/recycle Christoph Rohland
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-24 13:10 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: Chuck Lever, linux-mm

Christoph Rohland wrote:
> > MAP_RECYCLE|MAP_ANON initially allocates pages like MAP_ANON.  Mapping
> > /dev/recycle is similar (but subtly different).
> 
> This would only work for /dev/zero like mappings. I need it for shm
> mappings.

Open /dev/recycle several times and map it shared -- it's the same as
anonymous shared mappings.  The owner of pages is considered to be the
filehandle itself in that case.

-- Jamie

* Re: /dev/recycle
  2000-03-24 13:10                   ` /dev/recycle Jamie Lokier
@ 2000-03-24 13:54                     ` Christoph Rohland
  2000-03-24 14:17                       ` /dev/recycle Jamie Lokier
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Rohland @ 2000-03-24 13:54 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Rohland, Chuck Lever, linux-mm

Jamie Lokier <lk@tantalophile.demon.co.uk> writes:

> Christoph Rohland wrote:
> > This would only work for /dev/zero like mappings. I need it for shm
> > mappings.
> 
> Open /dev/recycle several times and map it shared -- it's the same as
> anonymous shared mappings.  The owner of pages is considered to be the
> filehandle itself in that case.

It's not the same as posix shared mem.

Greetings
		Christoph

* Re: /dev/recycle
  2000-03-24 13:54                     ` /dev/recycle Christoph Rohland
@ 2000-03-24 14:17                       ` Jamie Lokier
  2000-03-24 17:40                         ` /dev/recycle Christoph Rohland
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-24 14:17 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: Chuck Lever, linux-mm

Christoph Rohland wrote:
> > Open /dev/recycle several times and map it shared -- it's the same as
> > anonymous shared mappings.  The owner of pages is considered to be the
> > filehandle itself in that case.
> 
> It's not the same as posix shared mem.

What's the difference?

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-24 17:42                   ` Jeff Dike
@ 2000-03-24 16:49                     ` Jamie Lokier
  2000-03-24 17:08                     ` Stephen C. Tweedie
  1 sibling, 0 replies; 55+ messages in thread
From: Jamie Lokier @ 2000-03-24 16:49 UTC (permalink / raw)
  To: Jeff Dike; +Cc: lars brinkhoff, cel, linux-mm

Jeff Dike wrote:
> Maybe on arches where the hardware provides those bits and the kernel uses 
> them, but the i386 kernel doesn't.

The i386 not-user-mode kernel certainly uses the accessed and dirty bits.
What do you think pte_young does?

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-24 17:42                   ` Jeff Dike
  2000-03-24 16:49                     ` Jamie Lokier
@ 2000-03-24 17:08                     ` Stephen C. Tweedie
  2000-03-24 19:58                       ` Jeff Dike
  1 sibling, 1 reply; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-24 17:08 UTC (permalink / raw)
  To: Jeff Dike; +Cc: lars brinkhoff, lk, cel, linux-mm, Stephen Tweedie

Hi,

On Fri, Mar 24, 2000 at 12:42:18PM -0500, Jeff Dike wrote:
> 
> Maybe on arches where the hardware provides those bits and the kernel uses 
> them, but the i386 kernel doesn't.

Sure it does.  It relies utterly on them.  It uses the accessed bit to
perform page aging, and it uses the dirty bit to distinguish between
private and shared pages on writable private vmas, or to mark dirty shared
pages on shared vmas.

--Stephen

^ permalink raw reply	[flat|nested] 55+ messages in thread
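
The role of the two bits described above can be illustrated with a toy model. This is only a sketch: the bit values match the i386 hardware page-table layout (accessed = 0x020, dirty = 0x040), but the `toy_` names, the struct, and the aging policy (add 3 when referenced, halve when idle) are assumptions made for illustration and are not the kernel's actual code or tuning.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as in the i386 hardware page-table entry. */
#define TOY_PAGE_ACCESSED 0x020u
#define TOY_PAGE_DIRTY    0x040u

/* Hypothetical per-page state: a PTE word plus an age counter. */
typedef struct {
    uint32_t pte;
    unsigned age;
} toy_page;

int toy_pte_young(uint32_t pte) { return (pte & TOY_PAGE_ACCESSED) != 0; }
int toy_pte_dirty(uint32_t pte) { return (pte & TOY_PAGE_DIRTY) != 0; }

/* Stand-in for the MMU: any access sets "accessed", a write also sets "dirty". */
void toy_touch(toy_page *pg, int is_write)
{
    pg->pte |= TOY_PAGE_ACCESSED;
    if (is_write)
        pg->pte |= TOY_PAGE_DIRTY;
}

/*
 * One aging pass over a page (illustrative policy only).
 * A referenced page has its accessed bit cleared and its age raised;
 * an idle page has its age halved. Returns 1 when the page has aged
 * out (age reached 0) and would be a reclaim candidate, else 0.
 */
int toy_age_page(toy_page *pg)
{
    if (toy_pte_young(pg->pte)) {
        pg->pte &= ~TOY_PAGE_ACCESSED;
        pg->age += 3;
        if (pg->age > 20)
            pg->age = 20;
        return 0;
    }
    pg->age /= 2;
    return pg->age == 0;
}
```

A page touched once survives two passes in this model and becomes a reclaim candidate on the third; a written page additionally keeps its dirty bit until something writes it back.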

* Re: /dev/recycle
  2000-03-24 14:17                       ` /dev/recycle Jamie Lokier
@ 2000-03-24 17:40                         ` Christoph Rohland
  2000-03-24 18:13                           ` /dev/recycle Jamie Lokier
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Rohland @ 2000-03-24 17:40 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Rohland, Chuck Lever, linux-mm

Jamie Lokier <lk@tantalophile.demon.co.uk> writes:

> Christoph Rohland wrote:
> > > Open /dev/recycle several times and map it shared -- it's the same as
> > > anonymous shared mappings.  The owner of pages is considered to be the
> > > filehandle itself in that case.
> > 
> > It's not the same as posix shared mem.
> 
> What's the difference?

1) /dev/{zero,recycle} shared mappings only work between children of
   the same parent and the parent itself. Also, they do not survive an exec.
2) You cannot unmap and remap the same area.

Greetings
		Christoph

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-24  7:21                 ` lars brinkhoff
@ 2000-03-24 17:42                   ` Jeff Dike
  2000-03-24 16:49                     ` Jamie Lokier
  2000-03-24 17:08                     ` Stephen C. Tweedie
  0 siblings, 2 replies; 55+ messages in thread
From: Jeff Dike @ 2000-03-24 17:42 UTC (permalink / raw)
  To: lars brinkhoff; +Cc: lk, cel, linux-mm

> Per-page accessed and dirty information from the hosting kernel would
> ease the implementation of a simulated MMU.

> Perhaps also the user-mode Linux kernel would benefit, but I'm not
> sure. Jeff?

The user-mode kernel doesn't expect to get any mm bits from the hosting kernel 
and I don't see any use for them.  It lives in its own happy world keeping 
track of its own bits.

Maybe on arches where the hardware provides those bits and the kernel uses 
them, but the i386 kernel doesn't.

				Jeff



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: /dev/recycle
  2000-03-24 17:40                         ` /dev/recycle Christoph Rohland
@ 2000-03-24 18:13                           ` Jamie Lokier
  2000-03-25  8:35                             ` /dev/recycle Christoph Rohland
  0 siblings, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2000-03-24 18:13 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: Chuck Lever, linux-mm

Christoph Rohland wrote:
> 1) /dev/{zero,recycle} shared mappings only work between children of
>    the same parent and the parent itself. Also, they do not survive an exec.

Use file handle passing -- another process can then share the mapping.
This is what shared anonymous mapping means, and it was added to the
kernel recently just after posix shm (because posix shm made it easy to
implement).

> 2) You cannot unmap and remap the same area.

You can if someone else holds it open.

Anyway, you can use MAP_RECYCLE when you're mapping posix shm.
That could be made to work :-)

-- Jamie
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 55+ messages in thread
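
File handle passing, as mentioned above, is done with an SCM_RIGHTS control message over a Unix-domain socket. The following is a minimal sketch of that mechanism only. Whether a /dev/zero descriptor shares pages this way is disputed in this thread, and /dev/recycle is only a proposal, so the sketch maps an unlinked temporary file instead; the function names and the /tmp path template are assumptions for illustration. Both ends live in one process here to keep it short; normally the receiver would be another process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd over the Unix socket sock. Returns 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 'F';                       /* at least one data byte is needed */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock. Returns the new fd, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    int fd;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

/*
 * Map the same file through the original and the passed descriptor.
 * Returns 1 if a write through one mapping is visible through the
 * other, 0 if it is not, -1 on error.
 */
int shared_via_passed_fd(void)
{
    char path[] = "/tmp/fdpass-sketch-XXXXXX";   /* hypothetical name */
    int sv[2], result = -1;
    long page = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);

    if (fd < 0 || page <= 0)
        return -1;
    unlink(path);                          /* keep the file anonymous */
    if (ftruncate(fd, page) != 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        close(fd);
        return -1;
    }
    if (send_fd(sv[0], fd) == 0) {
        int fd2 = recv_fd(sv[1]);
        if (fd2 >= 0) {
            size_t len = (size_t)page;
            char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd2, 0);
            if (a != MAP_FAILED && b != MAP_FAILED) {
                a[0] = 'x';
                result = (b[0] == 'x');
            }
            if (a != MAP_FAILED) munmap(a, len);
            if (b != MAP_FAILED) munmap(b, len);
            close(fd2);
        }
    }
    close(sv[0]);
    close(sv[1]);
    close(fd);
    return result;
}
```

The received descriptor refers to the same open file as the sender's, which is why the two MAP_SHARED mappings see each other's writes.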

* Re: madvise (MADV_FREE)
  2000-03-24 17:08                     ` Stephen C. Tweedie
@ 2000-03-24 19:58                       ` Jeff Dike
  2000-03-25  0:30                         ` Stephen C. Tweedie
  0 siblings, 1 reply; 55+ messages in thread
From: Jeff Dike @ 2000-03-24 19:58 UTC (permalink / raw)
  To: Stephen C. Tweedie, lk; +Cc: linux-mm

> The i386 not-user-mode kernel

I usually call that the native kernel :-)

lk@tantalophile.demon.co.uk said:
> certainly uses the accessed and dirty bits. What do you think
> pte_young does?

sct@redhat.com said:
> It uses the accessed bit to perform page aging, and it uses the dirty
> bit to distinguish between private and shared pages on writable
> private vmas, or to mark dirty shared pages on shared vmas.

I should have thought a little before making that post.  When I did the 
user-mode port, I didn't have to provide any special support for maintaining 
the non-protection bits (should I be?).  I essentially stole the i386 
pgtable.h and pgalloc.h to get the bits and macros, and that's about it.

Everything appears to work fine, so my conclusion (without delving into the 
i386 code too deeply) was that the upper kernel maintained them itself without 
any particular help from the hardware.

Is this correct?  Should I be dealing with the non-protection bits in the arch 
layer?

				Jeff


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: madvise (MADV_FREE)
  2000-03-24 19:58                       ` Jeff Dike
@ 2000-03-25  0:30                         ` Stephen C. Tweedie
  0 siblings, 0 replies; 55+ messages in thread
From: Stephen C. Tweedie @ 2000-03-25  0:30 UTC (permalink / raw)
  To: Jeff Dike; +Cc: Stephen C. Tweedie, lk, linux-mm

Hi,

On Fri, Mar 24, 2000 at 02:58:10PM -0500, Jeff Dike wrote:
> 
> Everything appears to work fine, so my conclusion (without delving into the 
> i386 code too deeply) was that the upper kernel maintained them itself without 
> any particular help from the hardware.
> 
> Is this correct?  Should I be dealing with the non-protection bits in the arch 
> layer?

You probably should.  It is impossible to do MAP_SHARED, PROT_WRITE 
regions correctly without dirty bit support, and you don't get 
efficient paging without accessed bit support.

--Stephen

^ permalink raw reply	[flat|nested] 55+ messages in thread
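
Where no hardware dirty bit is available to a program, as for a kernel running in user mode, a common way to emulate one is to map pages read-only and record the first write fault. The sketch below shows that general technique under stated assumptions: it is not taken from the user-mode port, all names are hypothetical, it handles a fixed four-page region, and it relies on the usual Linux behaviour that the faulting store is restarted once the SIGSEGV handler returns.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRACK_PAGES 4

static unsigned char *track_base;               /* start of the tracked region */
static size_t track_page;                       /* page size in bytes */
static volatile sig_atomic_t track_dirty[TRACK_PAGES];

/* Write-fault handler: set the software dirty bit, then allow writes. */
static void track_on_fault(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)track_base;
    (void)sig;
    (void)ctx;

    if (addr < base || addr >= base + TRACK_PAGES * track_page)
        _exit(1);                               /* a genuine crash elsewhere */

    size_t idx = (addr - base) / track_page;
    track_dirty[idx] = 1;
    mprotect(track_base + idx * track_page, track_page,
             PROT_READ | PROT_WRITE);
}

/*
 * Write to page 2 of a read-only region and check that only that page
 * was recorded as dirty. Returns 1 if so, 0 if not, -1 on error.
 */
int track_dirty_demo(void)
{
    struct sigaction sa, old;
    long page = sysconf(_SC_PAGESIZE);
    int result;

    if (page <= 0)
        return -1;
    track_page = (size_t)page;
    for (int i = 0; i < TRACK_PAGES; i++)
        track_dirty[i] = 0;

    track_base = mmap(NULL, TRACK_PAGES * track_page, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (track_base == MAP_FAILED)
        return -1;

    sa.sa_sigaction = track_on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(track_base, TRACK_PAGES * track_page);
        return -1;
    }

    /* This store faults once; the handler marks page 2 dirty and it retries. */
    volatile unsigned char *w = track_base;
    w[2 * track_page] = 1;

    result = track_dirty[2] && !track_dirty[0] &&
             !track_dirty[1] && !track_dirty[3] &&
             w[2 * track_page] == 1;

    sigaction(SIGSEGV, &old, NULL);
    munmap(track_base, TRACK_PAGES * track_page);
    return result;
}
```

An accessed bit can be emulated the same way by starting from PROT_NONE, at the cost of one extra fault per page per sampling interval.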

* Re: /dev/recycle
  2000-03-24 18:13                           ` /dev/recycle Jamie Lokier
@ 2000-03-25  8:35                             ` Christoph Rohland
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Rohland @ 2000-03-25  8:35 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Rohland, Chuck Lever, linux-mm

Jamie Lokier <lk@tantalophile.demon.co.uk> writes:

> Christoph Rohland wrote:
> > 1) /dev/{zero,recycle} shared mappings only work between children of
> >    the same parent and the parent itself. Also, they do not survive an exec.
> 
> Use file handle passing -- another process can then share the mapping.
> This is what shared anonymous mapping means, and it was added to the
> kernel recently just after posix shm (because posix shm made it easy to
> implement).

That's not how /dev/zero works. Check the implementation. AFAIK it
also does not work this way on other platforms.
 
> > 2) You cannot unmap and remap the same area.
> 
> You can if someone else holds it open.

See above.

Greetings
		Christoph

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: /dev/recycle
  2000-03-24  0:00               ` /dev/recycle Jamie Lokier
  2000-03-24  9:14                 ` /dev/recycle Christoph Rohland
@ 2000-03-28  0:48                 ` Chuck Lever
  1 sibling, 0 replies; 55+ messages in thread
From: Chuck Lever @ 2000-03-28  0:48 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-mm

On Fri, 24 Mar 2000, Jamie Lokier wrote:
> Chuck Lever wrote:
> > > MADV_FREE only discards private modifications when there is paging
> > > pressure to do so.  The decisions to do so are deferred, for
> > > architectures that support this.  (Includes x86).
> > 
> > i still don't see a big difference.  the private modifications, in both
> > cases, won't be written to swap.  in both cases, the application cannot
> > rely on the contents of these pages after the madvise call.
> 
> Correct.  The difference is that with MADV_FREE, clear_page() operations
> are skipped when there's no memory pressure from the kernel.
> 
> > for private mappings, pages are freed immediately by DONTNEED; FREE will
> > cause the pages to be freed later if the system is low on memory.  that's
> > six of one, half dozen of the other.  freeing later may mean the
> > application saves a little time now,
> 
> It may save the time overall -- if the page is next reused by the
> application before the kernel recycles it.  Note that nobody, neither
> the application nor the kernel, knows in advance if this will be the
> case.
> 
> > but freeing immediately could mean postponing a low memory scenario,
> > and would allow the system to reuse a page that is still in hardware
> > caches.
> 
> The system is free to reuse MADV_FREE pages immediately if it wishes --
> the system doesn't lose here.  In fact if you're already low on memory
> at the time madvise() is called, the kernel would reclaim as many pages
> as it needs immediately, just as if you'd called MADV_DONTNEED for those
> pages.  The remainder get marked reclaimable.

ok, i just want to make sure we really are talking about the same thing,
at least from the point of view of the semantics that the application will
depend on.  the only difference is how/when the kernel disposes of the
pages.

reducing the number of clear_page() operations and reducing the amount of
page table jiggling on SMP are both good goals.  is it your view that
MADV_FREE is a better implementation of MADV_DONTNEED?  should we replace
the current implementation of MADV_DONTNEED with one that behaves more
like MADV_FREE?  is there a reason to have both behaviors available to
applications?

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 55+ messages in thread
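
The application-visible side of this discussion can be checked with a short program. The sketch below assumes the Linux behaviour of MADV_DONTNEED on a private anonymous mapping, where the next access sees a zero-filled page; other systems give MADV_DONTNEED different meanings, as noted earlier in the thread. The function name is hypothetical. With MADV_FREE semantics the program could see either the old contents or zeroes, depending on whether the kernel had reclaimed the page yet, so no fixed result could be asserted for it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill one private anonymous page, advise MADV_DONTNEED, and read it back.
 * Returns 1 if the page reads as all zeroes afterwards (Linux behaviour),
 * 0 if any old data survived, -1 on error.
 */
int dontneed_discards_private_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);                  /* private modification */
    assert(p[0] == 0xAB);                  /* data is there before the advice */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The application may no longer rely on the old contents. */
    int zeroed = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            zeroed = 0;
            break;
        }
    }
    munmap(p, len);
    return zeroed;
}
```

The zero-filled read is where the clear_page() cost discussed above is paid: the page is discarded at the madvise() call and a fresh one is cleared on the next touch, whether or not memory was scarce in between.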

end of thread, other threads:[~2000-03-28  0:48 UTC | newest]

Thread overview: 55+ messages
     [not found] <20000320135939.A3390@pcep-jamie.cern.ch>
2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever
2000-03-21  1:20   ` madvise (MADV_FREE) Jamie Lokier
2000-03-21  2:24     ` William J. Earl
2000-03-21 14:08       ` Jamie Lokier
2000-03-22 16:24     ` Chuck Lever
2000-03-22 18:05       ` Jamie Lokier
2000-03-22 21:39         ` Chuck Lever
2000-03-22 22:31           ` Jamie Lokier
2000-03-22 22:44             ` Stephen C. Tweedie
2000-03-23 18:53             ` Chuck Lever
2000-03-24  0:00               ` /dev/recycle Jamie Lokier
2000-03-24  9:14                 ` /dev/recycle Christoph Rohland
2000-03-24 13:10                   ` /dev/recycle Jamie Lokier
2000-03-24 13:54                     ` /dev/recycle Christoph Rohland
2000-03-24 14:17                       ` /dev/recycle Jamie Lokier
2000-03-24 17:40                         ` /dev/recycle Christoph Rohland
2000-03-24 18:13                           ` /dev/recycle Jamie Lokier
2000-03-25  8:35                             ` /dev/recycle Christoph Rohland
2000-03-28  0:48                 ` /dev/recycle Chuck Lever
2000-03-24  0:21               ` madvise (MADV_FREE) Jamie Lokier
2000-03-24  7:21                 ` lars brinkhoff
2000-03-24 17:42                   ` Jeff Dike
2000-03-24 16:49                     ` Jamie Lokier
2000-03-24 17:08                     ` Stephen C. Tweedie
2000-03-24 19:58                       ` Jeff Dike
2000-03-25  0:30                         ` Stephen C. Tweedie
2000-03-22 22:33           ` Stephen C. Tweedie
2000-03-22 22:45             ` Jamie Lokier
2000-03-22 22:48               ` Stephen C. Tweedie
2000-03-22 22:55                 ` Q. about swap-cache orphans Jamie Lokier
2000-03-22 22:58                   ` Stephen C. Tweedie
2000-03-22 18:15       ` madvise (MADV_FREE) Christoph Rohland
2000-03-22 18:30         ` Jamie Lokier
2000-03-23 16:56           ` Christoph Rohland
2000-03-21  1:29   ` MADV_DONTNEED Jamie Lokier
2000-03-22 17:04     ` MADV_DONTNEED Chuck Lever
2000-03-22 17:10       ` MADV_DONTNEED Stephen C. Tweedie
2000-03-22 17:32         ` MADV_DONTNEED Jamie Lokier
2000-03-22 17:33         ` MADV_DONTNEED Jamie Lokier
2000-03-22 17:37           ` MADV_DONTNEED Stephen C. Tweedie
2000-03-22 17:43       ` MADV_DONTNEED Jamie Lokier
2000-03-22 21:54         ` MADV_DONTNEED Chuck Lever
2000-03-22 22:41           ` MADV_DONTNEED Jamie Lokier
2000-03-23 19:13             ` MADV_DONTNEED James Antill
2000-03-21  1:47   ` Extensions to mincore Jamie Lokier
2000-03-21  9:11     ` Eric W. Biederman
2000-03-21  9:40       ` lars brinkhoff
2000-03-21 11:34       ` Stephen C. Tweedie
2000-03-21 15:15         ` Jamie Lokier
2000-03-21 15:41           ` Stephen C. Tweedie
2000-03-21 15:55             ` Jamie Lokier
2000-03-21 16:08               ` Stephen C. Tweedie
2000-03-21 16:48                 ` Jamie Lokier
2000-03-22  7:36                   ` Eric W. Biederman
2000-03-21  1:50   ` MADV flags as mmap options Jamie Lokier
