* More observations...
From: Mike Simons @ 2000-05-16 2:44 UTC
To: Linux Memory Management List
More observations on how the mmap002 application causes crashes, ways to
avoid the kernel deadlocks, and how to speed them up ;).
Questions:
- In vmstat output, if memory doesn't show up in "free", "buff", or in
"cache" ... where is it?
- Why doesn't vmstat show any "block out" activity shortly after the
mmap002 application starts running?
- Why can't the kernel flush buffers to disk while the application is
still creating dirty buffers (especially when the fraction of "dirty"
buffers on the system is > X %)?
The application is only dirtying about 6 MB of new pages per second,
using about 6% of the processor, a rate my hard drive would have no
trouble keeping up with.
Based on my previous post, it appears the kernel waits until *just*
before killing mmap002 to do any writes. As far as I can tell, had the
kernel been flushing the buffers as they were dirtied, there would never
have been a problem. Instead, the kernel currently backs itself into
a corner by letting applications _fill_ all available memory with
dirty buffers, so when someone asks for more memory there is nowhere to
turn... except lots of disk I/O and blocking, which the current kernel
(for whatever reason) doesn't wait long enough for before it starts
blowing processes to bits. =)
Sure, if the kernel started forcing dirty buffers to disk once 75% of
memory was dirty, the application could re-dirty buffers that had
already been flushed and there would be some wasted I/O, but that might
just prevent the system from completely running out of "available"
pages, since it could reuse one it had just put out to disk...
Well... since I don't know anything about how the kernel VM system works,
I started tinkering with mmap002.c to see if there were different kernel
behaviors under different types of load (a rough sketch combining the
changes below appears after this list):
- Added fprintf's to report when mmap003 finishes each loop.
I found that the first loop never finished before the process was killed.
After mmap003 was killed, most of the memory was left in "cache".
- Changed the for loops to skip ahead 1024 bytes in the map for
each dirtying write... ("i++" --> "i += 1024")
This causes the kernel to kill almost every application (a list of
about 8, including init 6 times) the first time mmap003 is
run after a fresh reboot. This is on the -pre8+riel patch 2 kernel
I mentioned earlier today; the major change is that it previously took
at least three runs to kill the system, and only occasionally
killed more than mmap and init... now it is slaying several
applications every time.
- Still skipping 1024 bytes, I added an msync to flush the entire
mmap'ed buffer after each 1 MB of the file was dirtied.
This version takes 11 seconds to run the first loop and never
gets killed while running the first loop...
When it starts the second for loop (which uses a non-file-mapped buffer),
memory suddenly disappears from "cache" and does not reappear in any
other category. I killed the application after vmstat showed only
24 MB left in my system (4 free, 0 buff, 20 cached). All this missing
memory instantly reappeared as free.
I then let this run a few times; each time the kernel kills the mmap003
application while it is in the second for loop... this loop never
completes. When the application is killed, all the memory that was
missing instantly shows up as free. (Sometimes this second loop will
kill init and lock the system, so be careful.)
- Changed the msync to request ASYNC flushing of the whole buffer
once every 32 MB... the first loop completes in 8 seconds of real
time, 0 seconds user, 1 second system (256 MB file). The system
locks up for about 4 seconds just before finishing the for loop, but
is responsive before and after...
- I noticed the difference in how killing a file-mapped versus a
non-file-mapped application affects free memory... so I tried killing
the application manually, which has the same effect (memory doesn't
move to free).
I tracked this down to the fact that the buffers remain in the "cache"
state until the mmap'ed file is unlinked, at which point all of the
buffers instantly free up.
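
For reference, the combined stride + ASYNC-flush variant looks roughly
like this. This is only a minimal, hypothetical sketch (the file name and
constants are made up to match the tests above), not the actual mmap003.c,
which is available from the URL further down:

/*
 * Hypothetical sketch of the modified test: map a 256 MB file, dirty one
 * byte every 1024 bytes, and ask for an asynchronous flush of the whole
 * buffer every 32 MB so writeback can start while pages are still being
 * dirtied.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE   (256UL * 1024 * 1024)   /* 256 MB file, as in the tests */
#define STRIDE     1024                    /* the "i++" --> "i += 1024" change */
#define FLUSH_STEP (32UL * 1024 * 1024)    /* msync(MS_ASYNC) every 32 MB */

int main(void)
{
    int fd = open("testfile", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0) {
        perror("open/ftruncate");
        return 1;
    }

    char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    for (unsigned long i = 0; i < MAP_SIZE; i += STRIDE) {
        map[i] = 1;                          /* dirty one byte per kilobyte */
        if (i && (i % FLUSH_STEP) == 0)
            msync(map, MAP_SIZE, MS_ASYNC);  /* kick off writeback early */
    }

    fprintf(stderr, "first loop done\n");
    munmap(map, MAP_SIZE);
    close(fd);
    unlink("testfile");   /* cached pages only return to "free" after this */
    return 0;
}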
TTFN,
Mike Simons
The vmstat output captured during some of the tests and a version of the
mmap003.c code are available from:
http://moria.simons-clan.com/~msimons/
Note that the original mmap code is part of a test suite by Juan Jose
Quintela and is available at:
http://carpanta.dc.fi.udc.es/~quintela/memtest/
* Re: More observations...
From: Stephen C. Tweedie @ 2000-05-16 10:20 UTC
To: Mike Simons; +Cc: Linux Memory Management List
Hi,
On Mon, May 15, 2000 at 10:44:03PM -0400, Mike Simons wrote:
> Sure, if the kernel started forcing dirty buffers to disk once 75% of
> memory was dirty, the application could re-dirty buffers that had
> already been flushed and there would be some wasted I/O, but that might
> just prevent the system from completely running out of "available"
> pages, since it could reuse one it had just put out to disk...
With mmap(), this has nothing to do with dirty buffers. There are, in
fact, _no_ dirty buffers in the mmap() case --- the buffer_heads
backing the files remain clean. It is the pages themselves which
are dirty, and the only record of their dirtiness is in the ptes.
For buffer_heads, we can (and do) throttle write activity when
the dirty list grows too long. However, we don't do anything like
that for mmaped pages. Think what happens if you have an application
(say, a simulation, in which a lot of the data is constantly being
modified) which fits in memory, but only just --- if you put a limit
on the %age of dirty memory, you'd be constantly thrashing to disk
despite having enough memory for the workload.
We _could_ keep track of the number of dirty pages quite easily, by
making all clean ptes readonly. It's not at all clear that it helps,
though.
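
(Purely as a user-space illustration of that pte trick -- this is not
kernel code, and the names below are made up -- the same "keep clean
pages read-only, count the first write fault" idea can be sketched with
mprotect() and a SIGSEGV handler:)

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16

static long page_size;
static int dirty_pages;            /* pages that have taken a first write */

static void on_write_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* round the faulting address down to its page, count it as dirty,
       and make the page writable so the faulting write can proceed */
    uintptr_t page = (uintptr_t)si->si_addr & ~((uintptr_t)page_size - 1);
    dirty_pages++;
    mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa = { 0 };
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* "clean" pages start out read-only, like write-protected ptes */
    char *region = mmap(NULL, NPAGES * page_size, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    region[0] = 1;                  /* first write to page 0: faults, counted */
    region[3 * page_size] = 1;      /* first write to page 3: second dirty page */
    region[1] = 2;                  /* page 0 again: already writable, no fault */

    printf("dirty pages: %d\n", dirty_pages);   /* prints 2 */
    return 0;
}

A kernel version would do the equivalent in the write-fault path on
write-protected ptes; as said above, the open question is whether having
the count actually helps.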
I think that the real solution here is still dynamic RSS limits for
mms. We can allow the RSS limits to grow as the RSS grows as long as
there are sufficient free pages in the GFP_USER class. As soon as we
start to swap, however, imposing RSS limits is an ideal way (right
now, it's pretty much the only way) to limit the impact of heavy
threaded memory write activity by a process.
The concept is quite simple: if you can limit a process's RSS, you
can limit the amount of memory which is pinned in process page tables,
and thus subject to expensive swapping. Note that you don't have to
get rid of the pages --- you can leave them in the page cache/swap
cache, where they can be re-faulted rapidly if needed, but if the
memory is needed for something else then shrink_mmap can reclaim the
pages rapidly.
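
(As an aside: the closest user-visible knob is the static RLIMIT_RSS
resource limit. What is described above is a dynamic, kernel-internal
limit rather than that, and how strictly the kernel honours RLIMIT_RSS
has varied, so the snippet below only illustrates the interface, not the
proposed mechanism:)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_RSS, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("RSS limit: cur=%llu max=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    rl.rlim_cur = 64UL * 1024 * 1024;   /* request a 64 MB resident limit */
    if (setrlimit(RLIMIT_RSS, &rl) != 0)
        perror("setrlimit");

    return 0;
}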
Rik's old memory hog flag is essentially a simple case of an RSS
limit (the task's RSS is limited to its current value). In
general, if you can identify severe memory pressure being caused by
a specific process, then you can start doing early RSS limiting on
the mm in question and substantially reduce the impact on the rest
of the system.
--Stephen
* Re: More observations...
From: Rik van Riel @ 2000-05-16 15:41 UTC
To: Stephen C. Tweedie; +Cc: Mike Simons, Linux Memory Management List
On Tue, 16 May 2000, Stephen C. Tweedie wrote:
> The concept is quite simple: if you can limit a process's RSS,
> you can limit the amount of memory which is pinned in process
> page tables, and thus subject to expensive swapping. Note that
> you don't have to get rid of the pages --- you can leave them in
> the page cache/swap cache, where they can be re-faulted rapidly
> if needed, but if the memory is needed for something else then
> shrink_mmap can reclaim the pages rapidly.
There's one problem with this idea. The current implementation
of shrink_mmap() skips over dirty pages, leading to a failing
shrink_mmap(), calls to swap_out() and replacement of the wrong
pages...
> Rik's old memory hog flag is essentially a simple case of an
> RSS limit (the task's RSS is limited to its current value).
Not really. The anti-hog code did a number of things:
- swap_out() scans tasks more and more aggressively as their
RSS gets bigger, meaning we "push back harder" if a process
is very big
- slow down the allocation rate of very big processes
by having them call try_to_free_pages() whenever they want
to allocate something. A process doesn't have to steal a page
from itself; it can steal the page from anywhere.
The effect should be comparable to RSS limits, only simpler ;)
(After all, all RSS limits do is make sure that the VM subsystem
"pushes back harder" against the VM pressure of big processes.)
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
* Re: More observations...
From: Stephen C. Tweedie @ 2000-05-16 16:07 UTC
To: Rik van Riel
Cc: Stephen C. Tweedie, Mike Simons, Linux Memory Management List
Hi,
On Tue, May 16, 2000 at 12:41:05PM -0300, Rik van Riel wrote:
>
> > The concept is quite simple: if you can limit a process's RSS,
> > you can limit the amount of memory which is pinned in process
> > page tables, and thus subject to expensive swapping. Note that
> > you don't have to get rid of the pages --- you can leave them in
> > the page cache/swap cache, where they can be re-faulted rapidly
> > if needed, but if the memory is needed for something else then
> > shrink_mmap can reclaim the pages rapidly.
>
> There's one problem with this idea. The current implementation
> of shrink_mmap() skips over dirty pages, leading to a failing
> shrink_mmap(), calls to swap_out() and replacement of the wrong
> pages...
No, because if you have evicted the pages from the RSS, they are
guaranteed to be clean. The shrink_mmap reclaim will never have
to block. We always flush mmaped or anon pages on swapout, not on
shrink_mmap().
For writable shared file mappings, the flush only goes to the buffer
cache, not to disk, so we still rely on bdflush writeback, but
currently filemap_swapout triggers the bdflush thread automatically
anyway. Subsequent shrink_mmap reclaims will just find a locked
page and block, which is the desired behaviour.
--Stephen
* Re: More observations...
From: Rik van Riel @ 2000-05-16 17:23 UTC
To: Stephen C. Tweedie
Cc: Mike Simons, Linus Torvalds, Linux Memory Management List
On Tue, 16 May 2000, Stephen C. Tweedie wrote:
> For writable shared file mappings, the flush only goes to the
> buffer cache, not to disk, so we still rely on bdflush
> writeback, but currently filemap_swapout triggers the bdflush
> thread automatically anyway. Subsequent shrink_mmap reclaims
> will just find a locked page and block, which is the desired
> behaviour.
I can agree on this. Shrink_mmap() should wait if it finds
(a number of) locked buffers. [It doesn't seem to do that
right now]
Linus??
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/