* Strange behaviour of pre9-1
From: Juan J. Quintela @ 2000-05-15 22:56 UTC
To: linux-mm, Linus Torvalds
Hi,
this is a report on the current state of the VM system in linux
pre9-1. I have been playing with the two patches from Rik, with the
patch from Linus (sync_page_buffers), and with the one that I posted
yesterday to the list. I have been testing with the mmap002 code; I
know that it is not a good performance test, but it is a good test of
pressure on page allocation. I am presenting my findings here in the
hope that somebody has an idea of how to solve these problems.
The tests have been done on a K6-300MHz with 98MB of RAM.
Vanilla pre9-1 dies on every attempt to run the test: the machine
becomes completely frozen. No answer to ping, no response to the
keyboard, nothing on the console, nothing in the logs. The last
message printed is that init has been killed due to OOM.
The one that helped most is Linus's patch: it helps a lot with
performance, and we go almost as fast as 2.2. The problem with this
patch is that we sometimes get out-of-memory errors (much less often
than with the vanilla kernel, but in 5/6 tries we get the error in
init, and after that the system freezes as in the previous
paragraph). Another problem with the patch is that we spend *a lot
of time* in the kernel; system time has increased a lot.
real 6m12.635s
user 0m16.290s
sys 1m40.420s
In 2.2 the whole process takes less than 2m, and the system time is
around 10 seconds. At the end, the run still finishes with killed
processes.
The next try used the patch from Rik, also raising the priority in
try_to_free_pages to 16 (other magic values such as 10 were tested
too) and playing with FREE_COUNT and SWAP_COUNT (8, 16 and 32). The
system normally never kills any process, but it is slow, very slow:
real 11m48.746s
user 0m16.060s
sys 5m22.010s
It takes double the wall-clock time of the Linus version, and more
than three times the system time. Note also that with this version I
see stalls of more than 20 seconds in the 'vmstat 1' output.
More about the stalls later.
I tried several combinations of Rik's and Linus's patches, but I
could not get the good points of the two together. I normally get the
instability of Linus's patch (indeed, more instability) and the
*slowness* of Rik's patch. When combined, the system always gets
killed.
I also tried the patch that I posted yesterday to minimise the
system time; what my patch basically does is keep the LRU list fixed
and remove/re-insert only the elements that change position. This
helped in some tests and didn't help at all in others. I have reached
the conclusion that the moment at which the system freezes is related
to when it kills init, but that moment is *really* down to luck: you
make a change in _any_ place and things get better, worse or stay the
same, and you can affirm _nothing_ about the last change, because the
new result can be down to pure luck (or the lack of it).
Talking about the stalls: looking at the vmstat output, I have
noticed that they happen when the system is using the whole page
cache to hold dirty pages (mmap002 only generates dirty pages, and
only reuses pages after a long time). It happens when the page cache
reaches a size of 90MB on a machine with 98MB; at that moment, stalls
of 20 seconds (or more) occur.
See for instance output from vmstat 1 before one freeze:
procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si   so    bi     bo    in    cs  us  sy  id
 2  0  0   3148   2680     92  80496  36    0    10      7   156    28  62  38   0
 2  0  0   3312   2352     92  89724  40  164    11    707   148    51  67  33   0
 2  0  1   3364   1440    100  90796  52   68    89   4280   487   428  57  29  14
 1  2  0   3368   1784    392  84344 428   68  1475  94051  4852  6388  84  16   1
                                                     ^^^^^
 1  1  0   3372   2296    120  84196  36   40  4558     10   228   217  78  22   0
The system spent something like 26 minutes on the marked line;
during this time there was no answer to ping. I only noticed that the
machine was not dead when I went to copy the results and saw that the
numbers on the display had moved. After that slowdown:
Killed
real 0m23.441s
user 0m3.250s
sys 0m1.900s
Killed
real 26m23.029s
user 0m2.440s
sys 0m6.230s
real 3m6.289s
user 0m15.770s
sys 0m8.170s
real 2m14.012s
user 0m16.080s
sys 0m11.140s
After the first two mmap002 runs were killed (note that the 26min is
not a copy-and-paste bug), the machine goes fast; from that point on
no more processes are killed and the system stays fast. 2m14s is
quite fast for this test, and 11 seconds of system time is not high
at all.
Surprised by this last result, I am sending you my findings so far.
If you have comments or suggestions, or you need more detailed
results, let me know.
Thanks for your time, Juan.
--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy
* Re: Strange behaviour of pre9-1
From: Linus Torvalds @ 2000-05-15 23:59 UTC
To: Juan J. Quintela; +Cc: linux-mm
On 16 May 2000, Juan J. Quintela wrote:
>
> The one that helped most is Linus's patch: it helps a lot with
> performance, and we go almost as fast as 2.2. The problem with this
> patch is that we sometimes get out-of-memory errors (much less often
> than with the vanilla kernel, but in 5/6 tries we get the error in
> init, and after that the system freezes as in the previous
> paragraph). Another problem with the patch is that we spend *a lot
> of time* in the kernel; system time has increased a lot.
Ok.
The system time increase I wouldn't worry about that much, as long as it's
still clearly smaller than the real time. If we did a better job of
writing stuff out so that the real time goes down, it's almost certainly
the right thing to do.
The fact that Rik's patch performs so badly is interesting in itself, and
I thus removed it from my tree.
I think I have a reasonable alternative to Rik's patch, which is to give
"negative brownie-points" to allocators after the fact. It should be
fairer to the person who frees up memory than the current one, by simply
re-ordering the requirement for freeing memory. The theory goes as
follows:
_Most_ of the time when "try_to_free_pages()" is called, the memory
actually exists, and we call try_to_free_pages() mainly because we want to
make sure that we don't get into a bad situation.
So, how about doing something like:
- if memory is low, allocate the page anyway if you can, but increment a
"bad user" count in current->user->mmuse;
- when entering __alloc_pages(), if "current->user->mmuse > 0", do a
"try_to_free_pages()" if there are any zones that need any help
(otherwise just clear this field).
Think of it as "this user can allocate a few pages, but it's on credit.
They have to be paid back with the appropriate 'try_to_free_pages()'".
Couple this with raising the low-water-mark a bit, and it should work out
fine: the guy who does the "try_to_free_pages()" is always the one that
gets to be credited with it by actually allocating a page. And if kswapd
runs quickly enough that it's not needed, all the better.
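In (completely untested) pseudo-C, it could look something like the
sketch below. Note that "mmuse" does not exist anywhere yet, and that
zone_needs_help() and the rest are just invented shorthand:

	/* Untested sketch.  "mmuse" would be a new per-user debt counter
	 * in struct user_struct; zone_needs_help() is invented shorthand
	 * for "some zone in the zonelist is below its low-water mark". */
	static inline void pay_memory_debt(zonelist_t *zonelist, int gfp_mask)
	{
		if (current->user->mmuse > 0) {
			if (zone_needs_help(zonelist)) {
				try_to_free_pages(gfp_mask);
				current->user->mmuse--;	/* one installment paid */
			} else
				current->user->mmuse = 0;	/* memory is fine: forgive the debt */
		}
	}

	/* ..and on the low-memory path of __alloc_pages(), allocate
	 * anyway if the page is actually there, but go into debt: */
	page = rmqueue(zone, order);
	if (page) {
		current->user->mmuse += 1 << order;	/* borrowed on credit */
		return page;
	}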
Rik? I think this would solve the fairness concerns without the need to
tell the rest of the world about a process trying to free up memory and
causing bad performance..
Linus
* Re: Strange behaviour of pre9-1
From: Juan J. Quintela @ 2000-05-16 0:20 UTC
To: Linus Torvalds; +Cc: linux-mm
>>>>> "linus" == Linus Torvalds <torvalds@transmeta.com> writes:
Hi
linus> So, how about doing something like:
linus> - if memory is low, allocate the page anyway if you can, but increment a
linus> "bad user" count in current->user->mmuse;
linus> - when entering __alloc_pages(), if "current->user->mmuse > 0", do a
linus> "try_to_free_pages()" if there are any zones that need any help
linus> (otherwise just clear this field).
linus> Think of it as "this user can allocate a few pages, but it's on credit.
linus> They have to be paid back with the appropriate 'try_to_free_pages()'".
I was discussing a scheme similar to that with Rik. I have found that
we appear to be trying very hard to get pages without doing any
writing. I think that we need to _wait_ for the pages if we are really
low on memory. Right now, for pathological examples like mmap002 that
dirty a lot of memory very fast, I am observing that we let the page
cache grow until it occupies all the RAM. That is OK while the RAM is
empty. But at that moment, if all the pages are dirty, we call
shrink_mmap, and it will start the async write of all the pages (in
this case, all our memory). shrink_mmap then returns failure and we
end up calling it again and again, until we have started the write of
all the pages to disk. I think that shrink_mmap should return a
special value in the case where it has started a *lot* of writes of
dirty pages; then we need to wait for some IO to complete before we
continue asking for memory. We don't want to write pages to disk
synchronously, because we want the requests to coalesce. But on the
other hand, we don't want to start the writing of 100MB of dirty
pages in one single call to try_to_free_pages. And I suspect that
this is what happens with the current code.
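Something like this is what I have in mind (untested; the SHRINK_*
values and wait_for_some_io() are invented names):

	/* Untested idea: let shrink_mmap() distinguish plain failure
	 * from "failed, but a lot of async write-outs were queued". */
	#define SHRINK_FAILED		0
	#define SHRINK_OK		1
	#define SHRINK_IO_STARTED	2	/* lots of writes now in flight */

	/* in the caller (do_try_to_free_pages()): */
	ret = shrink_mmap(priority, gfp_mask);
	if (ret == SHRINK_IO_STARTED) {
		/* Plenty of IO is already queued: let the requests
		 * coalesce, then wait for some of it to finish instead
		 * of starting even more. */
		run_task_queue(&tq_disk);
		wait_for_some_io();	/* invented helper */
	}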
Comments?
Later, Juan.
--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy
* Re: Strange behaviour of pre9-1
From: Linus Torvalds @ 2000-05-16 0:34 UTC
To: Juan J. Quintela; +Cc: linux-mm
On 16 May 2000, Juan J. Quintela wrote:
>
> I was discussing a scheme similar to that with Rik. I have found that
> we appear to be trying very hard to get pages without doing any
> writing. I think that we need to _wait_ for the pages if we are really
> low on memory.
That is indeed what my suggested shrink_mmap() change does (ie make
"sync_page_buffers()" wait for old locked buffers).
That, together with each "try_to_free_page()" only trying to free a fairly
small number of pages, should make it behave fine. I think one of the
reasons Rik's patch had bad performance is that when it started swapping
out, the "free_before_allocate" trap caused it to swap out _a lot_ by
trapping everybody else into freeing stuff too, even when it might not
have been strictly necessary.
Btw, if you're testing the "wait for locked buffers" case, you should also
remove the "run_task_queue(&tq_disk)" from do_try_to_free_pages(). That
artificially throttles disk performance whether it is needed or not. The
"wait for locked buffers" version of the code will automatically cause the
tq_disk queue to be emptied when it actually turns out that yes, we really
need to start the thing going. Which is exactly what we want.
> Right now, for pathological examples like mmap002 that dirty a lot of
> memory very fast, I am observing that we let the page cache grow
> until it occupies all the RAM. That is OK while the RAM is empty. But
> at that moment, if all the pages are dirty, we call shrink_mmap, and
> it will start the async write of all the pages (in this case, all our
> memory).
Yes. This is what kflushd is there for, and this is what "balance_dirty()"
is supposed to avoid. It may not work (and memory mappings are the worst
case, because the system doesn't _know_ that they are dirty until the
point where it starts looking at the page tables - which is when it's too
late).
In order to truly make this behave more smoothly, we should trap the thing
when it creates a dirty page, which is quite hard the way things are set
up now. Certainly not 2.4.x code.
[ Incidentally, the thing that mmap002 tests is quite rare, so I don't
think we have to have perfect behaviour, we just shouldn't kill
processes the way we do now ]
Linus
* Re: Strange behaviour of pre9-1
From: Juan J. Quintela @ 2000-05-16 0:54 UTC
To: Linus Torvalds; +Cc: linux-mm
>>>>> "linus" == Linus Torvalds <torvalds@transmeta.com> writes:
Hi
linus> That is indeed what my suggested shrink_mmap() change does (ie make
linus> "sync_page_buffers()" wait for old locked buffers).
But your change waits for *all* locked buffers; I want to start
several writes asynchronously and then wait for one of them. That
makes sure that we don't try to write *all* of memory in one
single call to try_to_free_pages. From the vmstat traces that I have
shown, I read the situation as: I have 90MB of pages in the cache on
a machine with 98MB of RAM, and almost all of those pages are dirty.
We then end up calling shrink_mmap a lot of times and starting a lot
of IO, we don't wait for the IO to complete, and so we pass through
the pages 'priority' times.
I think this is the reason that Rik's patch worked better with a
higher priority: he gets more passes through the data, so he spends
more time in shrink_mmap and there are more chances for the IO to
finish. Otherwise it makes no sense that raising the priority gives a
better chance of allocating a page. At least it makes no sense to me.
I will test the rest of your suggestions and report my findings.
linus> Yes. This is what kflushd is there for, and this is what "balance_dirty()"
linus> is supposed to avoid. It may not work (and memory mappings are the worst
linus> case, because the system doesn't _know_ that they are dirty until the
linus> point where it starts looking at the page tables - which is when it's too
linus> late).
I think that no daemon would be able to stop a memory hog like
mmap002; the hog needs to wait *itself* when it allocates memory,
otherwise it will drain all the memory very fast. We don't want all
processes waiting on allocation, but we want memory hogs to wait for
memory and to be the prime candidates for having their pages swapped
out.
linus> In order to truly make this behave more smoothly, we should trap the thing
linus> when it creates a dirty page, which is quite hard the way things are set
linus> up now. Certainly not 2.4.x code.
Yes, I am thinking more along the lines of trapping the allocations:
if an application begins to do an insane number of allocations (like
mmap002), we make *that* application wait for the pages to be written
to swap/disk. My idea is to have shrink_mmap return some value
telling the allocator that this process needs to wait. If I find a
simple solution that works, I will send it.
linus> [ Incidentally, the thing that mmap002 tests is quite rare, so I don't
linus> think we have to have perfect behaviour, we just shouldn't kill
linus> processes the way we do now ]
Yes, I agree here, but this program is based on an application that
got Oopses in pre6 and earlier kernels. I wrote the test to verify
that the code works. I know that what mmap002 does is very rare, but
that is no reason to start Oopsing or killing innocent processes.
That is the whole point of the test; it is not about optimising
performance for it.
Comments?
Later, Juan.
--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy
* Re: Strange behaviour of pre9-1
From: Rik van Riel @ 2000-05-16 1:15 UTC
To: Juan J. Quintela; +Cc: Linus Torvalds, linux-mm
On 16 May 2000, Juan J. Quintela wrote:
> I think this is the reason that Rik's patch worked better with a
> higher priority: he gets more passes through the data, so he spends
> more time in shrink_mmap and there are more chances for the IO to
> finish. Otherwise it makes no sense that raising the priority gives a
> better chance of allocating a page. At least it makes no sense to me.
The reason that starting with a higher priority works is that
shrink_mmap() will fail more easily on the first run, causing
swap_out() to refill the LRU queue and keeping freeable pages
around.
Of course this is no more than a demonstration that:
- we need to have some freeable pages around
- the current fail-through behaviour is not right
> I think that no daemon would be able to stop a memory hog like
> mmap002; the hog needs to wait *itself* when it allocates memory,
> otherwise it will drain all the memory very fast. We don't want all
> processes waiting on allocation, but we want memory hogs to wait for
> memory and to be the prime candidates for having their pages swapped
> out.
Agreed. How could we achieve this?
> linus> In order to truly make this behave more smoothly, we should trap the thing
> linus> when it creates a dirty page, which is quite hard the way things are set
> linus> up now. Certainly not 2.4.x code.
>
> Yes, I am thinking more along the lines of trapping the allocations:
> if an application begins to do an insane number of allocations (like
> mmap002), we make *that* application wait for the pages to be written
> to swap/disk.
"some insane number" is probably not good enough (how would
we detect this? how do we know that it isn't a process that
slept for the last minute and needs to be swapped in because
the user switched desktops while running mmap002?)
The problem seems to be that we leave dirty pages lying around
instead of waiting on them. Waiting on dirty buffers has a number
of effects:
- try_to_free_pages() will take a bit longer, so we have to make
  sure we don't wait too often (only once per shrink_mmap run?)
- we'll have a better chance of replacing the right page, instead
  of a clean page from an innocent process
- that in turn should make sure the innocent process has fewer
  page faults than it has now, making it run faster and letting
  the memory hog proceed faster because of less disk seek time
> linus> [ Incidentally, the thing that mmap002 tests is quite rare, so I don't
> linus> think we have to have perfect behaviour, we just shouldn't kill
> linus> processes the way we do now ]
>
> Yes, I agree here, but this program is based on an application that
> got Oopses in pre6 and earlier kernels.
Wasn't mmap002 based on the behaviour of a real-world program
you were working on?
Also, wouldn't video streaming and data acquisition give similar
results in some cases?
I really think we should support this kind of workload. It is
within our reach and some people are actually running this kind
of application...
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
* Re: Strange behaviour of pre9-1
From: Linus Torvalds @ 2000-05-16 13:53 UTC
To: Juan J. Quintela; +Cc: linux-mm
On 16 May 2000, Juan J. Quintela wrote:
> Hi
>
> linus> That is indeed what my suggested shrink_mmap() change does (ie make
> linus> "sync_page_buffers()" wait for old locked buffers).
>
> But your change waits for *all* locked buffers; I want to start
> several writes asynchronously and then wait for one of them.
This is pretty much exactly what my change does - no need to be
excessively clever.
Remember, we walk the LRU list from the "old" end, and whenever we hit a
dirty buffer we will write it out asynchronously. AND WE WILL MOVE IT TO
THE TOP OF THE LRU QUEUE!
Which means that we will only see actual locked buffers if we have gotten
through the whole LRU queue without giving our write-outs time to
complete: which is exactly the situation where we do want to wait for
them.
So normally, we will write out a ton of buffers, and then wait for the
oldest one. You're obviously right that we may end up waiting for more
than one buffer, but when that happens it will be the right thing to do:
whenever we're waiting for the oldest buffer to flush, the others are also
likely to have flushed (remember - we started them pretty much at the same
time, because we've tried hard to delay the 'run_task_queue(&tq_disk)' too).
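Hand-waving the loop into code (this is the concept, not the literal
source; the helper names here are made up):

	/* Conceptual shrink_mmap() walk, oldest pages first. */
	while (count-- > 0 && (page = oldest_lru_page()) != NULL) {
		if (page_is_dirty(page)) {
			/* Start the write-out asynchronously, and give
			 * the page a second life at the young end. */
			start_async_writeout(page);
			move_to_young_end(page);
			continue;
		}
		if (PageLocked(page)) {
			/* We only get here after going around the whole
			 * list and catching up with our own write-outs -
			 * exactly when waiting is the right thing to do. */
			wait_on_page(page);
			continue;
		}
		if (try_to_release(page))
			return 1;	/* freed one */
	}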
> That makes sure that we don't try to write *all* of memory in one
> single call to try_to_free_pages.
Well, right now we cannot avoid doing that. The reason is simply that the
current MM layer does not know how many pages are dirty. THAT is a problem,
but it's not a problem we're going to solve for 2.4.x.
If you actually were to use "write()", or if the load on the machine were
more balanced than just one large mmap002, you wouldn't see the
everything-at-once behaviour, but ..
> Yes, I agree here, but this program is based on an application that
> got Oopses in pre6 and earlier kernels. I wrote the test to verify
> that the code works. I know that what mmap002 does is very rare, but
> that is no reason to start Oopsing or killing innocent processes.
> That is the whole point of the test; it is not about optimising
> performance for it.
Absolutely. We apparently do not get the oopses any more, but the
out-of-memory behaviour should be fixed. And never fear, we'll get it
fixed.
Linus
* Re: Strange behaviour of pre9-1
From: Juan J. Quintela @ 2000-05-16 14:20 UTC
To: Linus Torvalds; +Cc: linux-mm
>>>>> "linus" == Linus Torvalds <torvalds@transmeta.com> writes:
linus> Remember, we walk the LRU list from the "old" end, and whenever we hit a
linus> dirty buffer we will write it out asynchronously. AND WE WILL MOVE IT TO
linus> THE TOP OF THE LRU QUEUE!
linus> Which means that we will only see actual locked buffers if we have gotten
linus> through the whole LRU queue without giving our write-outs time to
linus> complete: which is exactly the situation where we do want to wait for
linus> them.
Linus, I think that we want to wait for a buffer if we have started
the write of a lot of them. I.e., if the last pass of shrink_mmap has
found a lot of dirty buffers (put a magic number here: 100, 1000, a
percentage of the count parameter...), then we want to wait for some
of those buffers to be written before we continue starting more
writes. I have made that change here and it appears to work at least
as well as the current vanilla kernel, and with a bit of tuning I
think I can make it work better; I will try to send a patch later
today. I have now also managed to shrink the cache: when we are low
on memory, we begin to shrink the cache at the same time that we
begin to swap. With my patch I shrink it *too* aggressively; I need
to tune that a bit.
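The core of the change is just a counter in the shrink_mmap pass,
something like this (MAX_ASYNC_WRITES is one of those made-up magic
numbers I am still tuning, and the helper names are shorthand):

	/* Inside the shrink_mmap() loop: nwrite counts the async
	 * write-outs started in this pass. */
	if (buffer_is_dirty(bh)) {
		start_async_write(bh);			/* as before */
		if (++nwrite > MAX_ASYNC_WRITES) {
			run_task_queue(&tq_disk);	/* push the queued IO */
			wait_on_buffer(bh);		/* throttle ourselves */
			nwrite = 0;
		}
	}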
linus> So normally, we will write out a ton of buffers, and then wait for the
linus> oldest one. You're obviously right that we may end up waiting for more
linus> than one buffer, but when that happens it will be the right thing to do:
linus> whenever we're waiting for the oldest buffer to flush, the others are also
linus> likely to have flushed (remember - we started them pretty much at the same
linus> time, because we've tried hard to delay the 'run_task_queue(&tq_disk)' too).
That is right if only *some* of the pages are dirty, but if *almost
all* the pages are *dirty* we don't want that, because in the round
where we start the writes of the page cache we will free all the
clean pages of the innocent programs (bash, less, daemons, ...), and
we do that too fast to let those programs soft-fault their pages back
in.
linus> Well, right now we cannot avoid doing that. The reason is simply that the
linus> current MM layer does not know how many pages are dirty. THAT is a problem,
linus> but it's not a problem we're going to solve for 2.4.x.
If my patch works, it will solve some of those problems. If I have
some success I will let you know.
Later, Juan.
--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy
* Re: Strange behaviour of pre9-1
From: Roger Larsson @ 2000-05-16 16:37 UTC
To: Linus Torvalds; +Cc: Juan J. Quintela, linux-mm
Linus Torvalds wrote:
>
> On 16 May 2000, Juan J. Quintela wrote:
> > Hi
> >
> > linus> That is indeed what my suggested shrink_mmap() change does (ie make
> > linus> "sync_page_buffers()" wait for old locked buffers).
> >
> > But your change waits for *all* locked buffers; I want to start
> > several writes asynchronously and then wait for one of them.
>
> This is pretty much exactly what my change does - no need to be
> excessively clever.
>
> Remember, we walk the LRU list from the "old" end, and whenever we hit a
> dirty buffer we will write it out asynchronously. AND WE WILL MOVE IT TO
> THE TOP OF THE LRU QUEUE!
>
Not in my recently released patch [Improved LRU shrink_mmap...].
It keeps the dirty (to-be-cleaned) pages at the old end.
I do not scan for dirty pages, but that can easily be added - tonight.
/RogerL
--
Home page:
http://www.norran.net/nra02596/
* Re: Strange behaviour of pre9-1
From: Rik van Riel @ 2000-05-16 0:55 UTC
To: Juan J. Quintela; +Cc: Linus Torvalds, linux-mm
On 16 May 2000, Juan J. Quintela wrote:
> linus> So, how about doing something like:
>
> linus> - if memory is low, allocate the page anyway if you can, but increment a
> linus> "bad user" count in current->user->mmuse;
> linus> - when entering __alloc_pages(), if "current->user->mmuse > 0", do a
> linus> "try_to_free_pages()" if there are any zones that need any help
> linus> (otherwise just clear this field).
>
> linus> Think of it as "this user can allocate a few pages, but it's on credit.
> linus> They have to be paid back with the appropriate 'try_to_free_pages()'".
I don't think this will help. Imagine a user firing up 'ls', that
will need more than one page. Besides, the difference isn't that
we have to free pages, but that we have to deal with a *LOT* of
dirty pages at once, unexpectedly.
> I was discussing a scheme similar to that with Rik. I have found
> that we appear to be trying very hard to get pages without doing
> any writing. I think that we need to _wait_ for the pages if we
> are really low on memory.
Indeed. I've seen vmstat reports where the most common action
just before OOM is _pagein_. This indicates that shrink_mmap()
was very busy skipping over the dirty pages and dropping clean
pages which were needed again a few milliseconds later...
The right solution is to make sure the dirty pages are flushed
out.
> We don't want to write pages to disk synchronously, because we want
> the requests to coalesce. But on the other hand, we don't want to
> start the writing of 100MB of dirty pages in one single call to
> try_to_free_pages. And I suspect that this is what happens with the
> current code.
I think we may be able to use the 'priority' argument to
shrink_mmap() to determine the maximum number of pages to sync out at
once (more or less; since this is pretty arbitrary anyway, we don't
need to be that precise).
What we may want to do is wait on one page per shrink_mmap() call,
and only "start waiting" after count has been decremented to less
than half of its original value - roughly as in the sketch after the
list below.
This way we'll:
- make sure we won't flush too many things out at once
- allow for some IO clustering to happen
- keep latency decent
- try hard to flush out the _right_ page
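Roughly like this (the names and constants are as arbitrary as the
priority value itself):

	/* Sketch: bound the write-outs by priority, and wait at most
	 * once per shrink_mmap() call, only in the second half of the
	 * scan. */
	int max_sync = MAX_SYNC_PAGES >> priority;	/* arbitrary bound */
	int orig_count = count, waited = 0;

	while (count-- > 0) {
		page = next_lru_page();
		if (page_is_dirty(page) && max_sync-- > 0) {
			start_async_writeout(page);
		} else if (PageLocked(page) && !waited &&
			   count < orig_count / 2) {
			wait_on_page(page);	/* the single wait */
			waited = 1;
		}
		/* clean pages are freed as before.. */
	}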
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
* Re: Strange behaviour of pre9-1
From: Linus Torvalds @ 2000-05-16 14:03 UTC
To: Rik van Riel; +Cc: Juan J. Quintela, linux-mm
On Mon, 15 May 2000, Rik van Riel wrote:
> >
> > linus> Think of it as "this user can allocate a few pages, but it's on credit.
> > linus> They have to be paid back with the appropriate 'try_to_free_pages()'".
>
> I don't think this will help. Imagine a user firing up 'ls', that
> will need more than one page. Besides, the difference isn't that
> we have to free pages, but that we have to deal with a *LOT* of
> dirty pages at once, unexpectedly.
The reason it will help is that we can use the "credit" to balance out the
spikes.
Think of how people use credit cards. They get paid once or twice a month,
and then they have lots of money. But sometimes there's a big item like a
cruise in the Caribbean, and that rum ended up being more expensive than
you thought.. Not to mention all those trinkets.
So what do you do? Do you pay it off immediately? Maybe you cannot afford
to, right then. You'll have to pay it off partially each month, but you
don't have to pay it all at once. And you have to pay interest.
This is a similar situation. We will have to pay interest (== free more
pages than we actually allocated), and we'll have to do it each month (==
call try_to_free_pages() on every allocation that happens while we have an
outstanding balance). But we don't have to pay it all back immediately (==
a single negative return from try_to_free_pages() does not kill us).
Right now "try_to_free_pages()" tries to always "pay back" something like
8 or 16 pages for each page we "borrowed". That's good. But let's face it,
we might be asked to pay back during a market slump when all the pages are
dirty, and while we have the "money", it's locked up right now. It would
be ok to pay off just one or two pages (== miniumum monthly payment), as
long as we pay back the rest later.
See?
Linus
* Re: Strange behaviour of pre9-1
From: Rik van Riel @ 2000-05-16 1:03 UTC
To: Linus Torvalds; +Cc: Juan J. Quintela, linux-mm
On Mon, 15 May 2000, Linus Torvalds wrote:
> The fact that Rik's patch performs so badly is interesting in
> itself, and I thus removed it from my tree.
This may be because the different try_to_free_pages() calls end
up waiting for each other (to complete IO?).
If the VM were constructed right, we would never have the
situation where a half-dozen apps are all waiting in
try_to_free_pages() simultaneously.
The bug is, IMHO, the fact that shrink_mmap() frees the
wrong pages by skipping over buffer pages. This causes
"innocent" apps to have pagefaults they didn't deserve
==> slowdown.
> _Most_ of the time when "try_to_free_pages()" is called, the
> memory actually exists, and we call try_to_free_pages() mainly
> because we want to make sure that we don't get into a bad
> situation.
True.
> So, how about doing something like:
>
> - if memory is low, allocate the page anyway if you can, but increment a
> "bad user" count in current->user->mmuse;
> - when entering __alloc_pages(), if "current->user->mmuse > 0", do a
> "try_to_free_pages()" if there are any zones that need any help
> (otherwise just clear this field).
>
> Think of it as "this user can allocate a few pages, but it's on credit.
> They have to be paid back with the appropriate 'try_to_free_pages()'".
I don't think this will work if we keep stealing the wrong pages
from innocent, small processes with lots of clean pages (ie. bash,
vi, emacs, ...).
> Rik? I think this would solve the fairness concerns without the
> need to tell the rest of the world about a process trying to
> free up memory and causing bad performance..
The main problem now seems to be bad page replacement and a
practically unbounded wait time inside try_to_free_pages().
If we fix those, I think it should be possible to move back
to a slightly more conservative (and safe) model (like what
I had in my patch ... you might argue it is too conservative,
slow or whatever, but it should be the more robust one).
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/