Re: [patch] improve streaming I/O [bug in shrink

linux-mm.kvack.org archive mirror

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
@ 2000-06-13  8:10 Roger Larsson
  0 siblings, 0 replies; 28+ messages in thread
From: Roger Larsson @ 2000-06-13  8:10 UTC (permalink / raw)
  To: linux-mm

> On Mon, 12 Jun 2000, Stephen C. Tweedie wrote:
> > On Mon, Jun 12, 2000 at 11:46:09PM +0200, Zlatko Calusic wrote:
> > > 
> > > This simple one-liner solves a long standing problem in Linux VM.
> > > While searching for a discardable page in shrink_mmap() Linux was too
> > > easily failing and subsequently falling back to swapping. The problem
> > > was that shrink_mmap() counted pages from the wrong zone, and in case
> > > of balancing a relatively smaller zone (e.g. DMA zone on a 128MB
> > > computer) "count" would be mistakenly spent dealing with pages from
> > > the wrong zone. The net effect of all this was spurious swapping that
> > > hurt performance greatly.
> > 
> > Nice --- it might also explain some of the excessive kswap CPU 
> > utilisation we've seen reported now and again.
> 
> Indeed. And to be honest, the patch can be made even simpler.
> 
> We can simply move the test up to above the count--, so we won't
> start IO for the "wrong" zones either.
> 
> There's only one serious bug left with the current shrink_mmap,
> a bug which appears to be easy to trigger with this patch, but
> still there without it.
> 
> Consider the case where only one zone has free_pages < pages_high,
> but all the pages in the LRU queue are from the other zone or not
> freeable (ie. with pagetable mapping)...
> 
> In those cases shrink_mmap() can loop forever. We probably want to
> add a "maxscan" variable, initialised to nr_lru_pages, which is
> decremented on every iteration through the loop to prevent us from
> triggering this bug.


And I have already released such a patch.
See "reduce swap due to shrink_mmap failures".

But it is probable that we should clean pages (= start I/O) even on
zones with no pressure - like Rajagopal reported.

/RogerL
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


[parent not found: <8i3qe8$lltbv$1@fido.engr.sgi.com>]

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
       [not found] <8i3qe8$lltbv$1@fido.engr.sgi.com>
@ 2000-06-14  6:17 ` Rajagopal Ananthanarayanan
  0 siblings, 0 replies; 28+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-06-14  6:17 UTC (permalink / raw)
  To: Rik van Riel, linux-mm

Rik van Riel wrote:
> 
	[ ... ]
> 
> Indeed. And to be honest, the patch can be made even simpler.
> 
> We can simply move the test up to above the count--, so we won't
> start IO for the "wrong" zones either.

No, I think that leads to other problems. Almost a month ago,
when pre6-8 was having serious issues here, I also happened
to chance on the same set of problems. And here's the summary
of the discussions with Linus: (1) shrink_mmap should
not give up having tried one of the pages from a balanced zone
(2) regardless of zone being balanced or not, memory pressure
should trigger I/O. Otherwise the buffer-heads attached to the
pages in the balanced zones can never be recovered in time.
Here's a quote from Linus' message:

------------ Begin Quote ------------------------------------------
Linus Torvalds wrote:
> 
        [ ... ]
> 
> The "don't page out pages from zones that don't need it" test is a good
> test, but it turns out that it triggers a rather serious problem: the way
> the buffer cache dirty page handling is done is by having shrink_mmap() do
> a "try_to_free_buffers()" on the pages it encounters that have
> "page->buffer" set.
> 
> And doing that is quite important, because without that logic the buffers
> don't get written to disk in a timely manner, nor do already-written
> buffers get refiled to their proper lists. So you end up being "out of
> memory" - not because the machine is really out of memory, but because
> those buffers have a tendency to stick around if they aren't constantly
> looked after by "try_to_free_buffers()".
> 
> So the real fix ended up being to re-order the tests in shrink_mmap() a
> bit, so that try_to_free_buffers() is called even for pages that are on
> a good zone that doesn't need any real balancing..
------------------------- End Quote ------------------------
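
The re-ordering described in the quote can be sketched as follows. This is a minimal illustration, not the actual shrink_mmap() code: struct sim_buf_page, sim_try_to_free_buffers() and sim_scan_page() are hypothetical stand-ins, and the real try_to_free_buffers() can fail, which this sketch ignores.

```c
/* Hypothetical stand-in for a page on the LRU list. */
struct sim_buf_page {
    int has_buffers;      /* models page->buffers being set */
    int zone_balanced;    /* 1 if the page's zone needs no balancing */
    int buffers_visited;  /* set once the buffers have been looked after */
    int freed;            /* set when the page is reclaimed */
};

/* Stand-in for try_to_free_buffers(): only records the visit. */
static void sim_try_to_free_buffers(struct sim_buf_page *page)
{
    page->buffers_visited = 1;
}

/* Returns 1 if the page was reclaimed, 0 if it was skipped. */
static int sim_scan_page(struct sim_buf_page *page)
{
    /* Buffer handling comes first, whatever the state of the zone. */
    if (page->has_buffers)
        sim_try_to_free_buffers(page);

    /* Only afterwards skip pages from zones that need no balancing. */
    if (page->zone_balanced)
        return 0;

    page->freed = 1;
    return 1;
}
```

A page with buffers in a balanced zone is not reclaimed here, but its buffers are still visited, which is the property the quote asks for.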

.... Back to Rik's message ....
> 
> There's only one serious bug left with the current shrink_mmap,
> a bug which appears to be easy to trigger with this patch, but
> still there without it.
> 
> Consider the case where only one zone has free_pages < pages_high,
> but all the pages in the LRU queue are from the other zone or not
> freeable (ie. with pagetable mapping)...
> 
> In those cases shrink_mmap() can loop forever. We probably want to
> add a "maxscan" variable, initialised to nr_lru_pages, which is
> decremented on every iteration through the loop to prevent us from
> triggering this bug.


This, I agree. And something I gave up trying to bring up earlier as well:
There should be some mechanism to check that enough pages have been examined
in the presence of pages from balanced zones.



--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------

* [patch] improve streaming I/O [bug in shrink_mmap()]
@ 2000-06-12 21:46 Zlatko Calusic
  2000-06-12 22:29 ` Stephen C. Tweedie
  0 siblings, 1 reply; 28+ messages in thread
From: Zlatko Calusic @ 2000-06-12 21:46 UTC (permalink / raw)
  To: alan; +Cc: Linux MM List, Linux Kernel List

Hi!

This simple one-liner solves a long standing problem in Linux VM.
While searching for a discardable page in shrink_mmap() Linux was too
easily failing and subsequently falling back to swapping. The problem
was that shrink_mmap() counted pages from the wrong zone, and in case
of balancing a relatively smaller zone (e.g. DMA zone on a 128MB
computer) "count" would be mistakenly spent dealing with pages from
the wrong zone. The net effect of all this was spurious swapping that
hurt performance greatly.

I tested this patch very thoroughly here and it doesn't reveal any bad
behavior. I think that applying the patch is the first and most
important step towards a faster and more balanced kernel. Stay tuned for
more improvements.

Benchmarking reveals a nice improvement for the streaming I/O
applications:

    -------Sequential Output-------- ---Sequential Input-- --Random--
    -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
 MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU

*** ac-16:

400 17380 74.4 13887 14.9  6203  6.8 14452 46.4 15743 12.3 129.9  1.0
400 15134 65.3 15085 15.9  5872  6.5 13281 40.8 18943 14.4 124.4  1.0

*** ac-16 with patch applied:

400 17426 75.8 17919 18.6  6518  7.7 16294 50.0 21038 16.8 132.0  0.8
400 16915 73.3 17502 17.9  6515  7.2 16499 51.4 21148 15.7 131.0  1.4
               ^^^^^       ^^^^                 ^^^^^

Index: 24001.23/mm/filemap.c
--- 24001.23/mm/filemap.c Mon, 12 Jun 2000 21:03:48 +0200 zcalusic (linux/F/b/16_filemap.c 1.6.1.3.2.4.1.1.2.2.2.1.1.21.1.1.3.2.3.1.3.1.2.1 644)
+++ 24001.24/mm/filemap.c Mon, 12 Jun 2000 21:51:53 +0200 zcalusic (linux/F/b/16_filemap.c 1.6.1.3.2.4.1.1.2.2.2.1.1.21.1.1.3.2.3.1.3.1.2.2 644)
@@ -365,8 +365,11 @@
 		 * Page is from a zone we don't care about.
 		 * Don't drop page cache entries in vain.
 		 */
-		if (page->zone->free_pages > page->zone->pages_high)
+		if (page->zone->free_pages > page->zone->pages_high) {
+			/* the page from the wrong zone doesn't count */
+			count++;
 			goto unlock_continue;
+		}
 
 		/* Take the pagecache_lock spinlock held to avoid
 		   other tasks to notice the page while we are looking at its
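
The effect of the count++ above can be sketched with a small user-space model. It assumes a flat array in place of the kernel's LRU list; struct sim_zone, struct sim_page and sim_shrink() are hypothetical names for illustration and are not part of mm/filemap.c.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's zone and page structures. */
struct sim_zone {
    long free_pages;
    long pages_high;
};

struct sim_page {
    const struct sim_zone *zone;
    int freeable;               /* 1 if the page could be dropped */
};

/*
 * Walk the simulated LRU with a scan budget of `count`.
 * With refund_wrong_zone set, pages from zones that are already above
 * pages_high do not consume budget (the effect of the count++).
 * Returns the number of pages freed from zones under pressure.
 */
static int sim_shrink(const struct sim_page *lru, size_t nr_pages,
                      int count, int refund_wrong_zone)
{
    int freed = 0;
    size_t i;

    for (i = 0; i < nr_pages && count > 0; i++) {
        const struct sim_page *page = &lru[i];

        count--;
        if (page->zone->free_pages > page->zone->pages_high) {
            if (refund_wrong_zone)
                count++;        /* the page from the wrong zone doesn't count */
            continue;
        }
        if (page->freeable)
            freed++;
    }
    return freed;
}
```

With six pages from a balanced zone queued ahead of two freeable pages from the zone under pressure and count = 4, the unpatched accounting frees nothing, while the refunding variant frees both.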

Regards,
-- 
Zlatko

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-12 21:46 Zlatko Calusic
@ 2000-06-12 22:29 ` Stephen C. Tweedie
  2000-06-12 23:04   ` Rik van Riel
  2000-06-13 15:08   ` Andrea Arcangeli
  0 siblings, 2 replies; 28+ messages in thread
From: Stephen C. Tweedie @ 2000-06-12 22:29 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: alan, Linux MM List, Linux Kernel List

Hi,

On Mon, Jun 12, 2000 at 11:46:09PM +0200, Zlatko Calusic wrote:
> 
> This simple one-liner solves a long standing problem in Linux VM.
> While searching for a discardable page in shrink_mmap() Linux was too
> easily failing and subsequently falling back to swapping. The problem
> was that shrink_mmap() counted pages from the wrong zone, and in case
> of balancing a relatively smaller zone (e.g. DMA zone on a 128MB
> computer) "count" would be mistakenly spent dealing with pages from
> the wrong zone. The net effect of all this was spurious swapping that
> hurt performance greatly.

Nice --- it might also explain some of the excessive kswapd CPU
utilisation we've seen reported now and again.

--Stephen

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-12 22:29 ` Stephen C. Tweedie
@ 2000-06-12 23:04   ` Rik van Riel
  2000-06-13 15:08   ` Andrea Arcangeli
  1 sibling, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2000-06-12 23:04 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Zlatko Calusic, alan, Linux MM List, Linux Kernel List

On Mon, 12 Jun 2000, Stephen C. Tweedie wrote:
> On Mon, Jun 12, 2000 at 11:46:09PM +0200, Zlatko Calusic wrote:
> > 
> > This simple one-liner solves a long standing problem in Linux VM.
> > While searching for a discardable page in shrink_mmap() Linux was too
> > easily failing and subsequently falling back to swapping. The problem
> > was that shrink_mmap() counted pages from the wrong zone, and in case
> > of balancing a relatively smaller zone (e.g. DMA zone on a 128MB
> > computer) "count" would be mistakenly spent dealing with pages from
> > the wrong zone. The net effect of all this was spurious swapping that
> > hurt performance greatly.
> 
> Nice --- it might also explain some of the excessive kswap CPU 
> utilisation we've seen reported now and again.

Indeed. And to be honest, the patch can be made even simpler.

We can simply move the test up to above the count--, so we won't
start IO for the "wrong" zones either.

There's only one serious bug left with the current shrink_mmap,
a bug which appears to be easy to trigger with this patch, but
still there without it.

Consider the case where only one zone has free_pages < pages_high,
but all the pages in the LRU queue are from the other zone or not
freeable (ie. with pagetable mapping)...

In those cases shrink_mmap() can loop forever. We probably want to
add a "maxscan" variable, initialised to nr_lru_pages, which is
decremented on every iteration through the loop to prevent us from
triggering this bug.
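
A minimal sketch of the "maxscan" idea, assuming a circular array in place of the kernel's LRU list and assuming that pages from zones that need no balancing are not charged against count. The names struct sim_lru_page and sim_shrink_bounded() are hypothetical; this is not the shrink_mmap() code itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page on the LRU list. */
struct sim_lru_page {
    int zone_needs_pages;   /* 1 if the page's zone has free_pages < pages_high */
    int freeable;           /* 0 for e.g. pages with a pagetable mapping */
};

/*
 * Scan a circular LRU. `count` is only charged for pages from zones that
 * need balancing, so on its own it may never reach zero. `maxscan` starts
 * at nr_lru_pages and is decremented on every iteration, which bounds the
 * loop to a single pass over the list.
 * Returns the number of pages freed; *scanned reports the iterations done.
 */
static int sim_shrink_bounded(const struct sim_lru_page *lru,
                              size_t nr_lru_pages, int count, size_t *scanned)
{
    size_t maxscan = nr_lru_pages;
    size_t pos = 0;
    int freed = 0;

    *scanned = 0;
    while (count > 0 && maxscan > 0) {
        const struct sim_lru_page *page = &lru[pos];

        pos = (pos + 1) % nr_lru_pages;
        maxscan--;
        (*scanned)++;

        if (!page->zone_needs_pages)
            continue;           /* wrong zone: not charged against count */
        count--;
        if (page->freeable)
            freed++;
    }
    return freed;
}
```

When every page on the list belongs to a zone that needs no balancing, count never drops, and it is maxscan alone that ends the scan after nr_lru_pages iterations.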

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-12 22:29 ` Stephen C. Tweedie
  2000-06-12 23:04   ` Rik van Riel
@ 2000-06-13 15:08   ` Andrea Arcangeli
  2000-06-13 17:08     ` Juan J. Quintela
  2000-06-13 19:20     ` Rik van Riel
  1 sibling, 2 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-13 15:08 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Zlatko Calusic, alan, Linux MM List, Linux Kernel List

On Mon, 12 Jun 2000, Stephen C. Tweedie wrote:

>Nice --- it might also explain some of the excessive kswap CPU 
>utilisation we've seen reported now and again.

You have more kswapd load for sure due to the strict zone approach. It may
not be noticeable but it's real. You boot, you allocate all the normal zone
in cache doing some fs load, then you start netscape and you allocate the
lower 16mbyte of RAM into it, then doing some other thing you trigger
kswapd to run because the lower 16mbyte have also been allocated now. Then
netscape exits and releases all the lower 16m but kswapd keeps shrinking
the normal zone (this shouldn't happen and it wouldn't happen with
classzone design).

I think Linus's argument about the above scenario is simply that the above
isn't going to happen very often, but how can I ignore this broken
behaviour? I hate code that works in the common case but that has
drawbacks in the corner case. It would be better if I didn't know what
the current code is doing, then I could accept it more easily.

Andrea


* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 15:08   ` Andrea Arcangeli
@ 2000-06-13 17:08     ` Juan J. Quintela
  2000-06-13 19:09       ` Andrea Arcangeli
  2000-06-13 19:20     ` Rik van Riel
  1 sibling, 1 reply; 28+ messages in thread
From: Juan J. Quintela @ 2000-06-13 17:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Zlatko Calusic, alan, Linux MM List,
	Linux Kernel List

>>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:

andrea> You have more kswapd load for sure due to the strict zone approach. It may
andrea> not be noticeable but it's real. You boot, you allocate all the normal zone
andrea> in cache doing some fs load, then you start netscape and you allocate the
andrea> lower 16mbyte of RAM into it, then doing some other thing you trigger
andrea> kswapd to run because the lower 16mbyte have also been allocated now. Then
andrea> netscape exits and releases all the lower 16m but kswapd keeps shrinking
andrea> the normal zone (this shouldn't happen and it wouldn't happen with
andrea> classzone design).

Linus's argument is that you should never get _all_ the normal zone
allocated and nothing of the DMA zone.  You need to balance the
allocations modulo the .free_pages, .pages_low etc. of each zone....

The problem with the current algorithm is when we have allocated all
the pages in one zone and all the pages in the LRU list are from a
different zone.  We need to do some swapping and not write to disk
_all_ the pages of the rest of the zones (that happen to be in the
LRU list).  See the comments from riel and Roger Larsson in this list.

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 17:08     ` Juan J. Quintela
@ 2000-06-13 19:09       ` Andrea Arcangeli
  2000-06-13 19:32         ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-13 19:09 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Stephen C. Tweedie, Zlatko Calusic, alan, Linux MM List,
	Linux Kernel List

On 13 Jun 2000, Juan J. Quintela wrote:

>>>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:
>
>andrea> You have more kswapd load for sure due to the strict zone approach. It may
>andrea> not be noticeable but it's real. You boot, you allocate all the normal zone
>andrea> in cache doing some fs load, then you start netscape and you allocate the
>andrea> lower 16mbyte of RAM into it, then doing some other thing you trigger
>andrea> kswapd to run because the lower 16mbyte have also been allocated now. Then
>andrea> netscape exits and releases all the lower 16m but kswapd keeps shrinking
>andrea> the normal zone (this shouldn't happen and it wouldn't happen with
>andrea> classzone design).
>
>Linus's argument is that you should never get _all_ the normal zone
>allocated and nothing of the DMA zone.  You need to balance the
>allocations modulo the .free_pages, .pages_low etc. of each zone....

Of course I was just assuming Linus's point that you raised (I was just
running in my mind plain 2.4.0-test1-ac vm) even if it's irrelevant for
this example (and that's why I didn't focus on the fact there was still
some memory free in the normal zone before going to allocate from the dma
zone). In what I described above I just assumed that _not_ all the normal
zone has been allocated, but that we stopped eating from there as soon as
we triggered the watermark (high/low/min whatever you want, I don't mind).

So far so good. Then you also allocated most of the DMA zone because you
started netscape. As you prefer to point out at this point there was still
"pages_min" pages free in the normal zone.

Then you do some more I/O and allocate some cache, then kswapd triggers to
try to free some memory because all zones are under the watermark. OK?

Then netscape exits and releases 10mbyte from the DMA zone _but_ kswapd
continues to shrink the normal zone, why??? -> because the MM doesn't have
enough information in order to do the right thing, that's all.

It's broken, period, and you can't fix that behaviour without changing
design and going classzone based. You can say nobody will ever notice it
with a mere three zones, I don't have numbers to say otherwise at this
moment, but for sure my own kernel will react right to that corner case
too (it may even run slower in the common case but I really don't mind
about performance, I mind about correctness first).

I only mentioned another one of the buggy behaviours that I see. As for the
other fact that you don't empty the zone_normal before falling back into
zone_dma, you all agree it's a feature (while IMHO it's a misfeature but
not severe, but ok for now I will also assume that one was a feature to
avoid further flames).

Now I'd like to hear if you consider what I described in these two emails
a feature too. If you consider it a feature I'll just tell you the next
bad behaviour that happens in the LRU aging (and that's not exactly the
problem you are describing below but it's only a little bit more subtle).

>The problem with the current algorithm is when we have allocated all
>the pages in one zone and all the pages in the LRU list are from a
>different zone.  We need to do some swapping and not write to disk

You don't need to do any swapping! Please read carefully the below stuff:

Assume the DMA zone is filled by cache. Assume the normal zone is allocated
in anonymous and kernel memory (so not in lru).

Then when you release some memory from the DMA zone you _have_ to
understand that you did progress also for the normal zone because you
_did_ progress!!! Now the next alloc_pages(GFP_KERNEL) will succeed
because you have memory free in the DMA zone. Do you agree that you did
some progress or not?

Classzone understands you did progress in the DMA zone and it doesn't
remain stuck trying to free cache from the normal zone. That case has been
handled _perfectly_ for ages by the classzone patch and it instead breaks
with the current kernel (both 2.4.0-test1 and latest ac one).

Fixing it by swapping out stuff from the normal zone is even worse. It
just means that you'll start swapping out stuff when you still have around
16mbyte of cache freeable and potentially very old!! See? Only way to fix
this is to change the design of the memory balancing... as I just did with
the classzone patch when I noticed what was going on last month.

At this moment I won't buy the current design and I'll stick with
classzone until somebody will offer me a design solution that handles all
the cases right as classzone does (and I think there's no other way than
what I am just doing however I can't exclude there's a smarter solution so
think about it!). I believe if people understood what the current
allocator is doing they wouldn't agree with it either.

I'd love if Rik would do his patch where he splits each zone in NR_CPUS
zones so that the drawbacks that are now in the darkness (because there
are only 2 or 3 zones) would see more light.

Andrea


* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 19:09       ` Andrea Arcangeli
@ 2000-06-13 19:32         ` Rik van Riel
  2000-06-13 23:07           ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-13 19:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List

On Tue, 13 Jun 2000, Andrea Arcangeli wrote:

> Then you do some more I/O and allocate some cache, then kswapd
> triggers to try to free some memory because all zones are under
> the watermark. OK?

Ahhh, but kswapd will *only* trigger the number of pages we
need to reach zone->pages_low (in the latest -ac patches).

> Then netscape exits and release 10mbyte from the DMA zone _but_
> kswapd continues to shrink the normal zone, why??? -> because
> the MM doesn't have enough information in order to do the right
> thing, that's all.

In this case kswapd will only shrink the normal zone *once*.
After the normal zone has reached zone->pages_low, we will:
1) stop freeing pages in zone_normal
2) allocate all new allocations from zone_dma, until that
   zone hits the low watermark as well

I think you're overlooking the fact that kswapd's freeing of
pages is something that occurs only *once*...

> (it may even run slower in the common case but I really don't
> mind about performance, I mind about correctness first).

Ermm, wasn't your motivation for the classzone idea
_performance_??  (at least, that's what I read from
the rest of your email)

> Assume the DMA zone is filled by cache. Assume the normal zone
> is allocated in anonymous and kernel memory (so not in lru).
> 
> Then when you release some memory from the DMA zone you _have_
> to understand that you did progress also for the normal zone
> because you _did_ progress!!!

That's why the new balancing code leaves the area between
zone->pages_low and zone->pages_high as "slack", used to
balance between zones. And when all zones go _just_ below
zone->pages_low, we'll free something from the zones.

If one zone contains more easily freeable memory, we'll free
more pages from that zone before we get the other zone(s)
above zone->pages_low ... and we have the balancing between
zones.

> At this moment I won't buy the current design and I'll stick
> with classzone until somebody will offer me a design solution
> that handles all the cases right as classzone does

I think you may want to read the discussion between Matt Dillon
and me about FreeBSD VM. The main point is that we keep some
pages around which are clean, unmapped and unused. We can reclaim
them at any time.

Since "producing" such pages doesn't mean we have to "throw away"
useful data, we'll have the ability to have one really inactive
zone grow megabytes of these pages without too much overhead, so
we can achieve faster balancing between zones with the benefits of
both classzone and the normal zoned system.

Also, since all inactive pages are equally old and equal candidates
for being evicted from memory, we can choose to delay IO on dirty
pages or spread out IO in a better way. There are all sorts of big
and small optimisations we can make here...

(eg. don't grow the number of scavenge pages in a zone if we don't
need to and it would require IO to do so)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 19:32         ` Rik van Riel
@ 2000-06-13 23:07           ` Andrea Arcangeli
  2000-06-13 23:34             ` Rik van Riel
  2000-06-13 23:41             ` Juan J. Quintela
  0 siblings, 2 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-13 23:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Tue, 13 Jun 2000, Rik van Riel wrote:

>On Tue, 13 Jun 2000, Andrea Arcangeli wrote:
>
>> Then you do some more I/O and allocate some cache, then kswapd
>> triggers to try to free some memory because all zones are under
>> the watermark. OK?
>
>Ahhh, but kswapd will *only* trigger the number of pages we
>need to reach zone->pages_low (in the latest -ac patches).

Who said otherwise? It will trigger for freeing pages_low-pages_min. Just
the gap between the two watermarks you prefer.

>> Then netscape exits and release 10mbyte from the DMA zone _but_
>> kswapd continues to shrink the normal zone, why??? -> because
>> the MM doesn't have enough information in order to do the right
>> thing, that's all.
>
>In this case kswapd will only shrink the normal zone *once*.

How can you be sure of that? So I'll make you an obvious case where
it will shrink not twice, not three times but _forever_.

Assume the pages_min of the normal zone watermark triggers when the normal
zone is allocated at 95% and assume that all such 95% of the normal zone
has been allocated all in mlocked memory and kernel mem_map_t array. Can't
somebody (for example an oracle database) allocate 95% of the normal zone
in mlocked shm memory? Do you agree? Or you are telling me it can't or
that if it does so it should then expect the linux kernel to explode
(actually it would cause kswapd to loop forever trying to free the normal
zone even if there's still 15mbyte of ZONE_DMA memory free).

So let's make the whole picture from the start starting with all the
memory free: assume oracle allocates all the normal zone in shm mlocked
memory. You still have 15mbyte free for the cache in the ZONE_DMA, OK?
Then you allocate the 95% of such 15mbyte in the cache and then kswapd
triggers and it will never stop because it will try to free the
zone_normal forever, even if it just recycled enough memory from the
ZONE_DMA (so even if __alloc_pages wouldn't start memory balancing
anymore!). See????

The classzone patch will fix the above bad behaviour completely because
kswapd in classzone will notice that there's enough memory for allocation
from both ZONE_DMA and ZONE_NORMAL because the cache in the ZONE_DMA has
been recycled successfully.
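
One way to read the difference is sketched below. This is an interpretation of the description in this message, not code from the classzone patch: it assumes a "class" is a zone plus all lower zones, and that balancing is judged on the summed free pages of the class. struct sim_cz_zone and both helper functions are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for a zone; index 0 is the lowest zone (DMA). */
struct sim_cz_zone {
    long free_pages;
    long pages_low;
};

/* Strict per-zone view: each zone is judged only on its own free pages. */
static int sim_strict_zone_needs_balance(const struct sim_cz_zone *zones,
                                         size_t idx)
{
    return zones[idx].free_pages < zones[idx].pages_low;
}

/*
 * Class view (assumed reading): an allocation aimed at zone `class_idx`
 * may also be served from any lower zone, so the free pages and the
 * watermarks of zones[0..class_idx] are summed before comparing.
 */
static int sim_classzone_needs_balance(const struct sim_cz_zone *zones,
                                       size_t class_idx)
{
    long free_total = 0;
    long low_total = 0;
    size_t i;

    for (i = 0; i <= class_idx; i++) {
        free_total += zones[i].free_pages;
        low_total += zones[i].pages_low;
    }
    return free_total < low_total;
}
```

With the normal zone pinned below its watermark but plenty of recycled memory in the DMA zone, the strict check keeps reporting pressure on the normal zone while the class check does not.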

Without classzone you'll always get the above case wrong and I don't mind
if it's a corner case or not, we have to handle it right! I will hate a
kernel that works fine only as far as you only compile kernels on it.

>After the normal zone has reached zone->pages_low, we will:

The normal zone will never reach pages_low because all that is
allocated in the normal zone is mlocked userspace shm memory.

>I think you're overlooking the fact that kswapd's freeing of
>pages is something that occurs only *once*...

Since the normal zone will never return over pages_low it will run more
than once.

>> (it may even run slower in the common case but I really don't
>> mind about performance, I mind about correctness first).
>
>Ermm, wasn't your motivation for the classzone idea
>_performance_??  (at least, that's what I read from
>the rest of your email)

My argument of the classzone design is to get correctness in the corner
case: to fix the drawbacks.

Then I also included into such patch some performance stuff and that's why
it also improves performance significantly but I'm not interested in
that part for now. Since that part is stable as well you can get both
correctness and improvement at the same time but I can drop the
performance part if there is interest only in the other part.

I believe the very classzone part (the design change in the page_alloc.c)
isn't going to make visible performance changes in the common case but it
simply allows us to get the corner case right.

I don't mind about the other part of the email at this moment, I only mind
about the global design of the allocator at this moment.

Andrea


* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 23:07           ` Andrea Arcangeli
@ 2000-06-13 23:34             ` Rik van Riel
  2000-06-14  0:12               ` Andrea Arcangeli
  2000-06-13 23:41             ` Juan J. Quintela
  1 sibling, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-13 23:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
> On Tue, 13 Jun 2000, Rik van Riel wrote:
> 
> >> Then netscape exits and release 10mbyte from the DMA zone _but_
> >> kswapd continues to shrink the normal zone, why??? -> because
> >> the MM doesn't have enough information in order to do the right
> >> thing, that's all.
> >
> >In this case kswapd will only shrink the normal zone *once*.
> 
> How can you be sure of that? So I'll make you an obvious case where
> it will shrink not twice, not three times but _forever_.

The infinite loop case is orthogonal to classzone. Please don't
try to confuse the issues.

> Assume the pages_min of the normal zone watermark triggers when the normal
> zone is allocated at 95% and assume that all such 95% of the normal zone
> has been allocated all in mlocked memory and kernel mem_map_t array. Can't
> somebody (for example an oracle database) allocate 95% of the normal zone
> in mlocked shm memory? Do you agree? Or you are telling me it can't or
> that if it does so it should then expect the linux kernel to explode
> (actually it would cause kswapd to loop forever trying to free the normal
> zone even if there's still 15mbyte of ZONE_DMA memory free).

No. Kswapd will never get woken up until *all* zones get below
zone->pages_low. I fixed this buglet in the -ac patches.
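
The wake-up rule stated here can be sketched as a simple predicate. This is an illustration under the assumption that the rule is exactly "every zone is below its pages_low"; struct sim_wm_zone and sim_should_wake_kswapd() are hypothetical names, not the -ac code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a zone and its low watermark. */
struct sim_wm_zone {
    long free_pages;
    long pages_low;
};

/*
 * Returns 1 only when *all* zones are below pages_low.
 * A single zone with enough free pages keeps kswapd asleep.
 */
static int sim_should_wake_kswapd(const struct sim_wm_zone *zones,
                                  size_t nr_zones)
{
    size_t i;

    if (nr_zones == 0)
        return 0;
    for (i = 0; i < nr_zones; i++) {
        if (zones[i].free_pages >= zones[i].pages_low)
            return 0;
    }
    return 1;
}
```

Note that this only models when kswapd is woken; it says nothing about when a running kswapd stops, which is the loop-termination problem discussed in the rest of the message.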

> memory. You still have 15mbyte free for the cache in the ZONE_DMA, OK?
> Then you allocate the 95% of such 15mbyte in the cache and then kswapd
> triggers and it will never stop because it will try to free the
> zone_normal forever, even if it just recycled enough memory from the
> ZONE_DMA (so even if __alloc_pages wouldn't start memory balancing
> anymore!). See????

No I don't see this. Kswapd will only be woken up when all zones get
below pages_low. I agree that this corner case can happen and that
we should fix it in kswapd, but I don't see how this has anything to
do with classzone vs. the zoned approach.

> >I think you're overlooking the fact that kswapd's freeing of
> >pages is something that occurs only *once*...
> 
> Since the normal zone will never return over pages_low it will
> run more than once.

You're right that the current kswapd loop won't terminate and
that this is a bug, but it doesn't have anything at all to do
with the classzone idea.

> I believe the very classzone part (the design change in the
> page_alloc.c) isn't going to make visible performance changes in
> the common case but it simply allow to get the corner case
> right.

Except when the zones are not inclusive. You may want to check
out the docs on the new POWER4 beasts from IBM. They have 4
dual-cpu dies in one package, with fast interconnects between
the dies, but one memory bus and one IO bus directly attached
to each die.

This way you'll end up with something halfway between NUMA and
SMP (nUMA?), and the zone lists are still complementary, but no
longer inclusive ...

Classzone may be a nice abstraction for the current generation
of PCs, but it's simply not general enough to cover all corner
cases. Also, the gain of classzone over a correctly implemented
zoned VM should be absolutely negligible (if it exists at all).

> I don't mind about the other part of the email at this moment, I
> only mind about the global design of the allocator at this
> moment.

Then please look at the allocator code in -ac18 and not at the
one in Linus's kernel...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 23:34             ` Rik van Riel
@ 2000-06-14  0:12               ` Andrea Arcangeli
  2000-06-14  0:58                 ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-14  0:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Tue, 13 Jun 2000, Rik van Riel wrote:

>The infinite loop case is orthogonal to classzone. Please don't
>try to confuse the issues.

It isn't! Classzone will loop forever only if you are really out of
memory. In the described scenario it won't waste any further time
in kswapd, because kswapd succeeds in shrinking something from ZONE_DMA.

>> Assume the pages_min watermark of the normal zone triggers when the normal
>> zone is 95% allocated, and assume that all of that 95% of the normal zone
>> has been allocated as mlocked memory and the kernel mem_map_t array. Can't
>> somebody (for example an Oracle database) allocate 95% of the normal zone
>> in mlocked shm memory? Do you agree? Or are you telling me it can't, or
>> that if it does so it should then expect the Linux kernel to explode?
>> (Actually it would cause kswapd to loop forever trying to free the normal
>> zone even if there's still 15mbyte of ZONE_DMA memory free.)
>
>No. Kswapd will never get woken up until *all* zones get below
>zone->pages_low. I fixed this buglet in the -ac patches.

All zones have gone under pages_low. The normal zone went under the watermark
due to the Oracle mlocked shm; the other zone (DMA) went under the watermark
due to the cache that was allocated during I/O.

>> memory. You still have 15mbyte free for the cache in the ZONE_DMA, OK?
>> Then you allocate the 95% of such 15mbyte in the cache and then kswapd
>> triggers and it will never stop because it will try to free the
>> zone_normal forever, even if it just recycled enough memory from the
>> ZONE_DMA (so even if __alloc_pages wouldn't start memory balancing
>> anymore!). See????
>
>No I don't see this. Kswapd will only be woken up when all zones get
>below pages_low. I agree that this corner case can happen and that

All zones have gone under pages_low.

>we should fix it in kswapd, but I don't see how this has anything to
>do with classzone vs. the zoned approach.

You can't fix this from kswapd with yet another hack.

>You're right that the current kswapd loop won't terminate and
>that this is a bug, but it doesn't have anything at all to do
>with the classzone idea.

It has to do with the classzone idea, because you shouldn't even try to
repeat the loop: you should notice that the ZONE_NORMAL _classzone_
is not under the watermark, because you succeeded in freeing the cache from
ZONE_DMA.
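
The check argued for above can be sketched like this. It is a minimal
illustration assuming inclusive zones ordered from lowest (DMA) to
highest, so that a classzone is a zone plus every zone below it. All
names are hypothetical and are not taken from the classzone patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone descriptor; not the kernel's zone_t. */
struct zone_sketch {
    unsigned long free_pages;  /* pages currently free in this zone */
};

/*
 * Free pages available to an allocation that targets zones[idx]:
 * the zone itself plus all lower zones it may fall back to.
 */
unsigned long classzone_free_pages(const struct zone_sketch *zones, size_t idx)
{
    unsigned long total = 0;

    assert(zones != NULL);
    for (size_t i = 0; i <= idx; i++)
        total += zones[i].free_pages;
    return total;
}

/* True when the whole classzone is below the given watermark. */
bool classzone_needs_balancing(const struct zone_sketch *zones, size_t idx,
                               unsigned long watermark)
{
    return classzone_free_pages(zones, idx) < watermark;
}
```

In the scenario under discussion, pages recycled from ZONE_DMA raise the
total for the ZONE_NORMAL classzone, so the check stops asking for more
work even though ZONE_NORMAL itself freed nothing.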

>Except when the zones are not inclusive. You may want to check
>out the docs on the new POWER4 beasts from IBM. They have 4

If it's a POWER4 beast, I also hope it won't need any zones in the first
place, just like on an Alpha. We have more than one zone only because some
hardware was designed in the distant past.

>Classzone may be a nice abstraction for the current generation
>of PCs, but it's simply not general enough to cover all corner
>cases. [..]

Sorry but your argument is silly. You say that you are not covering the
corner cases at runtime in a PC used by 90% of userbase because you want
to support another very alien architecture without having to change one
bit of code in page_alloc.c?

All the architectures that I know of fit the classzone design, but don't
worry about that: if there is some future one that doesn't fit, I will extend
the classzone design to also support non-inclusive zones. I am simply avoiding
overdesign at this time.

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14  0:12               ` Andrea Arcangeli
@ 2000-06-14  0:58                 ` Rik van Riel
  2000-06-14  1:18                   ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-14  0:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
> On Tue, 13 Jun 2000, Rik van Riel wrote:
> 
> >The infinite loop case is orthogonal to classzone. Please don't
> >try to confuse the issues.
> 
> It isn't! classzone will loop forever only if you are really out
> of memory,

Which is a bug, just the same as this can happen to the zoned design
when we run out of memory in one zone. As I said, orthogonal.

> >No. Kswapd will never get woken up until *all* zones get below
> >zone->pages_low. I fixed this buglet in the -ac patches.
> 
> All zones have gone under pages_low. The normal zone went under the
> watermark due to the Oracle mlocked shm; the other zone (DMA)
> went under the watermark due to the cache that was allocated
> during I/O.

The zone approach doesn't really use the watermarks in the 2.2
sense. If all zones dive below pages_low, kswapd will free some
memory until all zones get just above pages_low.

We achieve something like the watermarks because we'll free the
pages that are at the end of the LRU list, so if one zone has a
lot of unused pages, we'll have freed up to pages_high memory in
that zone before we get the other zone above pages_low...

> >we should fix it in kswapd, but I don't see how this has anything to
> >do with classzone vs. the zoned approach.
> 
> You can't fix this from kswapd with yet another hack.

Above you wrote that classzone has the exact same problem. If one
(class)zone gets out of memory and contains no freeable memory,
kswapd will enter an infinite loop. In this case there's no
difference between freeing memory from a classzone or a normal zone.

In fact, the bugfix would be the exact *same* for both classzone and
the normal zoned VM.
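
One shape such a shared fix could take is a bounded scan, along the
lines of the "maxscan" idea raised earlier in the thread. The sketch
below is an assumption for illustration only: it models the LRU as a
plain array, and page_sketch and shrink_bounded() are hypothetical
names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an LRU entry; not the kernel's struct page. */
struct page_sketch {
    bool freeable;        /* false for mlocked or otherwise pinned pages */
    bool in_target_zone;  /* true if the page belongs to the zone we balance */
};

/*
 * Try to free up to `want` pages from the target zone, but never scan
 * more entries than the list holds. Returns the number of pages freed,
 * which may be 0 if nothing in the target zone is freeable.
 */
size_t shrink_bounded(struct page_sketch *lru, size_t nr_pages, size_t want)
{
    size_t freed = 0;
    size_t maxscan = nr_pages;  /* upper bound on work per call */

    assert(lru != NULL || nr_pages == 0);

    for (size_t scanned = 0; scanned < maxscan && freed < want; scanned++) {
        struct page_sketch *page = &lru[scanned];

        if (page->in_target_zone && page->freeable) {
            page->freeable = false;  /* model the page as reclaimed */
            freed++;
        }
    }
    return freed;
}
```

The caller can treat a return value of 0 as "no progress possible" and
go back to sleep instead of retrying, whichever zone model decides what
counts as the target.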

> >You're right that the current kswapd loop won't terminate and
> >that this is a bug, but it doesn't have anything at all to do
> >with the classzone idea.
> 
> It has to do with the classzone idea, because you shouldn't
> even try to repeat the loop: you should notice that the
> ZONE_NORMAL _classzone_ is not under the watermark, because you
> succeeded in freeing the cache from ZONE_DMA.

You're playing with words here. If the cache was allocated before
the mlock()ed memory, classzone would loop forever on trying to
free memory from the DMA zone. There is no fundamental difference
in the manifestation of the bug on either classzone or the normal
VM.

> >Classzone may be a nice abstraction for the current generation
> >of PCs, but it's simply not general enough to cover all corner
> >cases. [..]
> 
> Sorry but your argument is silly. You say that you are not
> covering the corner cases at runtime in a PC used by 90% of
> userbase because you want to support another very alien
> architecture without having to change one bit of code in
> page_alloc.c?

No. I'm saying that the classzone abstraction is not general enough
and we can support all corner cases of usage well without it. In
fact, as I demonstrated above, even your own contorted example will
hang classzone if I only switch the order in which the allocations
happen...

> All the architectures that I know of fit the classzone design,
> but don't worry about that: if there is some future one that
> doesn't fit, I will extend the classzone design to also support
> non-inclusive zones. I am simply avoiding overdesign at this time.

I don't think I can add anything to this. Adding features to
an already complex design to avoid overengineering??

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14  0:58                 ` Rik van Riel
@ 2000-06-14  1:18                   ` Andrea Arcangeli
  2000-06-14  1:33                     ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-14  1:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Tue, 13 Jun 2000, Rik van Riel wrote:

>> It has to do with the classzone idea, because you shouldn't
>> even try to repeat the loop: you should notice that the
>> ZONE_NORMAL _classzone_ is not under the watermark, because you
>> succeeded in freeing the cache from ZONE_DMA.
>
>You're playing with words here. If the cache was allocated before
>the mlock()ed memory, classzone would loop forever on trying to
>free memory from the DMA zone. There is no fundamental difference
>in the manifestation of the bug on either classzone or the normal
>VM.

There's a _lot_ of difference: in your scenario the DMA zone is empty
and the next GFP_DMA allocation will fail. So it's right and necessary to
loop in kswapd, because we really are low on memory in that zone (the DMA
zone)!

In the scenario that I raised previously where the stock kernel loops
forever (so first mlocked in normal zone and then cache in dma zone)  the
recycling will _succeed_ and there's no reason to keep looping in kswapd
trying to free the normal zone because the next allocation will succeed 
without any problem. Do you see the difference?

>and we can support all corner cases of usage well without it. In
>fact, as I demonstrated above, even your own contorted example will
>hang classzone if I only switch the order in which the allocations
>happen...

It won't hang, but kswapd will eat CPU and that's right in your case. The
difference that you can't see is that in the second scenario, where
classzone would spend CPU in kswapd, the CPU is spent for a purpose that
makes sense. In the first scenario, where classzone wouldn't spend any
CPU, the CPU in kswapd would in fact be _wasted_.

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14  1:18                   ` Andrea Arcangeli
@ 2000-06-14  1:33                     ` Rik van Riel
  2000-06-14  2:10                       ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-14  1:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Andrea Arcangeli wrote:

> >and we can support all corner cases of usage well without it. In
> >fact, as I demonstrated above, even your own contorted example will
> >hang classzone if I only switch the order in which the allocations
> >happen...
> 
> It won't hang, but kswapd will eat CPU and that's right in your case. The
> difference that you can't see is that in the second scenario, where
> classzone would spend CPU in kswapd, the CPU is spent for a purpose that
> makes sense. In the first scenario, where classzone wouldn't spend any
> CPU, the CPU in kswapd would in fact be _wasted_.

Now explain to me *why* this happens. I'm pretty sure this happens
because of the 'dispose = &old' in shrink_mmap and not because of
anything even remotely classzone related...

I'm trying to improve the Linux kernel here, I'd appreciate it if
you were honest with me.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14  1:33                     ` Rik van Riel
@ 2000-06-14  2:10                       ` Andrea Arcangeli
  2000-06-14  2:46                         ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-14  2:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Tue, 13 Jun 2000, Rik van Riel wrote:

>On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
>
>> >and we can support all corner cases of usage well without it. In
>> >fact, as I demonstrated above, even your own contorted example will
>> >hang classzone if I only switch the order in which the allocations
>> >happen...
>> 
>> It won't hang, but kswapd will eat CPU and that's right in your case. The
>> difference that you can't see is that in the second scenario, where
>> classzone would spend CPU in kswapd, the CPU is spent for a purpose that
>> makes sense. In the first scenario, where classzone wouldn't spend any
>> CPU, the CPU in kswapd would in fact be _wasted_.
>
>Now explain to me *why* this happens. I'm pretty sure this happens
>because of the 'dispose = &old' in shrink_mmap and not because of
>anything even remotely classzone related...

You waste CPU in kswapd in the first scenario simply because you are not
looking back at the ZONE_DMA state at the time you have to decide whether
you made some progress on the ZONE_NORMAL zone.

You made progress in ZONE_DMA because it was all cache, so kswapd
should understand that even if nothing has been freed from ZONE_NORMAL, we
already have enough margin for the next GFP_KERNEL allocation too (not only
for the GFP_DMA allocations), and thus it should stop looping. There's
already enough free memory for both zones.

The problem isn't related to shrink_mmap, but only to the zone design
(proper classzone part).

>I'm trying to improve the Linux kernel here, I'd appreciate it if
>you were honest with me.

Are you saying I'm not being honest with you? JFYI: I don't enjoy being
insulted by you (and it's not the first time). I will also ignore your
comment above, but please don't insult me anymore in the future! Thanks.

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14  2:10                       ` Andrea Arcangeli
@ 2000-06-14  2:46                         ` Rik van Riel
  2000-06-14 13:01                           ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-14  2:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
> On Tue, 13 Jun 2000, Rik van Riel wrote:
> >On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
> >
> >> >and we can support all corner cases of usage well without it. In
> >> >fact, as I demonstrated above, even your own contorted example will
> >> >hang classzone if I only switch the order in which the allocations
> >> >happen...
> >> 
> >> It won't hang, but kswapd will eat CPU and that's right in your case. The
> >> difference that you can't see is that in the second scenario, where
> >> classzone would spend CPU in kswapd, the CPU is spent for a purpose that
> >> makes sense. In the first scenario, where classzone wouldn't spend any
> >> CPU, the CPU in kswapd would in fact be _wasted_.
> >
> >Now explain to me *why* this happens. I'm pretty sure this happens
> >because of the 'dispose = &old' in shrink_mmap and not because of
> >anything even remotely classzone related...
> 
> You waste CPU in kswapd in the first scenario simply because you
> are not looking back at the ZONE_DMA state at the time you
> have to decide whether you made some progress on the ZONE_NORMAL zone.
>
> The problem isn't related to shrink_mmap, but only to the zone design
> (proper classzone part).

But when you switch around the order of allocation in your
hypothetical example, allocating the cache first, from the
ZONE_NORMAL and then proceeding to mlock the rest of the
normal zone and the dma zone, then classzone will still
break.

> >I'm trying to improve the Linux kernel here, I'd appreciate it if
> >you were honest with me.
> 
> Are you saying I'm not being honest with you? JFYI: I don't enjoy being
> insulted by you (and it's not the first time). I will also ignore your
> comment above, but please don't insult me anymore in the future! Thanks.

Conveniently snipping out the part of my post where I proved
your example wrong is not what I'd call constructive dialog.
Maybe the "honesty" thing was a bit much. I should get some
sleep and try again tomorrow using less inflammatory words.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14  2:46                         ` Rik van Riel
@ 2000-06-14 13:01                           ` Andrea Arcangeli
  2000-06-14 13:44                             ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-14 13:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Tue, 13 Jun 2000, Rik van Riel wrote:

>But when you switch around the order of allocation in your
>hypothetical example, allocating the cache first, from the
>ZONE_NORMAL and then proceeding to mlock the rest of the
>normal zone and the dma zone, then classzone will still
>break.

It doesn't break anything. You simply will not be able to allocate memory
with GFP_DMA anymore (that seldom happened in 2.2.x too). If all the
DMA zone is mlocked, not being able to return GFP_DMA memory is normal.

If all the ZONE_NORMAL is mlocked but the ZONE_DMA is filled by cache,
having kswapd loop forever wasting CPU on the ZONE_NORMAL is
broken behaviour IMHO.

>Conveniently snipping out the part of my post where I proved
>your example wrong is not what I'd call constructive dialog.

You repeated the same thing many times and so I left only the part
underlined below in the reply.

On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
>
>Date: Wed, 14 Jun 2000 04:10:08 +0200 (CEST)
>From: Andrea Arcangeli <andrea@suse.de>
>To: Rik van Riel <riel@conectiva.com.br>
>Cc: Juan J. Quintela <quintela@fi.udc.es>, Stephen C. Tweedie <sct@redhat.com>,
>     Zlatko Calusic <zlatko@iskon.hr>, alan@redhat.com,
>     Linux MM List <linux-mm@kvack.org>,
>     Linux Kernel List <linux-kernel@vger.rutgers.edu>,
>     Linus Torvalds <torvalds@transmeta.com>
>Subject: Re: [patch] improve streaming I/O [bug in shrink_mmap()]
>
>On Tue, 13 Jun 2000, Rik van Riel wrote:
>
>>On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
>>
>>> >and we can support all corner cases of usage well without it. In
								   ^^
>>> >fact, as I demonstrated above, even your own contorted example will
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> >hang classzone if I only switch the order in which the allocations
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> >happen...
     ^^^^^^
>>> 
>>> It won't hang, but kswapd will eat CPU and that's right in your case. The
>>> difference that you can't see is that in the second scenario, where
>>> classzone would spend CPU in kswapd, the CPU is spent for a purpose that
>>> makes sense. In the first scenario, where classzone wouldn't spend any
>>> CPU, the CPU in kswapd would in fact be _wasted_.
>>
>>Now explain to me *why* this happens. I'm pretty sure this happens
>>because of the 'dispose = &old' in shrink_mmap and not because of
>>anything even remotely classzone related...
>
>You waste CPU in kswapd in the first scenario simply because you are not
>looking back at the ZONE_DMA state at the time you have to decide whether
>you made some progress on the ZONE_NORMAL zone.
>
>You made progress in ZONE_DMA because it was all cache, so kswapd
>should understand that even if nothing has been freed from ZONE_NORMAL, we
>already have enough margin for the next GFP_KERNEL allocation too (not only
>for the GFP_DMA allocations), and thus it should stop looping. There's
>already enough free memory for both zones.
>
>The problem isn't related to shrink_mmap, but only to the zone design
>(proper classzone part).
>
>>I'm trying to improve the Linux kernel here, I'd appreciate it if
>>you were honest with me.
>
>Are you saying I'm not being honest with you? JFYI: I don't enjoy being
>insulted by you (and it's not the first time). I will also ignore your
>comment above, but please don't insult me anymore in the future! Thanks.
>
>Andrea
>
>

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14 13:01                           ` Andrea Arcangeli
@ 2000-06-14 13:44                             ` Rik van Riel
  2000-06-14 13:57                               ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-14 13:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
> On Tue, 13 Jun 2000, Rik van Riel wrote:
> 
> >But when you switch around the order of allocation in your
> >hypothetical example, allocating the cache first, from the
> >ZONE_NORMAL and then proceeding to mlock the rest of the
> >normal zone and the dma zone, then classzone will still
> >break.
> 
> It doesn't break anything. You simply will not be able to allocate memory
> with GFP_DMA anymore (that seldom happened in 2.2.x too). If all the
> DMA zone is mlocked, not being able to return GFP_DMA memory is normal.

So if the ZONE_DMA is filled by mlock()ed memory, classzone
will *not* try to balance it? Will classzone *only* try to
balance the big classzone containing zone_dma, and not the
dma zone itself?  (since the dma zone doesn't contain any
other zone, doesn't it need to be balanced?)

> If all the ZONE_NORMAL is mlocked but the ZONE_DMA is filled by cache,
> having kswapd loop forever wasting CPU on the ZONE_NORMAL is
> broken behaviour IMHO.

A few mails back you wrote that the classzone patch would
do just about the same if a _classzone_ fills up. (except
that the different shrink_mmap() causes it to go to sleep
before being woken up again at the next allocation)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14 13:44                             ` Rik van Riel
@ 2000-06-14 13:57                               ` Andrea Arcangeli
  2000-06-14 16:48                                 ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-14 13:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Rik van Riel wrote:

>So if the ZONE_DMA is filled by mlock()ed memory, classzone
>will *not* try to balance it? Will classzone *only* try to

It will try but it won't succeed.

>balance the big classzone containing zone_dma, and not the
>dma zone itself?  (since the dma zone doesn't contain any

No, I definitely try to balance the DMA zone itself. But in such case (all
DMA zone mlocked) kswapd will just spend CPU trying to balance the zone
but it _can't_ succeed because mlocked just means we can't even attempt to
move such memory elsewhere in the physical space or we'll break userspace 
critical latency needs.

>other zone, doesn't it need to be balanced?)

Yes, I of course agree, it needs to be balanced and classzone will try to
balance it.

>A few mails back you wrote that the classzone patch would
>do just about the same if a _classzone_ fills up. (except

What do you mean by "just about the same"? You mean spending CPU in kswapd
trying to release some memory? If so yes. When a classzone fills up kswapd
will spend cpu trying to free some memory so that the next
GFP_DMA/GFP_KERNEL/GFP_HIGHUSER allocation (depending on the classzone
that is low on memory) will succeed.

>that the different shrink_mmap() causes it to go to sleep
>before being woken up again at the next allocation)

In classzone, shrink_mmap doesn't control in any way how kswapd will react
to low memory conditions. Only the memory levels of the classzones
control kswapd. If a classzone is low on memory, kswapd will keep trying
to shrink it.

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14 13:57                               ` Andrea Arcangeli
@ 2000-06-14 16:48                                 ` Rik van Riel
  2000-06-14 17:14                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-14 16:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
> On Wed, 14 Jun 2000, Rik van Riel wrote:
> 
> >So if the ZONE_DMA is filled by mlock()ed memory, classzone
> >will *not* try to balance it? Will classzone *only* try to
> 
> It will try but it won't succeed.
> 
> >balance the big classzone containing zone_dma, and not the
> >dma zone itself?  (since the dma zone doesn't contain any
> 
> No, I definitely try to balance the DMA zone itself. But in such
> case (all DMA zone mlocked) kswapd will just spend CPU trying to
> balance the zone but it _can't_ succeed because mlocked just
> means we can't even attempt to move such memory elsewhere in the
> physical space or we'll break userspace critical latency needs.

I fully agree with this, this is the obviously right thing to
do. Would you be surprised to know that the code in the last
2.4.0-ac kernels does exactly this?

(with the exception of the two implementation bugs which can
cause kswapd and shrink_mmap to loop)

> >A few mails back you wrote that the classzone patch would
> >do just about the same if a _classzone_ fills up. (except
> 
> What do you mean by "just about the same"? You mean spending CPU
> in kswapd trying to release some memory? If so yes. When a
> classzone fills up kswapd will spend cpu trying to free some
> memory so that the next GFP_DMA/GFP_KERNEL/GFP_HIGHUSER
> allocation (depending on the classzone that is low on memory)
> will succeed.

So classzone and the normal zoned VM behave in the same way here
except that classzone doesn't show the bad effects when the
allocations happen in a certain lucky order.

I think the differences between classzone and the zoned vm are
pretty small at this moment, with most of classzone's benefits
being theoretical ones that rely on memory zones being inclusive
rather than numa-like...

> >that the different shrink_mmap() causes it to go to sleep
> >before being woken up again at the next allocation)
> 
> In classzone, shrink_mmap doesn't control in any way how kswapd
> will react to low memory conditions. Only the memory levels of
> the classzones control kswapd. If a classzone is low on
> memory, kswapd will keep trying to shrink it.

Owww, so classzone kswapd will get into an infinite loop with
the disaster scenario too?

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14 16:48                                 ` Rik van Riel
@ 2000-06-14 17:14                                   ` Andrea Arcangeli
  2000-06-14 17:33                                     ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-14 17:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Rik van Riel wrote:

>On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
>> On Wed, 14 Jun 2000, Rik van Riel wrote:
>> 
>> >So if the ZONE_DMA is filled by mlock()ed memory, classzone
>> >will *not* try to balance it? Will classzone *only* try to
>> 
>> It will try but it won't succeed.
>> 
>> >balance the big classzone containing zone_dma, and not the
>> >dma zone itself?  (since the dma zone doesn't contain any
>> 
>> No, I definitely try to balance the DMA zone itself. But in such
>> case (all DMA zone mlocked) kswapd will just spend CPU trying to
>> balance the zone but it _can't_ succeed because mlocked just
>> means we can't even attempt to move such memory elsewhere in the
>> physical space or we'll break userspace critical latency needs.
>
>I fully agree with this, this is the obviously right thing to

Ok. [1]

>do. Would you be surprised to know that the code in the last
>2.4.0-ac kernels does exactly this?

I'm not surprised. I know what the current code does, and in fact I didn't
take that case as the testcase. That was _your_ testcase, which you invented
by changing the statement of the problem into something that is handled
correctly by the current code, and I'm not interested in it (as long as the
kernel continues to handle it correctly as it does now ;).

_My_ testcase (first mlocked and then cache) is instead handled wrongly by
the latest kernels, and that's the only thing I'm interested in at this
moment.

>(with the exception of the two implementation bugs which can
>cause kswapd and shrink_mmap to loop)

Indeed, I'm not concerned with that issue at the moment.

>> >A few mails back you wrote that the classzone patch would
>> >do just about the same if a _classzone_ fills up. (except
>> 
>> What you mean with "just about the same"? You mean spending CPU
>> in kswapd trying to release some memory? If so yes. When a
>> classzone fills up kswapd will spend cpu trying to free some
>> memory so that the next GFP_DMA/GFP_KERNEL/GFP_HIGHUSER
>> allocation (depending on the classzone that is low on memory)
>> will succeed.
>
>So classzone and the normal zoned VM behave in the same way here
>except that classzone doesn't show the bad effects when the
>allocations happen in a certain lucky order.
>
>I think the differences between classzone and the zoned vm are
>pretty small at this moment, with most of classzone's benefits
>being theoretical ones that rely on memory zones being inclusive
>rather than numa-like...

You got it. Exactly.

However, don't mix NUMA with the internals of a node. We have the pgdat, and
each one is a node in a NUMA system. All the zones internal to a pgdat
have to belong to the same node, or it will become impossible to shrink
cache from only one zone and to make smart decisions in NUMA systems.

>> >that the different shrink_mmap() causes it to go to sleep
>> >before being woken up again at the next allocation)
>> 
>> In classzone shrink_mmap doesn't control in any way how kswapd
>> will react to low memory conditions. Only the level of memory of
>> the classzones are controlling kswapd. If classzone is low on
>> memory kswapd will keep to try to shrink it.
>
>Owww, so classzone kswapd will get into an infinite loop with
>the disaster scenario too?

Yes. If I understood the first line of your email correctly, you agree
that's the right behaviour (see [1]). Since in the disaster scenario the
ZONE_DMA classzone is low on memory, kswapd will continue to spend CPU
trying to free some pages there.

Andrea


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14 17:14                                   ` Andrea Arcangeli
@ 2000-06-14 17:33                                     ` Rik van Riel
  2000-06-14 18:37                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-14 17:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
> On Wed, 14 Jun 2000, Rik van Riel wrote:
> >On Wed, 14 Jun 2000, Andrea Arcangeli wrote:
> >> On Wed, 14 Jun 2000, Rik van Riel wrote:
> >> 
> >> >So if the ZONE_DMA is filled by mlock()ed memory, classzone
> >> >will *not* try to balance it? Will classzone *only* try to
> >> 
> >> It will try but it won't succeed.
> >> 
> >> >balance the big classzone containing zone_dma, and not the
> >> >dma zone itself?  (since the dma zone doesn't contain any
> >> 
> >> No, I definitely try to balance the DMA zone itself. But in such
> >> case (all DMA zone mlocked) kswapd will just spend CPU trying to
> >> balance the zone but it _can't_ succeed because mlocked just
> >> means we can't even attempt to move such memory elsewhere in the
> >> physical space or we'll break userspace critical latency needs.
> >
> >I fully agree with this, this is the obviously right thing to
> 
> Ok. [1]

Ermmm, I mean that trying to _balance_ the zone is the right
thing to do. Consuming infinite CPU time when we can't succeed
is a clear bug we want to fix.

> >do. Would you be surprised to know that the code in the last
> >2.4.0-ac kernels does exactly this?
> 
> I'm not surprised. I know what the current code does and infact
> I didn't took that case as the testcase. That was _your_
> testcase that you invented changing the text of the problem in
> something that is handled correctly by the current code and I'm
> not interested about it (as far as the kernel continues to
> handle it correctly as now ;).
> 
> _My_ testcase (first mlocked and then cache) is instead handled
> wrong by the latest kernels and that's the only thing I'm
> interested about at this moment.

The only difference between your test case and my test case
is that the allocations happen in another order.

In most cases _both_ classzone and the zoned VM will break
down horribly, only when the allocations happen in the lucky
order of your example will classzone deal with the situation
better than the normal kernel.

> >So classzone and the normal zoned VM behave in the same way here
> >except that classzone doesn't show the bad effects when the
> >allocations happen in a certain lucky order.
> >
> >I think the differences between classzone and the zoned vm are
> >pretty small at this moment, with most of classzone's benefits
> >being theoretical ones that rely on memory zones being inclusive
> >rather than numa-like...
> 
> You got it. Exactly.
> 
> However don't mix numa with the internal of a node. We have the
> pgdat and each one is a node in a NUMA system. All the zones
> internal to a pgdat have to belong to the some node or it will
> become impossible to shrink cache only from one zone and to do
> smart decisions in NUMA systems.

I'll send you the POWER4 document I have here so you can
see what I meant. This machine will be somewhere halfway
between NUMA and SMP ... having non-inclusive zones in the
same node seems to make quite a lot of sense in that
architecture.

And since the behaviour of classzone and normal zoned vm
is just about the same, I'd really like it if we chose the
more generic abstraction here.

> >Owww, so classzone kswapd will get into an infinite loop with
> >the disaster scenario too?
> 
> Yes. If I understood well from the first line of your email you
> agree that's the right behaviour (see [1]). Since in the
> disaster scenario the ZONE_DMA classzone is low on memory kswapd
> will continue to spend CPU to try to free some page there.

Nono. As I corrected above, I think it is good that we try to
balance the zone, but we shouldn't do so in an infinite loop ;)

I'll prepare a bugfix for both potential infinite loops in
2.4.0-ac18 right now...
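
To make the bounded-loop idea concrete, here is a minimal user-space sketch of the "maxscan" guard suggested earlier in this thread: a counter initialised to nr_lru_pages and decremented on every iteration. It is not the actual 2.4.0-ac patch. struct lru_page, scan_with_maxscan() and the array-as-ring model of the LRU are hypothetical, and it assumes that neither wrong-zone pages nor unfreeable pages consume "count".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page on the LRU; not a kernel structure. */
struct lru_page {
	int zone;	/* zone index, e.g. 0 = DMA, 1 = NORMAL */
	int freeable;	/* 0 models a mapped or otherwise pinned page */
};

/*
 * Pages we cannot free are rotated back onto the list (modelled here by
 * walking the array as a ring), so "count" alone does not bound the loop.
 * maxscan starts at nr_lru_pages and drops on every iteration, which
 * limits the scan to one full pass over the LRU.
 */
static int scan_with_maxscan(const struct lru_page *lru, size_t nr_lru_pages,
			     int target_zone, int count, size_t *iterations)
{
	size_t maxscan = nr_lru_pages;
	size_t pos = 0;
	int freed = 0;

	assert(iterations != NULL);
	*iterations = 0;
	while (count > 0 && maxscan > 0) {
		const struct lru_page *page = &lru[pos];

		pos = (pos + 1) % nr_lru_pages;
		maxscan--;
		(*iterations)++;

		/* Assumption: these pages are skipped without using count. */
		if (page->zone != target_zone || !page->freeable)
			continue;
		freed++;
		count--;
	}
	return freed;
}
```

With an LRU that holds only pages from the other zone, the call returns 0 after exactly nr_lru_pages iterations instead of spinning.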

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-14 17:33                                     ` Rik van Riel
@ 2000-06-14 18:37                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-14 18:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Juan J. Quintela, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On Wed, 14 Jun 2000, Rik van Riel wrote:

>Ermmm, I mean that trying to _balance_ the zone is the right
>thing to do. Consuming infinite CPU time when we can't succeed
>is a clear bug we want to fix.

Actually consuming CPU is the right thing to do. The other option is to
understand that the zone is all mlocked and that it isn't worth wasting CPU
there. If you're going to just break the kswapd loop after some time then
you're inserting a bug and you're making the VM even less robust.

What I was trying to explain is not how the VM reacts to overly large mlocked
regions, but how much the current design fails to see the whole picture
about the properties of the memory, and how it ends up doing something
very stupid in my testcase (the one with first mlocked and then cache). The
fact that it does something stupid is _only_ the symptom. Whatever you do
with mlocked accounting can only fix the symptom.

As soon as time permits I'll try to give another example of the current lack
of knowledge of the VM with respect to the property of the VM (and how
this ends up doing yet other silly things). These emails are very expensive
in terms of time and I need to do some more real coding now ;).

Andrea


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 23:07           ` Andrea Arcangeli
  2000-06-13 23:34             ` Rik van Riel
@ 2000-06-13 23:41             ` Juan J. Quintela
  2000-06-14  0:21               ` Andrea Arcangeli
  1 sibling, 1 reply; 28+ messages in thread
From: Juan J. Quintela @ 2000-06-13 23:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

>>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:

Hi

andrea> How can you be sure of that? So I'll make you an obvious case where
andrea> it will shrink not twice, not three times but _forever_.

andrea> Assume the pages_min of the normal zone watermark triggers when the normal
andrea> zone is allocated at 95% and assume that all such 95% of the normal zone
andrea> is been allocated all in mlocked memory and kernel mem_map_t array. Can't
andrea> somebody (for example an oracle database) allocate 95% of the normal zone
andrea> in mlocked shm memory? Do you agree? Or you are telling me it can't or
andrea> that if it does so it should then expect the linux kernel to explode
andrea> (actually it would cause kswapd to loop forever trying to free the normal
andrea> zone even if there's still 15mbyte of ZONE_DMA memory free).

andrea> So let's make the whole picture from the start starting with all the
andrea> memory free: assume oracle allocates all the normal zone in shm mlocked
andrea> memory. You still have 15mbyte free for the cache in the ZONE_DMA, OK?
andrea> Then you allocate the 95% of such 15mbyte in the cache and then kswapd
andrea> triggers and it will never stop because it will try to free the
andrea> zone_normal forever, even if it just recycled enough memory from the
andrea> ZONE_DMA (so even if __alloc_pages wouldn't start memory balancing
andrea> anymore!). See????

andrea> The classzone patch will fix the above bad behaviour completly because
andrea> kswapd in classzone will notice that there's enough memory for allocation
andrea> from both ZONE_DMA and ZONE_NORMAL because the cache in the ZONE_DMA is
andrea> been recycled successfully.

andrea> Without classzone you'll always get the above case wrong and I don't mind
andrea> if it's a corner case or not, we have to handle it right! I will hate a
andrea> kernel that works fine only as far as you only compile kernels on it.

I think that if you have a program that has mlocked 95% of your normal
memory you have two options:
       - tweak the values of freepages.{min,low,high}
       - buy more memory

What is the difference from the case where we mlocked *all* memory?
If we allocate all memory we don't expect the system to work.  Past a
certain limit, there is no way to solve the problem.  The limit right now
is freepages.high.  If you don't like that limit, change it.

Notice also that the current allocator will give the shm segment pages
in both the DMA zone and the normal zone, so the case where it
allocates all of the NORMAL zone but nothing of the DMA zone is not the
normal case, nor should it happen.  It should get its pages from the
DMA zone and the NORMAL zone.  If we have 95% of the DMA zone and
95% of the NORMAL zone mlocked, we are really in trouble....

>> I think you're overlooking the fact that kswapd's freeing of
>> pages is something that occurs only *once*...

andrea> Since the normal zone will never return over pages_low it will run more
andrea> than once.

As I said before, if you want to have 95% of your memory mlocked, you
should tweak the values of freepages.*

andrea> My argument of the classzone design is to get correctness in the corner
andrea> case: to fix the drawbacks.

andrea> Then I also included into such patch some performance stuff and that's why
andrea> it also improve performances siginficantly but I'm not interested about
andrea> such part for now. Since such part is stable as well you can get both
andrea> correctness and improvement at the same time but I can drop the
andrea> performance part if there will be an interest only on the other part.

At least _I_ am interested in *only* the performance part.  I would
like to test the current approach with your performance improvements and
then compare the designs.  I conceptually prefer the zones design, but
I can be proved wrong.

andrea> I don't mind about the other part of the email at this moment, I only mind
andrea> about the global design of the allocator at this moment.

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 23:41             ` Juan J. Quintela
@ 2000-06-14  0:21               ` Andrea Arcangeli
  0 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-14  0:21 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Rik van Riel, Stephen C. Tweedie, Zlatko Calusic, alan,
	Linux MM List, Linux Kernel List, Linus Torvalds

On 14 Jun 2000, Juan J. Quintela wrote:

>I think that if you have a program that mlocked 95% of your normal
>memory you have two options:
>       - tweak the values of freepages.{min,low,high}
>       - buy more memory

You don't need to buy more memory: you have the memory and it has just been
recycled in the previous pass of the kswapd loop. Why the heck should I
buy more memory because kswapd doesn't notice that it should stop looping
and that enough memory has _just_ been released? :)

>What is the difference with the case where we mlocked *all* memory.

The only difference is that in such a case there's no memory, but in the
scenario I described there is free memory and the VM is stupid and is
not using it.

>DMA zone and the NORMAL zone.  If we have the 95% of the DMA zone and
>the 95% of the NORMAL zone mlocked, we are really in problems....

Only 95% of the normal zone is mlocked, and a sane VM must continue to work
like a charm because it can shrink all the cache it wants from the first
15mbyte, which are all freeable and in cache.
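
The difference between the two checks in this scenario can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions, not code from either kernel tree. struct sim_zone, strict_needs_balance() and class_needs_balance() are hypothetical names, and the rule that a classzone compares free pages and watermarks summed over zones 0..k is an assumption about the classzone patch, not something taken from it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone counters; the field names mirror the thread. */
struct sim_zone {
	unsigned long free_pages;
	unsigned long pages_high;
};

/* Strict zones: kswapd has work while any single zone is below its mark. */
static int strict_needs_balance(const struct sim_zone *zones, size_t nr_zones)
{
	size_t i;

	assert(zones != NULL);
	for (i = 0; i < nr_zones; i++)
		if (zones[i].free_pages < zones[i].pages_high)
			return 1;
	return 0;
}

/*
 * Classzone (assumed rule): class k covers zones 0..k and is balanced
 * once the free pages summed over the class reach the summed watermark.
 */
static int class_needs_balance(const struct sim_zone *zones, size_t k)
{
	unsigned long free_sum = 0, high_sum = 0;
	size_t i;

	assert(zones != NULL);
	for (i = 0; i <= k; i++) {
		free_sum += zones[i].free_pages;
		high_sum += zones[i].pages_high;
	}
	return free_sum < high_sum;
}
```

With made-up numbers such as a DMA zone at 1000 free pages (pages_high 128) and a mostly mlocked normal zone stuck at 20 free pages (pages_high 256), the strict check keeps reporting work to do, while both class checks report that enough memory is available.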

>as I told before, if you want to have 95% of your memory mlocked, you
>should tweak the values of freepages.*

Ok so set freepages.{min,low,high} to zero, then oracle exits and then all
the normal zone is allocated in cache because you're reading emails, and
then such cache is not shrunk anymore because you set freepages.high to
zero. Setting one of the other watermarks to zero will lead to similar
side effects.

>then compare the design.  I conceptually preffer the zones desing, but
>I can be proved wrong.

I proved it to be wrong.

Andrea


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 15:08   ` Andrea Arcangeli
  2000-06-13 17:08     ` Juan J. Quintela
@ 2000-06-13 19:20     ` Rik van Riel
  2000-06-13 21:49       ` Andrea Arcangeli
  1 sibling, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2000-06-13 19:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Zlatko Calusic, alan, Linux MM List,
	Linux Kernel List

On Tue, 13 Jun 2000, Andrea Arcangeli wrote:
> On Mon, 12 Jun 2000, Stephen C. Tweedie wrote:
> 
> >Nice --- it might also explain some of the excessive kswap CPU 
> >utilisation we've seen reported now and again.
> 
> You have more kswapd load for sure due the strict zone approch.
> It maybe not noticeable but it's real.

Theoretically it's real, but having a certain number of free pages
around in the normal zone so we can do eg. process struct allocations
and slab allocations from there is well worth it. You may want to
closely re-read Linus' response to your classzone proposal some
weeks ago.

> I think Linus's argument about the above scenario is simply that
> the above isn't going to happen very often, but how can I ignore
> this broken behaviour? I hate code that works in the common case
> but that have drawbacks in the corner case.

Let me summarise the drawbacks of classzone and the strict zone
approach:

Strict zone approach:
- use slightly more memory, on the order of maybe 1 or 2%
- use slightly more kswapd cpu time since the free page goals
  are stricter

Classzone:
- can easily run out of 2- and 4-page contiguous areas of
  free memory in the normal zone, leading to the need to
  do allocation of task_structs and most slab caches from
  the dma zone
- this in turn will lead to the dma zone being less reliable
  when we need to allocate dma pages, or to a fork() failing
  with out of memory once we have a lot of processes on very
  big systems

Here you'll see that both systems have their advantages and
disadvantages. The zoned approach has a few (minimal) performance
disadvantages while classzone has a few stability disadvantages.
Personally I'd choose stability over performance any day, but that's
just me.

The big gains in classzone are most likely from the _other_ changes
that are somewhere inside the classzone patch. If we focus on
merging some of those (and maybe even improving some of the others
before merging), we can have a 2.4 which performs as good as or
better than the current classzone code but without the drawbacks.

Oh, btw, the classzone patch is vulnerable to the infinite-loop
in shrink_mmap too. Imagine a page shortage in the dma zone but
having only zone_normal pages on the lru queue ...

(and since the zone_normal classzone already has enough free pages,
shrink_mmap will find itself looping forever searching for freeable
zone_dma pages which aren't there)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] improve streaming I/O [bug in shrink_mmap()]
  2000-06-13 19:20     ` Rik van Riel
@ 2000-06-13 21:49       ` Andrea Arcangeli
  0 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2000-06-13 21:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Zlatko Calusic, alan, Linux MM List,
	Linux Kernel List

On Tue, 13 Jun 2000, Rik van Riel wrote:

>On Tue, 13 Jun 2000, Andrea Arcangeli wrote:
>> On Mon, 12 Jun 2000, Stephen C. Tweedie wrote:
>> 
>> >Nice --- it might also explain some of the excessive kswap CPU 
>> >utilisation we've seen reported now and again.
>> 
>> You have more kswapd load for sure due the strict zone approch.
>> It maybe not noticeable but it's real.
>
>Theoretically it's real, but having a certain number of free pages
>around in the normal zone so we can do eg. process struct allocations
>and slab allocations from there is well worth it. You may want to
>closely re-read Linus' response to your classzone proposal some
>weeks ago.

I read all of Linus's reply and I'm not missing anything.

>> I think Linus's argument about the above scenario is simply that
>> the above isn't going to happen very often, but how can I ignore
>> this broken behaviour? I hate code that works in the common case
>> but that have drawbacks in the corner case.
>
>Let me summarise the drawbacks of classzone and the strict zone
>approach:
>
>Strict zone approach:
>- use slightly more memory, on the order of maybe 1 or 2%
>- use slightly more kswapd cpu time since the free page goals
>  are stricter
>
>Classzone:
>- can easily run out of 2- and 4-page contiguous areas of
>  free memory in the normal zone, leading to the need to
>  do allocation of task_structs and most slab caches from
>  the dma zone

This is a very very red herring. Take 2.4.0-test1-aclatest and assume you
do some I/O and you fill all the normal zone (except the last pages_min
of course) with cache. Then you fork a task and you fall back to the DMA
zone, which is completely free, and the memory for the task_struct gets
allocated from the DMA zone with your design too!

Also, the memory you keep free in the normal zone for this purpose is at
most pages_high-pages_min, which is a very small margin that you can
trivially throw away if you are doing some I/O and you happen to allocate it
for the cache. Then you won't have any margin anymore and you'll allocate all
the persistent stuff from the DMA zone.

>- this in turn will lead to the dma zone being less reliable
>  when we need to allocate dma pages, [..]

The previous point was wrong, and this can also happen with the current
kernel. The fact is that kernel memory currently can't be relocated, and thus
you can't do anything to solve this problem except avoid allocating there
anything that can't be relocated, and then you could fail kernel
allocations even if you still have 16mbyte of free ram.

>  [..] or to a fork() failing
>  with out of memory once we have a lot of processes on very
>  big systems

What does fork have to do with this issue? The classzone patch will keep
enough memory free in the classzone that it's likely there are two contiguous
pages, so this point is completely irrelevant with regard to the
zone/classzone design.

>Here you'll see that both systems have their advantages and
>disadvantages. The zoned approach has a few (minimal) performance

As far as I can tell the only disadvantage of classzone is that the spinlock
has to be per-node and you have to keep collecting the information about
the classzone while allocating and freeing the pages.

>disadvantages while classzone has a few stability disadvantages.

IMHO it's the opposite. Classzone provides the correct behaviour but at a
potentially major fixed cost during allocations/deallocations, and the lock
is not per-zone anymore. However, the additional information that we
collect will let us avoid wasting CPU and memory, so it's not obvious that
classzone will decrease performance.

>Personally I'd chose stability over performance any day, but that's
>just me.

I fully agree and that's why I developed classzone in the first place.

>The big gains in classzone are most likely from the _other_ changes
>that are somewhere inside the classzone patch. If we focus on

Indeed.

>merging some of those (and maybe even improving some of the others
>before merging), we can have a 2.4 which performs as good as or
>better than the current classzone code but without the drawbacks.

IMHO it's the current kernel that has the drawbacks. Classzone is the
_fix_ for the drawbacks.

As for the other improvements, I agree they are completely orthogonal and I
agree to split them out and discuss them separately. In fact I have not
mentioned them in these emails. I'm only concerned about the zone design at
this moment.

>Oh, btw, the classzone patch is vulnerable to the infinite-loop
>in shrink_mmap too. Imagine a page shortage in the dma zone but
>having only zone_normal pages on the lru queue ...
>(and since the zone_normal classzone already has enough free pages,
>shrink_mmap will find itself looping forever searching for freeable
>zone_dma pages which aren't there)

You're obviously wrong: that can't happen with classzone! I intentionally
always put the pages outside the memclass onto a dispose list, so I simply
can't lock up there, and the code there in classzone is rock solid. Here is
the code from 2.4.0-test1-ac7-classzone-31:filemap.c:shrink_mmap():

	while (count > 0 && (page_lru = lru_head->prev) != lru_head) {
		page = list_entry(page_lru, struct page, lru);
		list_del(page_lru);

		dispose = &old;
		^^^^^^^^^^^^^^
		/* don't account passes over not DMA pages */
		if (!memclass(page->zone, zone))
			goto dispose_continue;

		count--;

It doesn't matter at all that the count-- comes after the goto
dispose_continue; that's not a bug, it's intentional: the count--
has to stay after the memclass check to give shrink_mmap
enough power to shrink the interesting classzones.
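
Why the loop above terminates even when count is never decremented can be checked with a small user-space model. This is a sketch under assumptions, not the classzone code itself. struct sim_page, sim_memclass() and sim_shrink_mmap() are hypothetical, memclass() is assumed to mean "the page's zone lies inside the class", and the list_del() onto the dispose list is modelled by simply never revisiting an array slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's definitions. */
enum sim_zone_id { SIM_ZONE_DMA = 0, SIM_ZONE_NORMAL = 1 };

struct sim_page {
	enum sim_zone_id zone;
	int freeable;
};

/* Assumed meaning of memclass(): the page's zone lies inside the class. */
static int sim_memclass(enum sim_zone_id page_zone, enum sim_zone_id class_zone)
{
	return page_zone <= class_zone;
}

/*
 * Walk lru[0..n). Advancing i models list_del(): a page moved to the
 * dispose list is never looked at again during this call, so the loop
 * ends after at most n iterations even when count is never decremented.
 */
static int sim_shrink_mmap(const struct sim_page *lru, size_t n,
			   enum sim_zone_id class_zone, int count,
			   size_t *scanned)
{
	size_t i = 0;
	int freed = 0;

	assert(scanned != NULL);
	while (count > 0 && i < n) {
		const struct sim_page *page = &lru[i++];

		/* don't account passes over pages outside the class */
		if (!sim_memclass(page->zone, class_zone))
			continue;	/* dispose_continue: count untouched */
		count--;
		if (page->freeable)
			freed++;
	}
	*scanned = i;
	return freed;
}
```

Called with an LRU of only zone_normal pages and the DMA class as target, the model frees nothing but stops once every page has been moved off the list.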

Andrea


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2000-06-14 18:37 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-06-13  8:10 [patch] improve streaming I/O [bug in shrink_mmap()] Roger Larsson
     [not found] <8i3qe8$lltbv$1@fido.engr.sgi.com>
2000-06-14  6:17 ` Rajagopal Ananthanarayanan
  -- strict thread matches above, loose matches on Subject: below --
2000-06-12 21:46 Zlatko Calusic
2000-06-12 22:29 ` Stephen C. Tweedie
2000-06-12 23:04   ` Rik van Riel
2000-06-13 15:08   ` Andrea Arcangeli
2000-06-13 17:08     ` Juan J. Quintela
2000-06-13 19:09       ` Andrea Arcangeli
2000-06-13 19:32         ` Rik van Riel
2000-06-13 23:07           ` Andrea Arcangeli
2000-06-13 23:34             ` Rik van Riel
2000-06-14  0:12               ` Andrea Arcangeli
2000-06-14  0:58                 ` Rik van Riel
2000-06-14  1:18                   ` Andrea Arcangeli
2000-06-14  1:33                     ` Rik van Riel
2000-06-14  2:10                       ` Andrea Arcangeli
2000-06-14  2:46                         ` Rik van Riel
2000-06-14 13:01                           ` Andrea Arcangeli
2000-06-14 13:44                             ` Rik van Riel
2000-06-14 13:57                               ` Andrea Arcangeli
2000-06-14 16:48                                 ` Rik van Riel
2000-06-14 17:14                                   ` Andrea Arcangeli
2000-06-14 17:33                                     ` Rik van Riel
2000-06-14 18:37                                       ` Andrea Arcangeli
2000-06-13 23:41             ` Juan J. Quintela
2000-06-14  0:21               ` Andrea Arcangeli
2000-06-13 19:20     ` Rik van Riel
2000-06-13 21:49       ` Andrea Arcangeli
