linux-mm.kvack.org archive mirror
* More VM balancing issues..
       [not found] <38D2A2E3.A2CEA602@av.com>
@ 2000-03-17 22:07 ` Linus Torvalds
  2000-03-17 22:23   ` Kanoj Sarcar
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2000-03-17 22:07 UTC
  To: linux-mm; +Cc: Ben LaHaise, Christopher Zimmerman, Stephen Tweedie

[ background: Christopher Zimmerman has had a number of problems with the
  CONFIG_HIGHMEM stuff: there was one rather serious NFS client bug where
  it used free_page() on the virtual address of a high-mem page etc. With
  that fixed it seems to be much more stable, but there still seem to be
  serious VM balancing issues. See my correspondence, and my theory.
  Comments, anyone?

  Christopher has a 2GB dual CPU machine - nice high-end box, nothing
  outrageous. Big fast disks and gigabit ethernet. ]

On Fri, 17 Mar 2000, Christopher Zimmerman wrote:
> Linus Torvalds wrote: 
> > On Fri, 17 Mar 2000, Christopher Zimmerman wrote:
> > >
> > > kswapd seems to be using 30% of the total CPU time whenever I push a box
> > > hard.  On some machines I get "VM: killing process webindexer" when I
> > > attempt to index some pages.  In some cases the machine just freezes
> > > without an oops.
> >
> > Can you check whether this still happens without CONFIG_HIGHMEM? I realize
> > that that will cause you to run with less effective memory, but I suspect
> > that the current VM balancing just gets the high page region totally
> > wrong.
> >
> >                 Linus
> 
> I take that back.  kswapd is now maxing at 30% CPU usage but averaging 15%.

Ok. 15% may just be normal, considering that you're dirtying a LOT of
pages. I have a hard time judging what the load really is, but I assume
it's under fairly high load at that point and the disks are just spinning
all the time..

My personal suspicion is that it's the cumulative sizing. I still don't
think that's the right thing to do, because it "penalizes" the higher
zones: it looks at the cumulative sizes of the zones up to that point to
determine how many free pages to aim for, so it tries to keep more free
memory available in the higher zones. Which is wrong, because
especially the highmem zone is NOT a zone that we are all that interested
in keeping free pages in. If anything, we want to make sure that the
_lower_ zones have the free pages. 

Christopher, with CONFIG_HIGHMEM enabled, what happens if you apply this
patch?

		Linus

-----
--- v2.3.99-pre1/linux/mm/page_alloc.c	Tue Mar 14 19:10:40 2000
+++ linux/mm/page_alloc.c	Fri Mar 17 14:05:50 2000
@@ -277,7 +277,8 @@
 
 				if (z->low_on_memory)
 					goto balance;
-			}
+			} else
+				z->low_on_memory = 0;
 		}
 		/*
 		 * This is an optimization for the 'higher order zone
@@ -549,7 +550,7 @@
 
 		zone->offset = offset;
 		cumulative += size;
-		mask = (cumulative / zone_balance_ratio[j]);
+		mask = (size / zone_balance_ratio[j]);
 		if (mask < zone_balance_min[j])
 			mask = zone_balance_min[j];
 		else if (mask > zone_balance_max[j])
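 
[ For reference, a reconstructed sketch - not verbatim 2.3.99 code, and
  the helper name is illustrative - of how that mask becomes the per-zone
  watermarks in free_area_init_core(). With the patch above, the mask
  derives from the zone's own size instead of the cumulative size of all
  lower zones: ]

static void setup_zone_watermarks(zone_t *zone, unsigned long size, int j)
{
	unsigned long mask = size / zone_balance_ratio[j];

	if (mask < zone_balance_min[j])
		mask = zone_balance_min[j];
	else if (mask > zone_balance_max[j])
		mask = zone_balance_max[j];

	zone->pages_min  = mask;	/* "critical": callers free synchronously */
	zone->pages_low  = mask * 2;	/* wake up kswapd below this */
	zone->pages_high = mask * 3;	/* kswapd keeps going until here */
}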


* Re: More VM balancing issues..
  2000-03-17 22:07 ` More VM balancing issues Linus Torvalds
@ 2000-03-17 22:23   ` Kanoj Sarcar
  2000-03-18  2:59     ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Kanoj Sarcar @ 2000-03-17 22:23 UTC
  To: Linus Torvalds
  Cc: linux-mm, Ben LaHaise, Christopher Zimmerman, Stephen Tweedie,
	Kanoj Sarcar

And while you are at it, could you try this patch too? This "fixes"
the issues I pointed out earlier in the recent balancing patch. This
is against 2.3.99-pre1.

Thanks.

Kanoj

--- mm/page_alloc.c	Wed Mar 15 09:30:24 2000
+++ mm/page_alloc.c	Fri Mar 17 14:19:26 2000
@@ -148,8 +148,10 @@
 
 	spin_unlock_irqrestore(&zone->lock, flags);
 
-	if (classfree(zone) > zone->pages_high)
+	if (classfree(zone) > zone->pages_high) {
 		zone->zone_wake_kswapd = 0;
+		zone->low_on_memory = 0;
+	}
 }
 
 #define MARK_USED(index, order, area) \
@@ -269,8 +271,11 @@
 			{
 				extern wait_queue_head_t kswapd_wait;
 
-				z->zone_wake_kswapd = 1;
-				wake_up_interruptible(&kswapd_wait);
+				if (free <= z->pages_low) {
+					z->zone_wake_kswapd = 1;
+					wake_up_interruptible(&kswapd_wait);
+				} else
+					z->zone_wake_kswapd = 0;
 
 				if (free <= z->pages_min)
 					z->low_on_memory = 1;

* Re: More VM balancing issues..
  2000-03-17 22:23   ` Kanoj Sarcar
@ 2000-03-18  2:59     ` Linus Torvalds
  2000-03-20 20:29       ` Kanoj Sarcar
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2000-03-18  2:59 UTC
  To: Kanoj Sarcar
  Cc: linux-mm, Ben LaHaise, Christopher Zimmerman, Stephen Tweedie

Kanoj,
 would you mind looking at the balancing idea in pre2-4? I put it out that
way, because it's the easiest way for me to show what I was thinking
about, but it may be something I just punt on for a real pre-2.4 kernel..

Basically, I never liked the thing that was based on adding up the total
and free pages of different zones. It gave us the old 2.2 behaviour (or
close to it), but it's a global decision on something that really is a
local issue, I think. And it definitely doesn't make sense on a NUMA thing
at all.

So I have this feeling that balancing really should be purely a per-zone
thing, and purely based on the size and freeness of that particular zone.
That would allow us to make clear decisions like "we want to keep 2% of
the regular zones free, but for the DMA zone we want to keep 10% free
because it more easily becomes a resource issue". 
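 
[ Expressed with purely per-zone targets, that policy could go straight
  into zone_balance_ratio[]; the values below are hypothetical and only
  illustrate the 2%/10% example (free target = zone size / ratio): ]

static int zone_balance_ratio[MAX_NR_ZONES] = {
	10,	/* ZONE_DMA:     keep ~1/10 (10%) of the zone free */
	50,	/* ZONE_NORMAL:  keep ~1/50 (2%) free */
	50,	/* ZONE_HIGHMEM: keep ~1/50 (2%) free */
};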

So my approach would be:
 - each zone is completely independent
 - when allocating from a zone-list, the act of allocation is the only
   thing that should care about the "conglomerate" of zones.

So what I do in pre2-4 is basically:
 - __alloc_pages() walks all zones. If it finds one that has "enough"
   pages, it will just allocate from the first such zone it finds.
 - if none of the zones have "enough" pages, it does a zone-list balance. 
 - the zone-list balance will walk the list of zones again, and do the
   right thing for each of them. It will return successfully if it was
   able to free up some memory (or if it decides that it's not critical
   and we could just start kswapd without even trying to free anything
   ourselves)
 - if the zonelist balance succeeded, __alloc_pages() will walk the zones
   again and try to allocate memory, this time regardless of whether they
   have "enough" memory (because we freed some memory we can do that).
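 
[ A minimal sketch of that flow, for illustration only - this is not the
  actual pre2-4 code. zone_balance_memory() and rmqueue() are the names
  used in the patches quoted in this thread; the "enough" test is
  sketched here as the low watermark: ]

static struct page * alloc_pages_sketch(zonelist_t *zonelist, unsigned long order)
{
	zone_t **zp, *z;
	struct page *page;

	/* Pass 1: allocate from the first zone that has "enough" free pages. */
	for (zp = zonelist->zones; (z = *zp) != NULL; zp++) {
		if (z->free_pages > z->pages_low) {
			page = rmqueue(z, order);
			if (page)
				return page;
		}
	}

	/* No zone had "enough": balance the zone-list as a whole.  This
	 * either frees some memory, or just kicks kswapd if things are
	 * not yet critical. */
	if (!zone_balance_memory(zonelist))
		return NULL;

	/* Pass 2: we freed something, so try every zone again, this
	 * time regardless of the watermarks. */
	for (zp = zonelist->zones; (z = *zp) != NULL; zp++) {
		page = rmqueue(z, order);
		if (page)
			return page;
	}
	return NULL;
}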

This avoids making any global decisions: it works naturally whatever the
zone-list looks like. It still tries to allocate from the first zones in
the list, so it still has the advantage of leaving the DMA zone pretty
quiescent, as it's the last zone on the lists - so the DMA zone will
tend to have "enough" pages.

What my patch does NOT do is to change the zone_balance_ratio[] stuff etc,
but I think that with this approach it is now truly meaningful to do that,
and that we now really _can_ try to keep the DMA area at a certain
percentage etc..

		Linus


* Re: More VM balancing issues..
  2000-03-18  2:59     ` Linus Torvalds
@ 2000-03-20 20:29       ` Kanoj Sarcar
  2000-03-20 21:27         ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Kanoj Sarcar @ 2000-03-20 20:29 UTC
  To: Linus Torvalds
  Cc: linux-mm, Ben LaHaise, Christopher Zimmerman, Stephen Tweedie,
	Kanoj Sarcar

Okay, here it comes, you asked for it ... You know most of it anyway, 
but seeing it all together might help.

1. In a theoretical sense, there are _only_ memory classes. DMA class 
memory, direct mapped class memory and the rest. Code will ask for a 
dma, regular or other class memory (proactive balancing is needed for 
intr context allocations or otherwise when page stealing is impossible
or deadlock prone). Hence, theoretically, it makes sense to decide
how many pages in each memory _class_ we want to keep free for such
requests (based on application type, #cpu, memory, devices and fs
activity). Decisions on when pages need to be stolen should really be
_class_ based.

2. Linux uses zones to implement memory classes. The DMA zone represents
the DMA class, the DMA+regular zones represent the regular class, and the
DMA+regular+himem zones represent the other class. Theoretically, that is
why decisions on page stealing need to be cumulative over the zones.
(This explains why I did most of the code that way).
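 
[ On x86 that inclusive structure amounts to fallback lists like the
  following sketch (the ZONE_* indices are from the kernel of that era;
  the zone_array name and the table itself are illustrative): ]

extern zone_t zone_array[MAX_NR_ZONES];

zone_t *dma_class[]     = { &zone_array[ZONE_DMA], NULL };
zone_t *regular_class[] = { &zone_array[ZONE_NORMAL],
			    &zone_array[ZONE_DMA], NULL };
zone_t *highmem_class[] = { &zone_array[ZONE_HIGHMEM],
			    &zone_array[ZONE_NORMAL],
			    &zone_array[ZONE_DMA], NULL };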

3. Implementation can of course diverge from theory (like using NRU 
in place of LRU). In Documentation/vm/balance, I have tried laying
down the pros and cons of local vs cumulative balancing:

"In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly of the size of lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is, while balancing, we do not need to look at sizes
of lower class zones, the bad part is, we might do too frequent balancing
due to ignoring possibly lower usage in the lower class zones. Also,
with a slight change in the allocation routine, it is possible to reduce
the memclass() macro to be a simple equality.

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future."

4. In 2.3.50 and pre1, zone_balance_ratio[] is the ratio of each _class_
of memory that you want free, which is intuitive.

5. For true NUMA machines, there will be memory nodes, and each node
will possibly have dma/regular/himem zones. For memory-hole architectures,
ie DISCONTIG machines, there will again be nodes, but there will be a
lot of nodes with only one class of memory (don't know yet, there are 
not too many people working on this).


Coming specifically to the 2.3.99-pre2 code, I see a couple of bugs:
1. __alloc_pages needs to return NULL instead of doing zone_balance_memory
for the PF_MEMALLOC case.

	if (current->flags & PF_MEMALLOC)
		return(NULL);
	if (zone_balance_memory(zonelist)) {

2. The body of zone_balance_memory() should be replaced with the pre1
code, otherwise there are too many differences/problems to enumerate. 
Unless you are also proposing changes in this area.

I attach a patch against 2.3.99-pre2 to fix these.

The other issues are:
1. In the face of races, you probably want to loop back in __alloc_pages
after zone_balance_memory() returns success. Something like
	if (zone_balance_memory(zonelist)) {
		if (retry)		/* already retried once, give up */
			return(NULL);
		retry++;
		goto tryagain;		/* walk the zones again */
	}

2. Due to purely zone-local computation, the pre2 version will more easily
fall back to lower zones while allocating memory (when it is not necessary).
Especially interesting will be cases where the regular zone is much smaller
than the dma zone, or the himem zone is tiny compared to the regular zone. 
So, gone will be the protection that dma and regular zones enjoyed in 
older versions. 

Kanoj


--- mm/page_alloc.c	Mon Mar 20 09:38:48 2000
+++ mm/page_alloc.c	Mon Mar 20 11:48:02 2000
@@ -152,10 +152,10 @@
 
 	spin_unlock_irqrestore(&zone->lock, flags);
 
-	if (zone->free_pages > zone->pages_high) {
+	if (zone->free_pages > zone->pages_low)
 		zone->zone_wake_kswapd = 0;
+	if (zone->free_pages > zone->pages_high)
 		zone->low_on_memory = 0;
-	}
 }
 
 #define MARK_USED(index, order, area) \
@@ -233,21 +233,22 @@
 	zone = zonelist->zones;
 	for (;;) {
 		zone_t *z = *(zone++);
+		unsigned long free;
 		if (!z)
 			break;
-		if (z->free_pages > z->pages_low)
-			continue;
-
-		z->zone_wake_kswapd = 1;
-		wake_up_interruptible(&kswapd_wait);
+		free = z->free_pages;
+		if (free <= z->pages_high) {
+			if (free <= z->pages_low) {
+				z->zone_wake_kswapd = 1;
+				wake_up_interruptible(&kswapd_wait);
+			}
+			if (free <= z->pages_min)
+				z->low_on_memory = 1;
+		}
 
 		/* Are we reaching the critical stage? */
-		if (!z->low_on_memory) {
-			/* Not yet critical, so let kswapd handle it.. */
-			if (z->free_pages > z->pages_min)
-				continue;
-			z->low_on_memory = 1;
-		}
+		if (!z->low_on_memory)
+			continue;
 		/*
 		 * In the atomic allocation case we only 'kick' the
 		 * state machine, but do not try to free pages
@@ -307,6 +308,8 @@
 				return page;
 		}
 	}
+	if (current->flags & PF_MEMALLOC)
+		return(NULL);
 	if (zone_balance_memory(zonelist)) {
 		zone = zonelist->zones;
 		for (;;) {

* Re: More VM balancing issues..
  2000-03-20 20:29       ` Kanoj Sarcar
@ 2000-03-20 21:27         ` Linus Torvalds
  2000-03-20 22:17           ` Kanoj Sarcar
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2000-03-20 21:27 UTC
  To: Kanoj Sarcar
  Cc: linux-mm, Ben LaHaise, Christopher Zimmerman, Stephen Tweedie


On Mon, 20 Mar 2000, Kanoj Sarcar wrote:
> 
> 1. In a theoretical sense, there are _only_ memory classes. DMA class 
> memory, direct mapped class memory and the rest.

Definitely. I agree 100% that "classes" are really the fundamental factor
of allocation (and thus of balancing).  And I tried to preserve that,
exactly by having the zone_balance() thing become a "class" thing by
walking the list of zones that constitutes a class.

>					 Code will ask for a 
> dma, regular or other class memory (proactive balancing is needed for 
> intr context allocations or otherwise when page stealing is impossible
> or deadlock prone). Hence, theoretically, it makes sense to decide
> how many pages in each memory _class_ we want to keep free for such
> requests (based on application type, #cpu, memory, devices and fs
> activity). Decisions on when pages need to be stolen should really be
> _class_ based.

Yes. However, I think that in general is quite a difficult problem, given
that the zones that constitute classes are not at all necessarily
inclusive.

They happen to be inclusive on x86 (ie DMA <= direct-mapped <=
everything), but I think it is a mistake to consider that a design. It's
obviously not true on NUMA if you have per-CPU classes that fall back onto
other CPU's zones. I would imagine, for example, that on NUMA the best
arrangement would be something like

 - when making a NODE1 allocation, the "class" list is

	NODE1, NODE2, NODE3, NODE4, NULL

 - when making a NODE2 allocation it would be

	NODE2, NODE3, NODE4, NODE1, NULL

 etc...

(So each node would preferentially always allocate from its own zone, but
would fall back on other nodes memory if the local zone fills up).
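 
[ A hypothetical sketch of building such round-robin per-node class
  lists; every name in it is illustrative, and as Kanoj notes below, real
  numa code may order the nodes dynamically: ]

#define NR_NODES 4

static zone_t *node_class[NR_NODES][NR_NODES + 1];

static void build_node_classes(zone_t *node_zone[NR_NODES])
{
	int node, i;

	for (node = 0; node < NR_NODES; node++) {
		/* Own zone first, then the other nodes in order. */
		for (i = 0; i < NR_NODES; i++)
			node_class[node][i] = node_zone[(node + i) % NR_NODES];
		node_class[node][NR_NODES] = NULL;	/* terminate the list */
	}
}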

With something like the above, there is no longer any true inclusion. Each
class covers an "equal" number of zones, but has a different structure.

So I think the ordering is important (it implies preferences within a
class), but I don't think the total inclusion is.

> 2. Linux uses zones to implement memory classes. The DMA zone represents
> the DMA class, the DMA+regular zones represent the regular class, and the
> DMA+regular+himem zones represent the other class. Theoretically, that is
> why decisions on page stealing need to be cumulative over the zones.
> (This explains why I did most of the code that way).

With strictly ordered classes, the cumulative approach works. It doesn't
work for anything that is only partially ordered.

> 4. In 2.3.50 and pre1, zone_balance_ratio[] is the ratio of each _class_
> of memory that you want free, which is intuitive.

I disagree about the "intuitive" part. Yes, zone_balance_ratio is how each
class was to be balanced, but it's definitely not intuitive, with
different zones being in different classes, and the actual tests being
done per zone.

> Coming specifically to the 2.3.99-pre2 code, I see a couple of bugs:
> 1. __alloc_pages needs to return NULL instead of doing zone_balance_memory
> for the PF_MEMALLOC case.

Yup. I actually had this on my mental list, but it got dropped.

> 2. The body of zone_balance_memory() should be replaced with the pre1
> code, otherwise there are too many differences/problems to enumerate. 
> Unless you are also proposing changes in this area.

The pre1 code was broken, and never checked pages_low. The changes were
definitely pre-meditated - trying to think of the balancing as a "list of
zones" issue.

And I think it's fine that kswapd continues to run until we reach "high".
Your patch makes kswapd stop when it reaches "low", but that makes kswapd
go into this kind of "start/stop/start/stop" behaviour at around the "low"
watermark.

Maybe you meant to clear the flags the other way around: keep kswapd
running until it hits high, but remove the "low_on_mem" flag when we are
above "low" (but we've gotten away from "min"). That might work, but I
think clearing both flags at "high" is actually the right thing to do,
because that way we will not get into a state where kswapd runs all the
time because somebody is still allocating pages without helping to free
anything up.

The pre-3 behaviour is: if you ever hit "min", you set a flag that means
"ok, kswapd can't do this on its own, and needs some help from the people
that allocate memory all the time". If you think of it that way, I think
you'll agree that it shouldn't be cleared until after kswapd says
everything is ok again.

I don't know..

		Linus


* Re: More VM balancing issues..
  2000-03-20 21:27         ` Linus Torvalds
@ 2000-03-20 22:17           ` Kanoj Sarcar
  0 siblings, 0 replies; 9+ messages in thread
From: Kanoj Sarcar @ 2000-03-20 22:17 UTC
  To: Linus Torvalds
  Cc: linux-mm, Ben LaHaise, Christopher Zimmerman, Stephen Tweedie

> 
> They happen to be inclusive on x86 (ie DMA <= direct-mapped <=
> everything), but I think it is a mistake to consider that a design. It's
> obviously not true on NUMA if you have per-CPU classes that fall back onto
> other CPU's zones. I would imagine, for example, that on NUMA the best
> arrangement would be something like
> 
>  - when making a NODE1 allocation, the "class" list is
> 
> 	NODE1, NODE2, NODE3, NODE4, NULL
> 
>  - when making a NODE2 allocation it would be
> 
> 	NODE2, NODE3, NODE4, NODE1, NULL
> 
>  etc...
> 
> (So each node would preferentially always allocate from its own zone, but
> would fall back on other nodes memory if the local zone fills up).

Okay, I think the crux of this discussion lies in this statement. I do
not believe this is what the numa code will do, but note that we are
not 100% certain at this stage. The numa code will be layered on top
of the generic code (the primary goal being that generic code should be
impacted minimally by numa), so for example, the numa version of
alloc_pages() will invoke __alloc_pages() on different nodes. The
other thing to note is, the sequence of nodes to allocate is not
static, but dynamic (depending on other data structures that numa
code will track). This gives the most flexibility to numa code to
do the best thing performance-wise for a wide variety of apps
under different situations. So a priori, you cannot claim the class
list for NODE1 allocation will be "NODE1, NODE2, NODE3, NODE4, NULL".
I am cc'ing Hubertus Franke from IBM; we have been working on numa
issues together.

> 
> With something like the above, there is no longer any true inclusion. Each
> class covers an "equal" number of zones, but has a different structure.
>
 
The only example I can think of is a hole architecture, as I mentioned
before, but even that can be handled with a "true inclusion" assumption.

Unless you can point to a processor/architecture to the contrary, for
the 2.4 timeframe, I would think we can assume true inclusion. (And that
will be true even if we come up with a ZONE_PCI32 for 64bit machines).

> > 2. The body of zone_balance_memory() should be replaced with the pre1
> > code, otherwise there are too many differences/problems to enumerate. 
> > Unless you are also proposing changes in this area.
> 
> The pre1 code was broken, and never checked pages_low. The changes were
> definitely pre-meditated - trying to think of the balancing as a "list of
> zones" issue.

Agreed, I pointed out the breakage when the balancing patch was sent out. 
I patched the pre1 code to get back to 2.3.50 behavior, and Christopher
Zimmerman <zim@av.com> tested it out.

> 
> And I think it's fine that kswapd continues to run until we reach "high".
> Your patch makes kswapd stop when it reaches "low", but that makes kswapd
> go into this kind of "start/stop/start/stop" behaviour at around the "low"
> watermark.
> 
> Maybe you meant to clear the flags the other way around: keep kswapd
> running until it hits high, but remove the "low_on_mem" flag when we are
> above "low" (but we've gotten away from "min"). That might work, but I
> think clearing both flags at "high" is actually the right thing to do,
> because that way we will not get into a state where kswapd runs all the
> time because somebody is still allocating pages without helping to free
> anything up.
> 

Okay, that is a change on top of 2.3.50 behavior; it can be easily
implemented. As I mention in Documentation/vm/balance, low_on_memory
is a hysteresis flag while zone_wake_kswapd/kswapd poking is not; we
can change that. Do you want me to create a new patch against
2.3.99-pre2?

Kanoj

> The pre-3 behaviour is: if you ever hit "min", you set a flag that means
> "ok, kswapd can't do this on its own, and needs some help from the people
> that allocate memory all the time". If you think of it that way, I think
> you'll agree that it shouldn't be cleared until after kswapd says
> everything is ok again.
> 
> I don't know..
> 
> 		Linus
> 


* Re: More VM balancing issues..
  2000-03-17 23:31 ` Linus Torvalds
@ 2000-03-17 23:51   ` Kanoj Sarcar
  0 siblings, 0 replies; 9+ messages in thread
From: Kanoj Sarcar @ 2000-03-17 23:51 UTC
  To: Linus Torvalds; +Cc: Christopher Zimmerman, linux-mm

> 
> 
> Oh, I found another problem: when the VM balancing was rewritten, the
> "pages_low" thing was still calculated, but nothing actually USED it.

2.3.50 did.

This is one of the things I pointed out in the recent balancing patch. 
The patch I sent to Christopher tries to fix it and go back to 2.3.50
behavior.

Again, Documentation/vm/balance has comments about this which were
true till 2.3.50.

AFAIR, Andrea put this stuff into 2.3, around the 2.3.40 timeframe.


> 
> So we had three water-marks: "enough for anything", "low on memory" and
> "critical".
> 
> And we somehow lost the "low on memory" and only used the "enough" and
> "critical" to do all comparisons.

This was also done properly in 2.3.50, and changed in the recent patch.

Maybe it's time that tweaks to the balancing code get documented in
Documentation/vm/balance. It's easier all around to keep track of
what's happening.

Kanoj
> 
> Which makes for a _very_ choppy balance, and is definitely wrong.
> 
> The behaviour should be something like:
>  - whenever we dip below "low", we wake up kswapd. kswapd remains awake
>    (for that zone) until we reach "enough".
>  - whenever we dip below "critical", we start doing synchronous memory
>    freeing ourselves. We continue to do that until we reach "low" again
>    (at which point kswapd will still continue in the background, but we
>    don't depend on the synchronous freeing any more).
> 
> but for some time we appear to have gotten this wrong, and lost the "low"
> mark, and used the "critical" and "high" marks only. 
> 
> Or maybe somebody did some testing and decided to disagree with the old
> three-level thing based on actual numbers? The only coding I've done has
> been based on "this is how I think it should work, and because I'm always
> right it's obviously the way it _should_ work". Which is not always the
> approach that gets the best results ;)
> 
> 		Linus
> 


* Re: More VM balancing issues..
       [not found] <38D2BB5C.AC4A89C9@av.com>
  2000-03-17 23:15 ` Linus Torvalds
@ 2000-03-17 23:31 ` Linus Torvalds
  2000-03-17 23:51   ` Kanoj Sarcar
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2000-03-17 23:31 UTC
  To: Christopher Zimmerman; +Cc: Kanoj Sarcar, linux-mm


Oh, I found another problem: when the VM balancing was rewritten, the
"pages_low" thing was still calculated, but nothing actually USED it.

So we had three water-marks: "enough for anything", "low on memory" and
"critical".

And we somehow lost the "low on memory" and only used the "enough" and
"critical" to do all comparisons.

Which makes for a _very_ choppy balance, and is definitely wrong.

The behaviour should be something like:
 - whenever we dip below "low", we wake up kswapd. kswapd remains awake
   (for that zone) until we reach "enough".
 - whenever we dip below "critical", we start doing synchronous memory
   freeing ourselves. We continue to do that until we reach "low" again
   (at which point kswapd will still continue in the background, but we
   don't depend on the synchronous freeing any more).

but for some time we appear to have gotten this wrong, and lost the "low"
mark, and used the "critical" and "high" marks only. 
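 
[ Sketched on the allocation path, with the field names used in the
  patches elsewhere in this thread (illustrative, not actual kernel
  code): ]

	if (z->free_pages <= z->pages_low) {
		/* Below "low": wake up kswapd.  It stays awake for this
		 * zone until free pages climb back above pages_high. */
		z->zone_wake_kswapd = 1;
		wake_up_interruptible(&kswapd_wait);

		if (z->free_pages <= z->pages_min) {
			/* Below "critical": allocators must also free
			 * memory synchronously, and keep doing so until
			 * the zone is back above "low". */
			z->low_on_memory = 1;
		}
	}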

Or maybe somebody did some testing and decided to disagree with the old
three-level thing based on actual numbers? The only coding I've done has
been based on "this is how I think it should work, and because I'm always
right it's obviously the way it _should_ work". Which is not always the
approach that gets the best results ;)

		Linus


* Re: More VM balancing issues..
       [not found] <38D2BB5C.AC4A89C9@av.com>
@ 2000-03-17 23:15 ` Linus Torvalds
  2000-03-17 23:31 ` Linus Torvalds
  1 sibling, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2000-03-17 23:15 UTC
  To: Christopher Zimmerman; +Cc: Kanoj Sarcar, linux-mm


On Fri, 17 Mar 2000, Christopher Zimmerman wrote:
>
> No, that didn't seem to help.  In fact the machine (1GB) just froze after a while.

The 1GB case is actually the most interesting of all, because in the 1GB
case you end up having a _really_ small "high memory" zone, I think (just 
a small zone comprising the pages that can't be used in the normal
memory area due to needing some kernel VM space etc).

Which means that the balancing probably gets rather interesting for that
exact case.

Anyway, my patch was buggy - it made the per-zone "pages_high" depend on
only the zone size, but still left the actual comparisons against the
"class" free pages count. Which just can't be right.

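[ The thinko in one line - a per-zone watermark compared against the
  class-wide free count (classfree() sums free pages over the zone and
  all lower zones, as in Kanoj's patch earlier in this thread): ]

	if (classfree(zone) > zone->pages_high)	/* class count vs zone mark */
		zone->zone_wake_kswapd = 0;
 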
I'd like to try a "local decisions only" version of this, with no classes
etc. That's the simplest case, and it's the only case that I'm reasonably
confident cannot have any really strange behaviour due to pathologically
small zones, etc.

> I'm going to try out Kanoj's patch next.  I also tried it out on the 2GB box and
> got an immediate highmem.c oops.  Maybe that oops hasn't been fully resolved.
> If it happens again I'll send you the info.

I'd sure like to see the oops. It may be that the strange balancing caused
by the thinko in my patch (see above) actually causes a hidden bug
somewhere to materialize.. I don't think the thinko itself should cause
any oopses, just strange balancing behaviour.

		Linus

