* Re: [PATCH] Separate global/perzone inactive/free shortage
@ 2001-07-16 13:56 Bulent Abali
2001-07-16 15:56 ` Stephen C. Tweedie
2001-07-18 8:54 ` Mike Galbraith
0 siblings, 2 replies; 21+ messages in thread
From: Bulent Abali @ 2001-07-16 13:56 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Mike Galbraith, Marcelo Tosatti, Rik van Riel, Dirk Wetter, linux-mm
>> On Sat, 14 Jul 2001, Marcelo Tosatti wrote:
>
>> On highmem machines, wouldn't it save a LOT of time to prevent allocation
>> of ZONE_DMA as VM pages? Or, if we really need to, get those pages into
>> the swapcache instantly? Crawling through nearly 4 gig of VM looking for
>> 16 MB of ram has got to be very expensive. Besides, those pages are just
>> too precious to allow some user task to sit on them.
>
> Can't we balance that automatically?
>
> Why not just round-robin between the eligible zones when allocating,
> biasing each zone based on size? On a 4GB box you'd basically end up
> doing 3 times as many allocations from the highmem zone as the normal
> zone and only very occasionally would you try to dig into the dma
> zone.
>
> Cheers,
>  Stephen
If I understood page_alloc.c:build_zonelists() correctly,
ZONE_HIGHMEM's zonelist falls back to ZONE_NORMAL, which in turn
falls back to ZONE_DMA. Allocations (other than GFP_DMA ones) will
dip into the dma zone only when there are no highmem and/or normal
zone pages available. So the current method is more conservative
(better) than round-robin, it seems to me.
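To make the picture concrete, this is roughly the shape of the
fallback lists as I read them (a simplified sketch, not the exact
2.4 code; names abbreviated):

/* Sketch: each allocation class gets an ordered fallback list
 * ending at ZONE_DMA, so the dma zone is only a last resort. */
static void build_zonelist_sketch(pg_data_t *pgdat, zone_t **zones, int highest)
{
	int j = 0;

	if (highest >= ZONE_HIGHMEM)
		zones[j++] = pgdat->node_zones + ZONE_HIGHMEM;
	if (highest >= ZONE_NORMAL)
		zones[j++] = pgdat->node_zones + ZONE_NORMAL;
	zones[j++] = pgdat->node_zones + ZONE_DMA;	/* last resort */
	zones[j] = NULL;	/* terminator for the allocator's walk */
}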
I think Marcelo is proposing to make ZONE_DMA exclusive in large
memory machines, which might make it better for allocators
needing ZONE_DMA pages...
Bulent
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 13:56 [PATCH] Separate global/perzone inactive/free shortage Bulent Abali
@ 2001-07-16 15:56 ` Stephen C. Tweedie
2001-07-16 19:04 ` Rik van Riel
2001-07-18 8:54 ` Mike Galbraith
1 sibling, 1 reply; 21+ messages in thread
From: Stephen C. Tweedie @ 2001-07-16 15:56 UTC (permalink / raw)
To: Bulent Abali
Cc: Stephen C. Tweedie, Mike Galbraith, Marcelo Tosatti,
Rik van Riel, Dirk Wetter, linux-mm
On Mon, Jul 16, 2001 at 09:56:58AM -0400, Bulent Abali wrote:
> >
> > Why not just round-robin between the eligible zones when allocating,
> > biasing each zone based on size? On a 4GB box you'd basically end up
> > doing 3 times as many allocations from the highmem zone as the normal
> > zone and only very occasionally would you try to dig into the dma
> > zone.
> >
> > Cheers,
> >  Stephen
>
> If I understood page_alloc.c:build_zonelists() correctly,
> ZONE_HIGHMEM's zonelist falls back to ZONE_NORMAL, which in turn
> falls back to ZONE_DMA. Allocations (other than GFP_DMA ones) will
> dip into the dma zone only when there are no highmem and/or normal
> zone pages available. So the current method is more conservative
> (better) than round-robin, it seems to me.
On a 20MB box with 16MB DMA zone and 4MB NORMAL zone, a low rate of
allocations will be continually satisfied from the NORMAL zone
resulting in constant aging and pageout within that zone, but with no
pressure at all on the larger 16MB DMA zone. That's hardly fair.
Likewise for the small 100MB HIGHMEM zone you get at the top of memory
on a 1GB box.
Weighted round-robin has the advantage of not needing to be
special-cased for different sizes of machine.
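As a sketch of what I mean (purely illustrative; the per-node
counter and helper name are made up):

/* Size-weighted round-robin zone selection. On a 4GB box the
 * ~3GB highmem zone gets ~3 of every 4 picks; the 16MB DMA zone
 * roughly 1 in 256. */
static zone_t *pick_zone(pg_data_t *pgdat, unsigned long *rr_counter)
{
	unsigned long total = 0, pick;
	int i;

	for (i = 0; i < MAX_NR_ZONES; i++)
		total += pgdat->node_zones[i].size;

	/* map the counter onto the zones in proportion to size */
	pick = (*rr_counter)++ % total;
	for (i = 0; i < MAX_NR_ZONES; i++) {
		if (pick < pgdat->node_zones[i].size)
			return pgdat->node_zones + i;
		pick -= pgdat->node_zones[i].size;
	}
	return pgdat->node_zones;	/* not reached */
}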
Cheers,
Stephen
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 15:56 ` Stephen C. Tweedie
@ 2001-07-16 19:04 ` Rik van Riel
0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2001-07-16 19:04 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Bulent Abali, Mike Galbraith, Marcelo Tosatti, Dirk Wetter, linux-mm
On Mon, 16 Jul 2001, Stephen C. Tweedie wrote:
> On a 20MB box with 16MB DMA zone and 4MB NORMAL zone, a low rate of
> allocations will be continually satisfied from the NORMAL zone
> resulting in constant aging and pageout within that zone, but with no
> pressure at all on the larger 16MB DMA zone. That's hardly fair.
It shouldn't. Pages in both zones get aged equally,
leading to both zones getting above the various
allocation watermarks in turn and getting pages
allocated from them in turn.
If what you are describing is happening, we have a
bug in the implementation of the current scheme.
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 13:56 [PATCH] Separate global/perzone inactive/free shortage Bulent Abali
2001-07-16 15:56 ` Stephen C. Tweedie
@ 2001-07-18 8:54 ` Mike Galbraith
2001-07-18 10:18 ` Stephen C. Tweedie
2001-07-18 15:07 ` Dave McCracken
1 sibling, 2 replies; 21+ messages in thread
From: Mike Galbraith @ 2001-07-18 8:54 UTC (permalink / raw)
To: Bulent Abali
Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Dirk Wetter, linux-mm
On Mon, 16 Jul 2001, Bulent Abali wrote:
> >> On Sat, 14 Jul 2001, Marcelo Tosatti wrote:
> >
> >> On highmem machines, wouldn't it save a LOT of time to prevent allocation
> >> of ZONE_DMA as VM pages? Or, if we really need to, get those pages into
> >> the swapcache instantly? Crawling through nearly 4 gig of VM looking for
> >> 16 MB of ram has got to be very expensive. Besides, those pages are just
> >> too precious to allow some user task to sit on them.
> >
> > Can't we balance that automatically?
> >
> > Why not just round-robin between the eligible zones when allocating,
> > biasing each zone based on size? On a 4GB box you'd basically end up
> > doing 3 times as many allocations from the highmem zone as the normal
> > zone and only very occasionally would you try to dig into the dma
> > zone.
> >
> > Cheers,
> >  Stephen
>
> If I understood page_alloc.c:build_zonelists() correctly,
> ZONE_HIGHMEM's zonelist falls back to ZONE_NORMAL, which in turn
> falls back to ZONE_DMA. Allocations (other than GFP_DMA ones) will
> dip into the dma zone only when there are no highmem and/or normal
> zone pages available. So the current method is more conservative
> (better) than round-robin, it seems to me.
Not really. As soon as ZONE_NORMAL is engaged such that free_pages
hits pages_low, we will pilfer ZONE_DMA. That's guaranteed to happen
because that's exactly what we balance for. Once ZONE_NORMAL reaches
pages_low, we will fall back to allocating ZONE_DMA exclusively.
(problem yes?.. if agree, skip to 'possible solution' below;)
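Roughly, the walk looks like this (a simplified sketch of the 2.4
__alloc_pages() fallback, not the literal code):

static struct page *alloc_sketch(zonelist_t *zonelist, unsigned long order)
{
	zone_t **zp = zonelist->zones;
	zone_t *z;

	while ((z = *zp++) != NULL) {
		/* once ZONE_NORMAL sits at pages_low, every GFP_KERNEL
		 * allocation slides straight on to the next -- and
		 * last -- entry: ZONE_DMA */
		if (z->free_pages > z->pages_low)
			return rmqueue(z, order);
	}
	return NULL;	/* all zones low: wake kswapd and try harder */
}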
Thinking about doing a find /usr on my box: we commit ZONE_NORMAL,
then transition to exclusive use of ZONE_DMA instantly. These pages
will go to kernel structures. (except for metadata. Metadata will
be aged/laundered, and become available for more kernel structures)
The tendency is for ever-increasing quantities to become invisible
to the balancing mechanisms.
Thinking about Dirk's logs of rsync leads me to believe that this must
be the case. kreclaimd is eating cpu. It can't possibly be any other
zone. When rsync has had 30 minutes of cpu, kswapd has had 40 minutes.
kreclaimd has eaten 15 solid minutes. It can't possibly accumulate
that much time unless ZONE_DMA is the problem.. the other zones are
just too easy to find/launder/reclaim.
Much worse is the case of Dirk's two 2gig simulations on a dual cpu
4gig box. It is guaranteed to allocate all of the dma zone to his two
tasks' vm. It is also guaranteed to unbalance the zone. Doesn't that
also guarantee that we will walk pagetables endlessly each and every
time a ZONE_DMA page is issued? Try to write out a swapcache page,
and you might get a dma page, try to do _anything_ and you might get
a ZONE_DMA page. With per zone balancing, you will turn these pages
over much much faster than before, and the aging will be unfair in
the extreme.. it absolutely has to be. SCT's suggestion would make
the total pressure equal, but it would not (could not) correct the
problem of searching for this handful of pages, the very serious cost
of owning a ZONE_DMA page, nor the problem of a specific request for
GFP_DMA pages having a reasonable chance of succeeding.
IMHO, it is really really bad that any fallback allocation can also
bring the dma zone into critical, and these allocations may end up in
kernel structures which are invisible to the balancing logic, making
a search a complete waste of time. In any case, on a machine with
lots of ram, the search is going to be disproportionately expensive
due to the size of the search area.
Possible solution:
Effectively reserving the last ~meg (pick a number, scaled by ramsize
would be better) of ZONE_DMA for real GFP_DMA allocations would cure
Dirk's problem I bet, and also cure most of the others too, simply by
ensuring that the ONLY thing that could unbalance that zone would be
real GFP_DMA pressure. That way, you'd only eat the incredible cost
of balancing that zone when it really really had to be done.
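Something like this is the shape I have in mind (the constant and
the helper name are invented, just for illustration):

#define DMA_RESERVE_PAGES ((1024 * 1024) >> PAGE_SHIFT)	/* ~1MB; scaling by ramsize would be better */

/* Refuse fallback allocations once ZONE_DMA is down to a reserve
 * kept back for genuine GFP_DMA requests. */
static inline int dma_zone_off_limits(zone_t *z, unsigned int gfp_mask)
{
	if (z - z->zone_pgdat->node_zones != ZONE_DMA)
		return 0;	/* not the dma zone, no restriction */
	if (gfp_mask & __GFP_DMA)
		return 0;	/* a real GFP_DMA request may dig in */
	return z->free_pages <= DMA_RESERVE_PAGES;	/* fallback: hands off */
}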
> I think Marcelo is proposing to make ZONE_DMA exclusive in large
> memory machines, which might make it better for allocators
> needing ZONE_DMA pages...
That is bound to save a lot of time on HIGHMEM boxes. No box with
>= a gig of ram would even miss (a lousy;) 16MB. The very small bit of
extra balancing of other zones would easily be paid for by the reduced
search time (supposition). You'd certainly be doing large tasks a big
favor by refusing to give them ZONE_DMA pages on a fallback allocation.
I'd almost bet money (will bet a bogobeer:) that disabling fallback to
ZONE_DMA entirely on Dirk's box will make his troubles instantly gone.
Not that we don't need per zone balancing anyway mind you.. it's just
the tiny zone case that is an absolute guaranteed performance killer.
Comments?
-Mike
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-18 8:54 ` Mike Galbraith
@ 2001-07-18 10:18 ` Stephen C. Tweedie
2001-07-18 14:51 ` Mike Galbraith
2001-07-18 15:07 ` Dave McCracken
1 sibling, 1 reply; 21+ messages in thread
From: Stephen C. Tweedie @ 2001-07-18 10:18 UTC (permalink / raw)
To: Mike Galbraith
Cc: Bulent Abali, Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel,
Dirk Wetter, linux-mm
Hi,
On Wed, Jul 18, 2001 at 10:54:52AM +0200, Mike Galbraith wrote:
> Much worse is the case of Dirk's two 2gig simulations on a dual cpu
> 4gig box. It is guaranteed to allocate all of the dma zone to his two
> tasks' vm. It is also guaranteed to unbalance the zone. Doesn't that
> also guarantee that we will walk pagetables endlessly each and every
> time a ZONE_DMA page is issued? Try to write out a swapcache page,
> and you might get a dma page, try to do _anything_ and you might get
> a ZONE_DMA page. With per zone balancing, you will turn these pages
> over much much faster than before, and the aging will be unfair in
> the extreme.. it absolutely has to be. SCT's suggestion would make
> the total pressure equal, but it would not (could not) correct the
> problem of searching for this handful of pages, the very serious cost
> of owning a ZONE_DMA page, nor the problem of a specific request for
> GFP_DMA pages having a reasonable chance of succeeding.
The round-robin scheme would result in 16MB worth of allocations per
2GB of requests coming from ZONE_DMA. That's one in 128.
But the impact of this on memory pressure depends on how we distribute
PAGES_HIGH/PAGES_LOW allocation requests, and at what point we wake up
the reclaim code. If we have 50 pages between PAGES_HIGH and
PAGES_LOW for the DMA zone, and the reclaim target is PAGES_HIGH, then we
won't start the reclaim until (50*128) allocations have been done
since the last one --- that's 25MB of allocations. Plus, the reclaim
that does kick off won't be scanning the VM for just one DMA page, it
will keep scanning for a bunch of them, so it's just one VM pass for
that whole 25MB of allocation. At the very least, this helps spread
the load.
But on top of that, we need to make a distinction between directed
reclaim and opportunistic reclaim. What I mean by that is this: we
need not force the memory reclaim logic to try to balance the DMA
zone unnecessarily; hence we should not force it to scavenge in DMA if
the (free+inactive) count is above pages_low. However, we *should*
still age DMA pages if we happen to be triggering the aging loop
anyway. If pressure on the NORMAL zone triggers aging, then sure, we
can top up the DMA zone. So, for page aging, if the memory pressure
is balanced, the aging should not have to do a specific pass over the
DMA zone at all --- the aging done in response to pressure elsewhere
ought to have a proportionate impact on the DMA zone's inactive queue
too.
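In pseudo-C, the split I have in mind (helper names invented, only
to pin the distinction down):

/* Directed: scavenge a zone on purpose only when it is genuinely
 * short of free + inactive pages. */
static int want_directed_reclaim(zone_t *z)
{
	return z->free_pages + z->inactive_clean_pages < z->pages_low;
}

/* Opportunistic: we are walking the lists anyway because some other
 * zone is short, so age this page whatever its zone -- balanced
 * pressure then tops up the small zones for free. */
static void opportunistic_age(struct page *page)
{
	if (PageTestandClearReferenced(page))
		age_page_up_nolock(page);
	else
		age_page_down_ageonly(page);
}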
> IMHO, it is really really bad that any fallback allocation can also
> bring the dma zone into critical, and these allocations may end up in
> kernel structures which are invisible to the balancing logic
That is the crux of the problem. ext3's journaling, XFS with deferred
allocation, and other advanced filesystem activities will just make
this worse, because even normal file data may suddenly become
"pinned". With transactions, you can't write out dirty data until any
pending syscalls which are still operating on that transaction
complete. With deferred block allocation, you can't write out dirty
data without first doing extra filesystem operations to assign disk
blocks for them.
It's not really possible to account this stuff 100% right now ---
mlock() is just too hard to deal with in the absence of rmaps (because
when you munlock() a page, it's currently too expensive to check
whether any other tasks still have a lock on the page.) A separate
lock_count on the page would help here --- that would allow
the filesystem, the VM and any other components of the system to
register temporary or permanent pins on a page for balancing purposes.
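The shape of it might be (entirely hypothetical fields and helpers,
nothing like this exists in 2.4):

static inline void page_pin(struct page *page)
{
	/* count the page as pinned on the 0 -> 1 transition only */
	if (atomic_inc_return(&page->lock_count) == 1)	/* hypothetical field */
		atomic_inc(&page->zone->pinned_pages);	/* visible to the balancer */
}

static inline void page_unpin(struct page *page)
{
	if (atomic_dec_and_test(&page->lock_count))
		atomic_dec(&page->zone->pinned_pages);	/* reclaimable again */
}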
*If* you can account for pinned pages, much of the current trouble
disappears --- you can even do 4MB allocations effectively if you have
the ability to restrict permanently pinned pages to certain zones, and
to force temporary pins out of memory when you need to cleanse a
particular 4MB region for a large allocation.
But for now we have absolutely no such accounting, so if you combine
Reiserfs, ext3 and XFS all on one box, all of them doing their own
half-hearted attempts to avoid flooding memory with pinned pages but
none of them communicating with each other, we can most certainly
deadlock the system.
Cheers,
Stephen
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-18 10:18 ` Stephen C. Tweedie
@ 2001-07-18 14:51 ` Mike Galbraith
0 siblings, 0 replies; 21+ messages in thread
From: Mike Galbraith @ 2001-07-18 14:51 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Bulent Abali, Marcelo Tosatti, Rik van Riel, Dirk Wetter, linux-mm
On Wed, 18 Jul 2001, Stephen C. Tweedie wrote:
> Hi,
Greetings,
> On Wed, Jul 18, 2001 at 10:54:52AM +0200, Mike Galbraith wrote:
>
> > Much worse is the case of Dirk's two 2gig simulations on a dual cpu
> > 4gig box. It is guaranteed to allocate all of the dma zone to his two
> > tasks' vm. It is also guaranteed to unbalance the zone. Doesn't that
> > also guarantee that we will walk pagetables endlessly each and every
> > time a ZONE_DMA page is issued? Try to write out a swapcache page,
> > and you might get a dma page, try to do _anything_ and you might get
> > a ZONE_DMA page. With per zone balancing, you will turn these pages
> > over much much faster than before, and the aging will be unfair in
> > the extreme.. it absolutely has to be. SCT's suggestion would make
> > the total pressure equal, but it would not (could not) correct the
> > problem of searching for this handful of pages, the very serious cost
> > of owning a ZONE_DMA page, nor the problem of a specific request for
> > GFP_DMA pages having a reasonable chance of succeeding.
>
> The round-robin scheme would result in 16MB worth of allocations per
> 2GB of requests coming from ZONE_DMA. That's one in 128.
>
> But the impact of this on memory pressure depends on how we distribute
> PAGES_HIGH/PAGES_LOW allocation requests, and at what point we wake up
> the reclaim code. If we have 50 pages between PAGES_HIGH and
> PAGES_LOW for the DMA zone, and the reclaim target is PAGES_HIGH, then we
> won't start the reclaim until (50*128) allocations have been done
> since the last one --- that's 25MB of allocations. Plus, the reclaim
> that does kick off won't be scanning the VM for just one DMA page, it
> will keep scanning for a bunch of them, so it's just one VM pass for
> that whole 25MB of allocation. At the very least, this helps spread
> the load.
Yes it would. It would also distribute those pages much wider (+-?).
With 4gig of simulation running/swapping though, 25MB of allocation
could come around very quickly. I don't know exactly how expensive it
is to search that much space, but it's definitely cheaper to not have
to bother most of the time. In the simulation case, that would be all
of the time.
If fallback allocations couldn't get at enough to trigger a zone specific
inactive/free shortage, normal global aging/laundering could handle it
exactly as before for free. If some DMA pages get picked up (demand is
generally being serviced well enough or we'd see more gripes) cool, if
not, it's no big deal until a zone specific demand comes along. Instead
of eating the cost repeatedly just to make the bean counter happy, it'd
be deferred until genuine demand for these specific beans hits.
> But on top of that, we need to make a distinction between directed
> reclaim and opportunistic reclaim. What I mean by that is this: we
> need not force the memory reclaim logic to try to balance the DMA
> zone unnecessarily; hence we should not force it to scavenge in DMA if
> the (free+inactive) count is above pages_low. However, we *should*
> still age DMA pages if we happen to be triggering the aging loop
> anyway. If pressure on the NORMAL zone triggers aging, then sure, we
> can top up the DMA zone. So, for page aging, if the memory pressure
> is balanced, the aging should not have to do a specific pass over the
> DMA zone at all --- the aging done in response to pressure elsewhere
> ought to have a proportionate impact on the DMA zone's inactive queue
> too.
Ah yes. That could reduce the cost a lot. But not as much as ignoring
it until real demand happens. In both cases, when that happens, a full
blown effort is needed. I'm not hung up on my workaround suggestion by
any means though.. if there's a cleaner way to deal with the problems,
I'm all for it.
The really nasty problem that my suggestion helps with is DMA pages going
down an invisible rathole. If that happens, the cost is high indeed because
the invested effort can fail to produce any result.. possibly forever.
> > IMHO, it is really really bad that any fallback allocation can also
> > bring the dma zone into critical, and these allocations may end up in
> > kernel structures which are invisible to the balancing logic
>
> That is the crux of the problem. ext3's journaling, XFS with deferred
Fully agree with that. (don't have a clue what an rmap even is, so I'll
just shut up and listen now:)
-Mike
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-18 8:54 ` Mike Galbraith
2001-07-18 10:18 ` Stephen C. Tweedie
@ 2001-07-18 15:07 ` Dave McCracken
2001-07-18 16:09 ` Rik van Riel
1 sibling, 1 reply; 21+ messages in thread
From: Dave McCracken @ 2001-07-18 15:07 UTC (permalink / raw)
To: linux-mm
--On Wednesday, July 18, 2001 10:54:52 +0200 Mike Galbraith
<mikeg@wen-online.de> wrote:
> Possible solution:
>
> Effectively reserving the last ~meg (pick a number, scaled by ramsize
> would be better) of ZONE_DMA for real GFP_DMA allocations would cure
> Dirk's problem I bet, and also cure most of the others too, simply by
> ensuring that the ONLY thing that could unbalance that zone would be
> real GFP_DMA pressure. That way, you'd only eat the incredible cost
> of balancing that zone when it really really had to be done.
Couldn't something similar to this be accomplished by tweaking the
pages_{min,low,high} values for ZONE_DMA based on the total memory in the
machine? It seems to me if you have a large memory machine it'd be simple
enough to set at least pages_high (and perhaps pages_low?) to a larger
value. If we do this, won't it keep the DMA zone from triggering memory
pressure as much?
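One way the tweak might look (the threshold and factor here are
pulled out of thin air, purely for illustration):

static void __init inflate_dma_watermarks(pg_data_t *pgdat)
{
	zone_t *dma = pgdat->node_zones + ZONE_DMA;

	/* on a big-memory box, raise the DMA watermarks so ordinary
	 * fallback allocations back off the zone much earlier */
	if (num_physpages > ((1024UL << 20) >> PAGE_SHIFT)) {	/* > 1GB of ram */
		dma->pages_min *= 4;
		dma->pages_low *= 4;
		dma->pages_high *= 4;
	}
}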
Dave McCracken
======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
dmc@austin.ibm.com T/L 678-3059
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-18 15:07 ` Dave McCracken
@ 2001-07-18 16:09 ` Rik van Riel
0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2001-07-18 16:09 UTC (permalink / raw)
To: Dave McCracken; +Cc: linux-mm, linux-kernel
On Wed, 18 Jul 2001, Dave McCracken wrote:
> --On Wednesday, July 18, 2001 10:54:52 +0200 Mike Galbraith
> <mikeg@wen-online.de> wrote:
>
> > Possible solution:
> >
> > Effectively reserving the last ~meg (pick a number, scaled by ramsize
> > would be better) of ZONE_DMA for real GFP_DMA allocations would cure
> > Dirk's problem I bet, and also cure most of the others too, simply by
>
> Couldn't something similar to this be accomplished by tweaking the
> pages_{min,low,high} values for ZONE_DMA based on the total memory in the
> machine?
I bet we can do this in a much simpler way with less
reliance on magic numbers. My theory goes as follows:
The problem with the current code is that the global
free target (freepages.high) is the same as the sum
of the per-zone free targets.
Because of this, we will always run into the local
free shortages and the VM has to eat free pages from
all zones and has no chance to properly balance usage
between the zones depending on VM activity in the
zone and desirability of allocating from this zone.
We could try increasing the _global_ free target to
something like 2 or 3 times the sum of the per-zone
free targets.
By doing that the system would have a much better
chance of leaving eg. the DMA zone alone for allocations
because kswapd doesn't just free the amount of pages
required to bring each zone to the edge, it would free
a whole bunch more pages, to whatever zone they happen
to be in. That way the VM would do the bulk of the
allocations from the least loaded zone and leave the
DMA zone (at the end of the fallback chain) alone.
I'm not sure if this would work, but just increasing
the global free target to something significantly
higher than the sum of the per-zone free targets
is an easy to test change ;)
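The experiment, roughly (freepages.high and pages_high are the real
2.4 knobs; the factor of 3 is a guess):

static void raise_global_free_target(void)
{
	pg_data_t *pgdat = pgdat_list;
	int zone_sum = 0, type;

	for (type = 0; type < MAX_NR_ZONES; type++)
		zone_sum += pgdat->node_zones[type].pages_high;

	/* make the global target clearly exceed the sum of the
	 * per-zone targets, so kswapd frees a bunch of extra pages
	 * from whatever zone is cheapest instead of squeezing every
	 * zone right to the edge */
	if (freepages.high < 3 * zone_sum)
		freepages.high = 3 * zone_sum;
}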
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* [PATCH] Separate global/perzone inactive/free shortage
@ 2001-07-14 5:19 Marcelo Tosatti
2001-07-14 7:11 ` Marcelo Tosatti
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2001-07-14 5:19 UTC (permalink / raw)
To: lkml
Cc: Rik van Riel, Dirk Wetter, Mike Galbraith, linux-mm, Stephen C. Tweedie
Hi,
As is well known, the VM does not make a distinction between global and
per-zone shortages when trying to free memory. That means if only a given
memory zone is under shortage, the kernel will scan pages from all zones.
The following patch (against 2.4.6-ac2), changes the kernel behaviour to
avoid freeing pages from zones which do not have an inactive and/or
free shortage.
Now I'm able to run memory hogs allocating 4GB of memory (on 4GB machine)
without getting real long hangs on my ssh session. (which used to happen
on stock -ac2 due to exhaustion of DMA pages for networking).
Comments ?
Dirk, Can you please try the patch and tell us if it fixes your problem ?
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h Sat Jul 14 02:47:14 2001
+++ linux/include/linux/swap.h Sat Jul 14 03:27:13 2001
@@ -123,9 +123,14 @@
extern wait_queue_head_t kreclaimd_wait;
extern int page_launder(int, int);
extern int free_shortage(void);
+extern int total_free_shortage(void);
extern int inactive_shortage(void);
+extern int total_inactive_shortage(void);
extern void wakeup_kswapd(void);
extern int try_to_free_pages(unsigned int gfp_mask);
+
+extern unsigned int zone_free_shortage(zone_t *zone);
+extern unsigned int zone_inactive_shortage(zone_t *zone);
/* linux/mm/page_io.c */
extern void rw_swap_page(int, struct page *);
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/page_alloc.c linux/mm/page_alloc.c
--- linux.orig/mm/page_alloc.c Sat Jul 14 02:47:14 2001
+++ linux/mm/page_alloc.c Sat Jul 14 02:50:50 2001
@@ -451,7 +451,7 @@
* to give up than to deadlock the kernel looping here.
*/
if (gfp_mask & __GFP_WAIT) {
- if (!order || free_shortage()) {
+ if (!order || total_free_shortage()) {
int progress = try_to_free_pages(gfp_mask);
if (progress || (gfp_mask & __GFP_FS))
goto try_again;
@@ -689,6 +689,39 @@
return pages;
}
#endif
+
+unsigned int zone_free_shortage(zone_t *zone)
+{
+ int sum = 0;
+
+ if (!zone->size)
+ goto ret;
+
+ if (zone->inactive_clean_pages + zone->free_pages
+ < zone->pages_min) {
+ sum += zone->pages_min;
+ sum -= zone->free_pages;
+ sum -= zone->inactive_clean_pages;
+ }
+ret:
+ return sum;
+}
+
+unsigned int zone_inactive_shortage(zone_t *zone)
+{
+ int sum = 0;
+
+ if (!zone->size)
+ goto ret;
+
+ sum = zone->pages_high;
+ sum -= zone->inactive_dirty_pages;
+ sum -= zone->inactive_clean_pages;
+ sum -= zone->free_pages;
+
+ret:
+ return (sum > 0 ? sum : 0);
+}
/*
* Show free area list (used inside shift_scroll-lock stuff)
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c Sat Jul 14 02:47:14 2001
+++ linux/mm/vmscan.c Sat Jul 14 03:22:19 2001
@@ -36,11 +36,19 @@
*/
/* mm->page_table_lock is held. mmap_sem is not held */
-static void try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, struct page *page)
+static void try_to_swap_out(zone_t *zone, struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, struct page *page)
{
pte_t pte;
swp_entry_t entry;
+ /*
+ * If we are doing a zone-specific scan, do not
+ * touch pages from zones which don't have a
+ * shortage.
+ */
+ if (zone && !zone_inactive_shortage(page->zone))
+ return;
+
/* Don't look at this pte if it's been accessed recently. */
if (ptep_test_and_clear_young(page_table)) {
page->age += PAGE_AGE_ADV;
@@ -131,7 +139,7 @@
}
/* mm->page_table_lock is held. mmap_sem is not held */
-static int swap_out_pmd(struct mm_struct * mm, struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long end, int count)
+static int swap_out_pmd(zone_t *zone, struct mm_struct * mm, struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long end, int count)
{
pte_t * pte;
unsigned long pmd_end;
@@ -155,7 +163,7 @@
struct page *page = pte_page(*pte);
if (VALID_PAGE(page) && !PageReserved(page)) {
- try_to_swap_out(mm, vma, address, pte, page);
+ try_to_swap_out(zone, mm, vma, address, pte, page);
if (!--count)
break;
}
@@ -168,7 +176,7 @@
}
/* mm->page_table_lock is held. mmap_sem is not held */
-static inline int swap_out_pgd(struct mm_struct * mm, struct vm_area_struct * vma, pgd_t *dir, unsigned long address, unsigned long end, int count)
+static inline int swap_out_pgd(zone_t *zone, struct mm_struct * mm, struct vm_area_struct * vma, pgd_t *dir, unsigned long address, unsigned long end, int count)
{
pmd_t * pmd;
unsigned long pgd_end;
@@ -188,7 +196,7 @@
end = pgd_end;
do {
- count = swap_out_pmd(mm, vma, pmd, address, end, count);
+ count = swap_out_pmd(zone, mm, vma, pmd, address, end, count);
if (!count)
break;
address = (address + PMD_SIZE) & PMD_MASK;
@@ -198,7 +206,7 @@
}
/* mm->page_table_lock is held. mmap_sem is not held */
-static int swap_out_vma(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, int count)
+static int swap_out_vma(zone_t *zone, struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, int count)
{
pgd_t *pgdir;
unsigned long end;
@@ -213,7 +221,7 @@
if (address >= end)
BUG();
do {
- count = swap_out_pgd(mm, vma, pgdir, address, end, count);
+ count = swap_out_pgd(zone, mm, vma, pgdir, address, end, count);
if (!count)
break;
address = (address + PGDIR_SIZE) & PGDIR_MASK;
@@ -225,7 +233,7 @@
/*
* Returns non-zero if we scanned all `count' pages
*/
-static int swap_out_mm(struct mm_struct * mm, int count)
+static int swap_out_mm(zone_t *zone, struct mm_struct * mm, int count)
{
unsigned long address;
struct vm_area_struct* vma;
@@ -248,7 +256,7 @@
address = vma->vm_start;
for (;;) {
- count = swap_out_vma(mm, vma, address, count);
+ count = swap_out_vma(zone, mm, vma, address, count);
if (!count)
goto out_unlock;
vma = vma->vm_next;
@@ -280,7 +288,7 @@
return nr;
}
-static void swap_out(unsigned int priority, int gfp_mask)
+static void swap_out(zone_t *zone, unsigned int priority, int gfp_mask)
{
int counter;
int retval = 0;
@@ -288,7 +296,7 @@
/* Always start by trying to penalize the process that is allocating memory */
if (mm)
- retval = swap_out_mm(mm, swap_amount(mm));
+ retval = swap_out_mm(zone, mm, swap_amount(mm));
/* Then, look at the other mm's */
counter = (mmlist_nr << SWAP_MM_SHIFT) >> priority;
@@ -310,7 +318,7 @@
spin_unlock(&mmlist_lock);
/* Walk about 6% of the address space each time */
- retval |= swap_out_mm(mm, swap_amount(mm));
+ retval |= swap_out_mm(zone, mm, swap_amount(mm));
mmput(mm);
} while (--counter >= 0);
return;
@@ -426,7 +434,7 @@
#define MAX_LAUNDER (4 * (1 << page_cluster))
#define CAN_DO_FS (gfp_mask & __GFP_FS)
#define CAN_DO_IO (gfp_mask & __GFP_IO)
-int page_launder(int gfp_mask, int sync)
+int do_page_launder(zone_t *zone, int gfp_mask, int sync)
{
int launder_loop, maxscan, cleaned_pages, maxlaunder;
struct list_head * page_lru;
@@ -461,6 +469,17 @@
continue;
}
+ /*
+ * If we are doing zone-specific laundering,
+ * avoid touching pages from zones which do
+ * not have a free shortage.
+ */
+ if (zone && !zone_free_shortage(page->zone)) {
+ list_del(page_lru);
+ list_add(page_lru, &inactive_dirty_list);
+ continue;
+ }
+
/*
* The page is locked. IO in progress?
* Move it to the back of the list.
@@ -574,8 +593,13 @@
* If we're freeing buffer cache pages, stop when
* we've got enough free memory.
*/
- if (freed_page && !free_shortage())
- break;
+ if (freed_page) {
+ if (zone) {
+ if (!zone_free_shortage(zone))
+ break;
+ } else if (free_shortage())
+ break;
+ }
continue;
} else if (page->mapping && !PageDirty(page)) {
/*
@@ -613,7 +637,8 @@
* loads, flush out the dirty pages before we have to wait on
* IO.
*/
- if (CAN_DO_IO && !launder_loop && free_shortage()) {
+ if (CAN_DO_IO && !launder_loop && (free_shortage()
+ || (zone && zone_free_shortage(zone)))) {
launder_loop = 1;
/* If we cleaned pages, never do synchronous IO. */
if (cleaned_pages)
@@ -629,6 +654,34 @@
return cleaned_pages;
}
+int page_launder(int gfp_mask, int sync)
+{
+ int type = 0;
+ int ret = 0;
+ pg_data_t *pgdat = pgdat_list;
+ /*
+ * First do a global scan if there is a
+ * global shortage.
+ */
+ if (free_shortage())
+ ret += do_page_launder(NULL, gfp_mask, sync);
+
+ /*
+ * Then check if there is any specific zone
+ * needs laundering.
+ */
+ for (type = 0; type < MAX_NR_ZONES; type++) {
+ zone_t *zone = pgdat->node_zones + type;
+
+ if (zone_free_shortage(zone))
+ ret += do_page_launder(zone, gfp_mask, sync);
+ }
+
+ return ret;
+}
+
+
+
/**
* refill_inactive_scan - scan the active list and find pages to deactivate
* @priority: the priority at which to scan
@@ -637,7 +690,7 @@
* This function will scan a portion of the active list to find
* unused pages, those pages will then be moved to the inactive list.
*/
-int refill_inactive_scan(unsigned int priority, int target)
+int refill_inactive_scan(zone_t *zone, unsigned int priority, int target)
{
struct list_head * page_lru;
struct page * page;
@@ -665,6 +718,16 @@
continue;
}
+ /*
+ * If we are doing zone-specific scanning, ignore
+ * pages from zones without shortage.
+ */
+
+ if (zone && !zone_inactive_shortage(page->zone)) {
+ page_active = 1;
+ goto skip_page;
+ }
+
/* Do aging on the pages. */
if (PageTestandClearReferenced(page)) {
age_page_up_nolock(page);
@@ -694,6 +757,7 @@
* to the other end of the list. Otherwise we exit if
* we have done enough work.
*/
+skip_page:
if (page_active || PageActive(page)) {
list_del(page_lru);
list_add(page_lru, &active_list);
@@ -709,12 +773,10 @@
}
/*
- * Check if there are zones with a severe shortage of free pages,
- * or if all zones have a minor shortage.
+ * Check if we are low on free pages globally.
*/
int free_shortage(void)
{
- pg_data_t *pgdat = pgdat_list;
int sum = 0;
int freeable = nr_free_pages() + nr_inactive_clean_pages();
int freetarget = freepages.high;
@@ -722,6 +784,22 @@
/* Are we low on free pages globally? */
if (freeable < freetarget)
return freetarget - freeable;
+ return 0;
+}
+
+/*
+ *
+ * Check if there are zones with a severe shortage of free pages,
+ * or if all zones have a minor shortage.
+ */
+int total_free_shortage(void)
+{
+ int sum = 0;
+ pg_data_t *pgdat = pgdat_list;
+
+ /* Do we have a global free shortage? */
+ if((sum = free_shortage()))
+ return sum;
/* If not, are we very low on any particular zone? */
do {
@@ -739,15 +817,15 @@
} while (pgdat);
return sum;
+
}
/*
- * How many inactive pages are we short?
+ * How many inactive pages are we short globally?
*/
int inactive_shortage(void)
{
int shortage = 0;
- pg_data_t *pgdat = pgdat_list;
/* Is the inactive dirty list too small? */
@@ -759,10 +837,20 @@
if (shortage > 0)
return shortage;
+ return 0;
+}
+/*
+ * Are we low on inactive pages globally or in any zone?
+ */
+int total_inactive_shortage(void)
+{
+ int shortage = 0;
+ pg_data_t *pgdat = pgdat_list;
- /* If not, do we have enough per-zone pages on the inactive list? */
+ if((shortage = inactive_shortage()))
+ return shortage;
- shortage = 0;
+ shortage = 0;
do {
int i;
@@ -802,7 +890,7 @@
* when called from a user process.
*/
#define DEF_PRIORITY (6)
-static int refill_inactive(unsigned int gfp_mask, int user)
+static int refill_inactive_global(unsigned int gfp_mask, int user)
{
int count, start_count, maxtry;
@@ -824,9 +912,9 @@
}
/* Walk the VM space for a bit.. */
- swap_out(DEF_PRIORITY, gfp_mask);
+ swap_out(NULL, DEF_PRIORITY, gfp_mask);
- count -= refill_inactive_scan(DEF_PRIORITY, count);
+ count -= refill_inactive_scan(NULL, DEF_PRIORITY, count);
if (count <= 0)
goto done;
@@ -839,6 +927,60 @@
return (count < start_count);
}
+static int refill_inactive_zone(zone_t *zone, unsigned int gfp_mask, int user)
+{
+ int count, start_count, maxtry;
+
+ count = start_count = zone_inactive_shortage(zone);
+
+ maxtry = (1 << DEF_PRIORITY);
+
+ do {
+ swap_out(zone, DEF_PRIORITY, gfp_mask);
+
+ count -= refill_inactive_scan(zone, DEF_PRIORITY, count);
+
+ if (count <= 0)
+ goto done;
+
+ if (--maxtry <= 0)
+ return 0;
+
+ } while(zone_inactive_shortage(zone));
+done:
+ return (count < start_count);
+}
+
+
+static int refill_inactive(unsigned int gfp_mask, int user)
+{
+ int type = 0;
+ int ret = 0;
+ pg_data_t *pgdat = pgdat_list;
+ /*
+ * First do a global scan if there is a
+ * global shortage.
+ */
+ if (inactive_shortage())
+ ret += refill_inactive_global(gfp_mask, user);
+
+ /*
+ * Then check if there is any specific zone
+ * with a shortage and try to refill it if
+ * so.
+ */
+ for (type = 0; type < MAX_NR_ZONES; type++) {
+ zone_t *zone = pgdat->node_zones + type;
+
+ if (zone_inactive_shortage(zone))
+ ret += refill_inactive_zone(zone, gfp_mask, user);
+ }
+
+ return ret;
+}
+
+#define DEF_PRIORITY (6)
+
static int do_try_to_free_pages(unsigned int gfp_mask, int user)
{
int ret = 0;
@@ -851,8 +993,10 @@
* before we get around to moving them to the other
* list, so this is a relatively cheap operation.
*/
- if (free_shortage()) {
- ret += page_launder(gfp_mask, user);
+
+ ret += page_launder(gfp_mask, user);
+
+ if (total_free_shortage()) {
shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
shrink_icache_memory(DEF_PRIORITY, gfp_mask);
}
@@ -861,8 +1005,7 @@
* If needed, we move pages from the active list
* to the inactive list.
*/
- if (inactive_shortage())
- ret += refill_inactive(gfp_mask, user);
+ ret += refill_inactive(gfp_mask, user);
/*
* Reclaim unused slab cache if memory is low.
@@ -917,7 +1060,7 @@
static long recalc = 0;
/* If needed, try to free some memory. */
- if (inactive_shortage() || free_shortage())
+ if (total_inactive_shortage() || total_free_shortage())
do_try_to_free_pages(GFP_KSWAPD, 0);
/* Once a second ... */
@@ -928,7 +1071,7 @@
recalculate_vm_stats();
/* Do background page aging. */
- refill_inactive_scan(DEF_PRIORITY, 0);
+ refill_inactive_scan(NULL, DEF_PRIORITY, 0);
}
run_task_queue(&tq_disk);
@@ -944,7 +1087,7 @@
* We go to sleep for one second, but if it's needed
* we'll be woken up earlier...
*/
- if (!free_shortage() || !inactive_shortage()) {
+ if (!total_free_shortage() || !total_inactive_shortage()) {
interruptible_sleep_on_timeout(&kswapd_wait, HZ);
/*
* If we couldn't free enough memory, we see if it was
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-14 5:19 Marcelo Tosatti
@ 2001-07-14 7:11 ` Marcelo Tosatti
2001-07-14 20:13 ` Dirk
` (2 subsequent siblings)
3 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2001-07-14 7:11 UTC (permalink / raw)
To: lkml
Cc: Rik van Riel, Dirk Wetter, Mike Galbraith, linux-mm, Stephen C. Tweedie
There is a silly typo in the patch.
On Sat, 14 Jul 2001, Marcelo Tosatti wrote:
>
> Hi,
>
> As is well known, the VM does not make a distinction between global and
> per-zone shortages when trying to free memory. That means if only a given
> memory zone is under shortage, the kernel will scan pages from all zones.
>
> The following patch (against 2.4.6-ac2), changes the kernel behaviour to
> avoid freeing pages from zones which do not have an inactive and/or
> free shortage.
>
> Now I'm able to run memory hogs allocating 4GB of memory (on 4GB machine)
> without getting real long hangs on my ssh session. (which used to happen
> on stock -ac2 due to exhaustion of DMA pages for networking).
>
> Comments ?
>
> Dirk, Can you please try the patch and tell us if it fixes your problem ?
>
>
mm/vmscan.c diff
> @@ -574,8 +593,13 @@
> * If we're freeing buffer cache pages, stop when
> * we've got enough free memory.
> */
> - if (freed_page && !free_shortage())
> - break;
> + if (freed_page) {
> + if (zone) {
> + if (!zone_free_shortage(zone))
> + break;
> + } else if (free_shortage())
^^^^^^^^ ^^^^^^
Should be
} else if (!free_shortage())
> + break;
> + }
> continue;
Well, updated patch at
http://bazar.conectiva.com.br/~marcelo/patches/v2.4/2.4.6ac2/zoned.patch
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-14 5:19 Marcelo Tosatti
2001-07-14 7:11 ` Marcelo Tosatti
@ 2001-07-14 20:13 ` Dirk
[not found] ` <Pine.LNX.4.33.0107141023440.283-100000@mikeg.weiden.de>
2001-07-16 15:51 ` Kanoj Sarcar
3 siblings, 0 replies; 21+ messages in thread
From: Dirk @ 2001-07-14 20:13 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: lkml, Rik van Riel, Mike Galbraith, linux-mm, Stephen C. Tweedie
Marcelo Tosatti wrote:
> Hi,
>
> As is well known, the VM does not make a distinction between global and
> per-zone shortages when trying to free memory. That means if only a given
> memory zone is under shortage, the kernel will scan pages from all zones.
>
> The following patch (against 2.4.6-ac2), changes the kernel behaviour to
> avoid freeing pages from zones which do not have an inactive and/or
> free shortage.
>
> Now I'm able to run memory hogs allocating 4GB of memory (on 4GB machine)
> without getting real long hangs on my ssh session. (which used to happen
> on stock -ac2 due to exhaustion of DMA pages for networking).
>
> Comments ?
>
> Dirk, Can you please try the patch and tell us if it fixes your problem ?
>
great!! that is definitely better, the machine talks to me again. there are
some small "but"s, however. i'll write them up and let you know.
~dirkw
[parent not found: <Pine.LNX.4.33.0107141023440.283-100000@mikeg.weiden.de>]
* Re: [PATCH] Separate global/perzone inactive/free shortage
[not found] ` <Pine.LNX.4.33.0107141023440.283-100000@mikeg.weiden.de>
@ 2001-07-16 13:19 ` Stephen C. Tweedie
2001-07-16 15:44 ` Mike Galbraith
2001-07-16 18:42 ` Dirk Wetter
0 siblings, 2 replies; 21+ messages in thread
From: Stephen C. Tweedie @ 2001-07-16 13:19 UTC (permalink / raw)
To: Mike Galbraith
Cc: Marcelo Tosatti, Rik van Riel, Dirk Wetter, Stephen C. Tweedie, linux-mm
Hi,
> On Sat, 14 Jul 2001, Marcelo Tosatti wrote:
> On highmem machines, wouldn't it save a LOT of time to prevent allocation
> of ZONE_DMA as VM pages? Or, if we really need to, get those pages into
> the swapcache instantly? Crawling through nearly 4 gig of VM looking for
> 16 MB of ram has got to be very expensive. Besides, those pages are just
> too precious to allow some user task to sit on them.
Can't we balance that automatically?
Why not just round-robin between the eligible zones when allocating,
biasing each zone based on size? On a 4GB box you'd basically end up
doing 3 times as many allocations from the highmem zone as the normal
zone and only very occasionally would you try to dig into the dma
zone. But on a 32MB box you would automatically spread allocations
50/50 between normal and dma, and on a 20MB box you would be biased in
favour of allocating dma pages.
Cheers,
Stephen
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 13:19 ` Stephen C. Tweedie
@ 2001-07-16 15:44 ` Mike Galbraith
2001-07-16 18:30 ` Stephen C. Tweedie
2001-07-16 18:42 ` Dirk Wetter
1 sibling, 1 reply; 21+ messages in thread
From: Mike Galbraith @ 2001-07-16 15:44 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Marcelo Tosatti, Rik van Riel, Dirk Wetter, linux-mm
On Mon, 16 Jul 2001, Stephen C. Tweedie wrote:
> Hi,
>
> > On Sat, 14 Jul 2001, Marcelo Tosatti wrote:
>
> > On highmem machines, wouldn't it save a LOT of time to prevent allocation
> > of ZONE_DMA as VM pages? Or, if we really need to, get those pages into
> > the swapcache instantly? Crawling through nearly 4 gig of VM looking for
> > 16 MB of ram has got to be very expensive. Besides, those pages are just
> > too precious to allow some user task to sit on them.
>
> Can't we balance that automatically?
>
> Why not just round-robin between the eligible zones when allocating,
> biasing each zone based on size? On a 4GB box you'd basically end up
> doing 3 times as many allocations from the highmem zone as the normal
> zone and only very occasionally would you try to dig into the dma
> zone. But on a 32MB box you would automatically spread allocations
> 50/50 between normal and dma, and on a 20MB box you would be biased in
> favour of allocating dma pages.
Parceling them out biased according to size would distribute pressure
equally.. except on task vm.. I think.
What prevents this from happening, and let's make ZONE_DINKY _really_
dinky just for the sake of argument. ZONE_DINKY will have say 4 pages,
one for active, dirty, clean and free. Balanced is 2 dirty and 2 free,
or 1 free, 1 clean and 1 dirty. 2 tasks are running, and both are giant
economy size, with very nearly 2gig of vm allocated each.
ZONE_DINKY, ZONE_BIG, and ZONE_MONDO are all fully engaged and under
pressure. ZONE_DINKY gets aged/laundered such that it is in balance.
Task A is using 1 ZONE_DINKY page. Task B requests a page to do pagein,
and reclaims a page from ZONE_DINKY because there's only 1 free page.
We are back to inactive shortage instantly, so we have to walk 4gig of
vm looking for one ZONE_DINKY page to activate/age/deactivate. During
the aging process, any other in use page from that zone is fair game.
Merely possessing pages from a small zone implies a higher turnover rate,
and that has to be bad. In this made up case, it would be horrible.
ZONE_DINKY pages in mondo task's vm would shred them.
To kill the search overhead, you could flag areas with possession info,
but that won't stop the turnover differential problem when your resources
are all engaged.
Is there anything wrong with this logic? If not, it's just a matter
of scaling the problem to real life numbers.
(maybe I should stop thinking about vm.. makes me dizzy;)
-Mike
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 15:44 ` Mike Galbraith
@ 2001-07-16 18:30 ` Stephen C. Tweedie
2001-07-17 2:55 ` Mike Galbraith
0 siblings, 1 reply; 21+ messages in thread
From: Stephen C. Tweedie @ 2001-07-16 18:30 UTC (permalink / raw)
To: Mike Galbraith
Cc: Stephen C. Tweedie, Marcelo Tosatti, Rik van Riel, Dirk Wetter, linux-mm
Hi,
On Mon, Jul 16, 2001 at 05:44:17PM +0200, Mike Galbraith wrote:
> > Why not just round-robin between the eligible zones when allocating,
> > biasing each zone based on size?
> What prevents this from happening, and let's make ZONE_DINKY _really_
> dinky just for the sake of argument. ZONE_DINKY will have say 4 pages,
> one for active, dirty, clean and free. Balanced is 2 dirty and 2 free,
> or 1 free, 1 clean and 1 dirty. 2 tasks are running, and both are giant
> economy size, with very nearly 2gig of vm allocated each.
>
> ZONE_DINKY, ZONE_BIG, and ZONE_MONDO are all fully engaged and under
> pressure. ZONE_DINKY gets aged/laundered such that it is in balance.
> Task A is using 1 ZONE_DINKY page. Task B requests a page to do pagein,
> and reclaims a page from ZONE_DINKY because there's only 1 free page.
> We are back to inactive shortage instantly, so we have to walk 4gig of
> vm looking for one ZONE_DINKY page to activate/age/deactivate. During
> the aging process, any other in use page from that zone is fair game.
Agreed, but in that sort of case, if we have (say) close to 1GB in
ZONE_NORMAL and 16MB in ZONE_DMA, then only one allocation in 64 will
even _try_ to allocate from the DMA zone. Replace the DMA zone with a
hypothetical DINKY 4-page zone and it goes down to one allocation in
65536. You don't reduce the cost of a DINKY allocation, but you
reduce the chance that such an allocation will happen.
The balanced round-robin still seems like a helpful next step here
even if it doesn't cure all the balance problems immediately.
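Back-of-envelope, for the ratios above (illustrative numbers, sizes
in 4KB pages):

static unsigned long pick_odds(unsigned long zone_pages, unsigned long total_pages)
{
	/* size-weighted round-robin picks a zone with probability
	 * zone_pages / total_pages; return the "1 in N" figure */
	return total_pages / zone_pages;
}

/* pick_odds(4096, 262144 + 4096) == 65    -> ~1 in 64, 16MB vs 1GB */
/* pick_odds(4, 262144 + 4)       == 65537 -> the 4-page DINKY case */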
--Stephen
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 18:30 ` Stephen C. Tweedie
@ 2001-07-17 2:55 ` Mike Galbraith
0 siblings, 0 replies; 21+ messages in thread
From: Mike Galbraith @ 2001-07-17 2:55 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Marcelo Tosatti, Rik van Riel, Dirk Wetter, linux-mm
On Mon, 16 Jul 2001, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Jul 16, 2001 at 05:44:17PM +0200, Mike Galbraith wrote:
>
> > > Why not just round-robin between the eligible zones when allocating,
> > > biasing each zone based on size?
>
> > What prevents this from happening, and let's make ZONE_DINKY _really_
> > dinky just for the sake of argument. ZONE_DINKY will have say 4 pages,
> > one for active, dirty, clean and free. Balanced is 2 dirty and 2 free,
> > or 1 free, 1 clean and 1 dirty. 2 tasks are running, and both are giant
> > economy size, with very nearly 2gig of vm allocated each.
> >
> > ZONE_DINKY, ZONE_BIG, and ZONE_MONDO are all fully engaged and under
> > pressure. ZONE_DINKY gets aged/laundered such that it is in balance.
> > Task A is using 1 ZONE_DINKY page. Task B requests a page to do pagein,
> > and reclaims a page from ZONE_DINKY because there's only 1 free page.
> > We are back to inactive shortage instantly, so we have to walk 4gig of
> > vm looking for one ZONE_DINKY page to activate/age/deactivate. During
> > the aging process, any other in use page from that zone is fair game.
>
> Agreed, but in that sort of case, if we have (say) close to 1GB in
> ZONE_NORMAL and 16MB in ZONE_DMA, then only one allocation in 64 will
> even _try_ to allocate from the DMA zone. Replace the DMA zone with a
> hypothetical DINKY 4-page zone and it goes down to one allocation in
> 65536. You don't reduce the cost of a DINKY allocation, but you
> reduce the chance that such an allocation will happen.
>
> The balanced round-robin still seems like a helpful next step here
> even if it doesn't cure all the balance problems immediately.
Yes, this should mitigate the effect. I think something will still
end up having to be done about the search time though. Dirk's case
seems to be the pathological one.
-Mike
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 13:19 ` Stephen C. Tweedie
2001-07-16 15:44 ` Mike Galbraith
@ 2001-07-16 18:42 ` Dirk Wetter
1 sibling, 0 replies; 21+ messages in thread
From: Dirk Wetter @ 2001-07-16 18:42 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Mike Galbraith, Marcelo Tosatti, Rik van Riel, linux-mm
On Mon, 16 Jul 2001, Stephen C. Tweedie wrote:
> Hi,
>
> > On Sat, 14 Jul 2001, Marcelo Tosatti wrote:
>
> > On highmem machines, wouldn't it save a LOT of time to prevent allocation
> > of ZONE_DMA as VM pages? Or, if we really need to, get those pages into
> > the swapcache instantly? Crawling through nearly 4 gig of VM looking for
> > 16 MB of ram has got to be very expensive. Besides, those pages are just
> > too precious to allow some user task to sit on them.
>
> Can't we balance that automatically?
>
> Why not just round-robin between the eligible zones when allocating,
> biasing each zone based on size? On a 4GB box you'd basically end up
> doing 3 times as many allocations from the highmem zone as the normal
> zone and only very occasionally would you try to dig into the dma
> zone. But on a 32MB box you would automatically spread allocations
> 50/50 between normal and dma, and on a 20MB box you would be biased in
> favour of allocating dma pages.
how good would the one-size-fits-all approach be? certainly i would
like to have the best memory performance for my 4GB boxes, and so does
the guy with the 20MB or 32MB box. why not have yet another kernel
config option ;-) ?
cheers,
~dirkw
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-14 5:19 Marcelo Tosatti
` (2 preceding siblings ...)
[not found] ` <Pine.LNX.4.33.0107141023440.283-100000@mikeg.weiden.de>
@ 2001-07-16 15:51 ` Kanoj Sarcar
2001-07-16 19:00 ` Rik van Riel
2001-07-17 0:01 ` Marcelo Tosatti
3 siblings, 2 replies; 21+ messages in thread
From: Kanoj Sarcar @ 2001-07-16 15:51 UTC (permalink / raw)
To: Marcelo Tosatti, lkml
Cc: Rik van Riel, Dirk Wetter, Mike Galbraith, linux-mm, Stephen C. Tweedie
--- Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> Hi,
>
> As is well known, the VM does not make a distinction between global and
> per-zone shortages when trying to free memory. That means if only a given
> memory zone is under shortage, the kernel will scan pages from all zones.
>
> The following patch (against 2.4.6-ac2), changes the kernel behaviour to
> avoid freeing pages from zones which do not have an inactive and/or
> free shortage.
>
> Now I'm able to run memory hogs allocating 4GB of memory (on 4GB machine)
> without getting real long hangs on my ssh session. (which used to happen
> on stock -ac2 due to exhaustion of DMA pages for networking).
>
> Comments ?
>
> Dirk, Can you please try the patch and tell us if it fixes your problem ?
>
>
Just a quick note. A per-zone page reclamation
method like this was what I had advocated and sent
patches to Linus for in the 2.3.43 time frame or so.
I think later performance work ripped out that work.
I guess the problem is that a lot of the different
page reclamation schemes first of all do not know
how to reclaim pages for a specific zone, and secondly
have to go thru a lot of work before they discover the
page they are trying to reclaim does not belong to the
shortage zone, hence wasting a lot of work/cputime.
try_to_swap_out is a good example, which can be solved
by rmaps.
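A sketch of why rmaps help here (the structures are hypothetical,
2.4 has nothing like them):

/* With a reverse mapping, the reclaimer goes straight from a page
 * in the short zone to the ptes mapping it, instead of walking
 * every mm's page tables hoping to stumble across one. */
struct pte_chain {		/* hypothetical per-page rmap */
	pte_t *ptep;
	struct pte_chain *next;
};

static int try_to_unmap_page(struct page *page)
{
	struct pte_chain *pc;

	if (!zone_inactive_shortage(page->zone))
		return 0;	/* zone not short: no work wasted on it */

	for (pc = page->pte_chain; pc; pc = pc->next)	/* hypothetical field */
		unmap_one_pte(pc->ptep, page);		/* hypothetical helper */
	return 1;
}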
Kanoj
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 15:51 ` Kanoj Sarcar
@ 2001-07-16 19:00 ` Rik van Riel
2001-07-17 0:27 ` Marcelo Tosatti
2001-07-17 0:01 ` Marcelo Tosatti
1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2001-07-16 19:00 UTC (permalink / raw)
To: Kanoj Sarcar
Cc: Marcelo Tosatti, lkml, Dirk Wetter, Mike Galbraith, linux-mm,
Stephen C. Tweedie
On Mon, 16 Jul 2001, Kanoj Sarcar wrote:
> Just a quick note. A per-zone page reclamation
> method like this was what I had advocated and sent
> patches to Linus for in the 2.3.43 time frame or so.
> I think later performance work ripped out that work.
Yes: the system ended up swapping as soon as the first zone filled
up, and only after that would it fill up the other zones; the way
the system stabilised was by cycling through the pages of one zone
and leaving the lower zones alone.
This reduced the available VM on a 1GB system to 128MB, which is
somewhat suboptimal ;)
What we learned from that is that we need some way to auto-balance
the reclaiming, keeping in mind the objective of evicting the
least-used page from RAM.
> I guess the problem is that a lot of the different
> page reclamation schemes first of all do not know
> how to reclaim pages for a specific zone,
> try_to_swap_out is a good example, which can be solved
> by rmaps.
Indeed. Most of the time things go right, but the current
system cannot cope at all when things go wrong. I think we
really want things like rmaps and more sturdy reclaiming
mechanisms to cope with these worst cases (and also to make
the common case easier to get right).
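(For readers unfamiliar with the rmap idea referred to here, a
rough sketch follows; the pte_t and pte_chain layout is a
simplified stand-in, not the actual rmap patch. The idea is that
each physical page keeps a chain of the PTEs that map it, so
reclaim can unmap a page in the shortage zone directly instead of
walking every process's page tables the way try_to_swap_out must.)

/* Simplified reverse-mapping sketch; these are illustrative
 * stand-ins, not the kernel's real structures. */
typedef unsigned long pte_t;

struct pte_chain {
    pte_t *ptep;                  /* one PTE that maps this page */
    struct pte_chain *next;
};

struct page {
    struct pte_chain *rmap;       /* every mapping of this page */
};

/* Unmapping one page costs O(mappings of that page) instead of
 * O(all page tables in the system), so reclaim can go straight
 * at the pages of the zone that is actually short. */
static void try_to_unmap(struct page *page)
{
    struct pte_chain *pc;

    for (pc = page->rmap; pc; pc = pc->next)
        *pc->ptep = 0;            /* clear the mapping (sketch only) */
}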
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 19:00 ` Rik van Riel
@ 2001-07-17 0:27 ` Marcelo Tosatti
2001-07-17 2:07 ` Kanoj Sarcar
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2001-07-17 0:27 UTC (permalink / raw)
To: Rik van Riel
Cc: Kanoj Sarcar, lkml, Dirk Wetter, Mike Galbraith, linux-mm,
Stephen C. Tweedie
On Mon, 16 Jul 2001, Rik van Riel wrote:
> On Mon, 16 Jul 2001, Kanoj Sarcar wrote:
>
> > Just a quick note. A per-zone page reclamation
> > method like this was what I had advocated and sent
> > patches to Linus for in the 2.3.43 time frame or so.
> > I think later performance work ripped out that work.
>
> Yes: the system ended up swapping as soon as the first zone filled
> up, and only after that would it fill up the other zones; the way
> the system stabilised was by cycling through the pages of one zone
> and leaving the lower zones alone.
>
> This reduced the available VM on a 1GB system to 128MB, which is
> somewhat suboptimal ;)
>
> What we learned from that is that we need some way to auto-balance
> the reclaiming, keeping in mind the objective of evicting the
> least-used page from RAM.
>
> > I guess the problem is that a lot of the different
> > page reclamation schemes first of all do not know
> > how to reclaim pages for a specific zone,
>
> > try_to_swap_out is a good example, which can be solved
> > by rmaps.
>
> Indeed. Most of the time things go right, but the current
> system cannot cope at all when things go wrong. I think we
> really want things like rmaps and more sturdy reclaiming
> mechanisms to cope with these worst cases (and also to make
> the common case easier to get right).
As I said to Kanoj, I agree that we really want rmaps to fix this
properly.
Now I don't see any other way of fixing this on _2.4_ except
something similar to the patch I posted. That patch can still have
problems in practice, but fundamentally _it is the right thing_, IMO.
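(A minimal sketch of the separation the patch argues for; the
field names and watermark arithmetic below are illustrative
guesses in the style of the 2.4-era code, not the patch itself.)

/* Illustrative per-zone shortage tests; the fields and thresholds
 * mirror the idea, they are not copied from the patch. */
struct zone {
    unsigned long free_pages;            /* pages on the free list */
    unsigned long inactive_clean_pages;  /* immediately reclaimable */
    unsigned long pages_low;             /* free-page watermark */
    unsigned long inactive_target;       /* desired inactive pool */
};

/* A zone is short on free pages only relative to its own watermark. */
static int zone_free_shortage(const struct zone *z)
{
    return z->free_pages < z->pages_low;
}

/* Likewise for the inactive pool. */
static int zone_inactive_shortage(const struct zone *z)
{
    return z->free_pages + z->inactive_clean_pages
           < z->pages_low + z->inactive_target;
}

/* The reclaim path then skips zones reporting no shortage of
 * their own, instead of scanning every zone whenever the global
 * counters show a shortage. */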
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-17 0:27 ` Marcelo Tosatti
@ 2001-07-17 2:07 ` Kanoj Sarcar
0 siblings, 0 replies; 21+ messages in thread
From: Kanoj Sarcar @ 2001-07-17 2:07 UTC (permalink / raw)
To: Marcelo Tosatti, Rik van Riel
Cc: Kanoj Sarcar, lkml, Dirk Wetter, Mike Galbraith, linux-mm,
Stephen C. Tweedie
--- Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
>
> On Mon, 16 Jul 2001, Rik van Riel wrote:
>
> > On Mon, 16 Jul 2001, Kanoj Sarcar wrote:
> >
> > > Just a quick note. A per-zone page reclamation method like
> > > this was what I had advocated and sent patches to Linus for
> > > in the 2.3.43 time frame or so. I think later performance
> > > work ripped out that work.
> >
> > Yes: the system ended up swapping as soon as the first zone
> > filled up, and only after that would it fill up the other
> > zones; the way the system stabilised was by cycling through
> > the pages of one zone and leaving the lower zones alone.
> >
> > This reduced the available VM on a 1GB system to 128MB, which
> > is somewhat suboptimal ;)
> >
> > What we learned from that is that we need some way to
> > auto-balance the reclaiming, keeping in mind the objective of
> > evicting the least-used page from RAM.
> >
> > > I guess the problem is that a lot of the different page
> > > reclamation schemes first of all do not know how to reclaim
> > > pages for a specific zone,
> >
> > > try_to_swap_out is a good example, which can be solved by
> > > rmaps.
> >
> > Indeed. Most of the time things go right, but the current
> > system cannot cope at all when things go wrong. I think we
> > really want things like rmaps and more sturdy reclaiming
> > mechanisms to cope with these worst cases (and also to make
> > the common case easier to get right).
>
> As I said to Kanoj, I agree that we really want rmaps to fix
> this properly.
>
> Now I don't see any other way of fixing this on _2.4_ except
> something similar to the patch I posted. That patch can still
> have problems in practice, but fundamentally _it is the right
> thing_, IMO.
Yes, I agree with you, and that is why I had sent the patch to
Linus during 2.3 in the first place.
What I am trying to point out is that you should talk to Rik and
understand why it was removed previously. Rik obviously had his
reasons at that point, but some of them might not apply anymore,
given that 2.4 is quite different from 2.3.43.
Kanoj
* Re: [PATCH] Separate global/perzone inactive/free shortage
2001-07-16 15:51 ` Kanoj Sarcar
2001-07-16 19:00 ` Rik van Riel
@ 2001-07-17 0:01 ` Marcelo Tosatti
1 sibling, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2001-07-17 0:01 UTC (permalink / raw)
To: Kanoj Sarcar
Cc: lkml, Rik van Riel, Dirk Wetter, Mike Galbraith, linux-mm,
Stephen C. Tweedie
On Mon, 16 Jul 2001, Kanoj Sarcar wrote:
>
> --- Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> > Hi,
> >
> > As is well known, the VM does not make a distinction between
> > global and per-zone shortages when trying to free memory. That
> > means if only a given memory zone is under shortage, the kernel
> > will scan pages from all zones.
> >
> > The following patch (against 2.4.6-ac2) changes the kernel
> > behaviour to avoid freeing pages from zones which do not have
> > an inactive and/or free shortage.
> >
> > Now I'm able to run memory hogs allocating 4GB of memory (on a
> > 4GB machine) without getting really long hangs on my ssh
> > session (which used to happen on stock -ac2 due to exhaustion
> > of DMA pages for networking).
> >
> > Comments?
> >
> > Dirk, can you please try the patch and tell us if it fixes
> > your problem?
>
> Just a quick note. A per-zone page reclamation method like this
> was what I had advocated and sent patches to Linus for in the
> 2.3.43 time frame or so. I think later performance work ripped
> out that work. I guess the problem is that a lot of the
> different page reclamation schemes first of all do not know how
> to reclaim pages for a specific zone, and secondly have to go
> through a lot of work before they discover that the page they
> are trying to reclaim does not belong to the shortage zone,
> hence wasting a lot of CPU time. try_to_swap_out is a good
> example, which can be solved by rmaps.
Oh sure, rmaps would fix the performance problem caused by this. But
we don't have rmaps right now, and I doubt we want rmaps for 2.4.
Besides, the performance degradation of doing the per-zone
aging/deactivation this way is nothing compared to the cost of
_not_ doing it on a per-zone basis at all, IMHO.
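(To put rough numbers on that, assuming purely for illustration a
4GB box with a 16MB DMA zone and 4KB pages: the DMA zone holds
4096 of roughly 1048576 pages, about 0.4% of memory. A zone-blind
scan therefore examines on the order of 256 pages for every DMA
page it can free, while a per-zone pass touches only the DMA
zone's own 4096 pages. Next to that, the bookkeeping of per-zone
shortage checks is a small constant cost.)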
end of thread
Thread overview: 21+ messages
2001-07-16 13:56 [PATCH] Separate global/perzone inactive/free shortage Bulent Abali
2001-07-16 15:56 ` Stephen C. Tweedie
2001-07-16 19:04 ` Rik van Riel
2001-07-18 8:54 ` Mike Galbraith
2001-07-18 10:18 ` Stephen C. Tweedie
2001-07-18 14:51 ` Mike Galbraith
2001-07-18 15:07 ` Dave McCracken
2001-07-18 16:09 ` Rik van Riel
-- strict thread matches above, loose matches on Subject: below --
2001-07-14 5:19 Marcelo Tosatti
2001-07-14 7:11 ` Marcelo Tosatti
2001-07-14 20:13 ` Dirk
[not found] ` <Pine.LNX.4.33.0107141023440.283-100000@mikeg.weiden.de>
2001-07-16 13:19 ` Stephen C. Tweedie
2001-07-16 15:44 ` Mike Galbraith
2001-07-16 18:30 ` Stephen C. Tweedie
2001-07-17 2:55 ` Mike Galbraith
2001-07-16 18:42 ` Dirk Wetter
2001-07-16 15:51 ` Kanoj Sarcar
2001-07-16 19:00 ` Rik van Riel
2001-07-17 0:27 ` Marcelo Tosatti
2001-07-17 2:07 ` Kanoj Sarcar
2001-07-17 0:01 ` Marcelo Tosatti