Re: Kernel 2.6.8.1: swap storm of death - nr_requests

linux-mm.kvack.org archive mirror

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
       [not found]             ` <4130F55A.90705@pandora.be>
@ 2004-08-28 21:43               ` Andrew Morton
  2004-08-28 21:54                 ` William Lee Irwin III
                                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Andrew Morton @ 2004-08-28 21:43 UTC (permalink / raw)
  To: Karl Vogel; +Cc: Jens Axboe, linux-mm

(Added linux-mm)

Karl Vogel <karl.vogel@pandora.be> wrote:
>
> Andrew Morton wrote:
> > Karl Vogel <karl.vogel@seagha.com> wrote:
> > 
> >>Further testing shows that all the schedulers exhibit this exact same
> >> problem when run with a nr_requests size of 8192 on the drive hosting 
> >> the swap partition.
> >>
> >> I tried noop, deadline, as and CFQ with:
> >>
> >> 	echo 8192 >/sys/block/hda/queue/nr_requests
> > 
> > 
> > That allows up to 2GB of memory to be under writeout at the same time.  The
> > VM cannot touch any of that memory.
> 
> Well I used that value as it is the default for CFQ.. and it was with 
> CFQ that I had the problems. The patch Jens offered to track down the 
> problem, commented out this 'q->nr_requests = 8192' in CFQ and it 
> helped. Therefore I tried the other schedulers with this value to see if 
> it made a difference.
> 
> So if I understand you correctly, CFQ shouldn't be using 8192 on 512Mb 
> systems?!

Yup.  It's asking for trouble to allow that much memory to be unreclaimably
pinned.

Of course, you could have the same problem with just 128 requests per
queue, and lots of queues.  I solved all these problems in the dirty memory
writeback paths.  But I forgot about swapout!
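
A rough check on the 2GB figure: worst-case memory under writeout is roughly
requests per queue, times bytes per request, times number of queues.  The
sketch below assumes each request can carry up to 256KB, a value picked only
because it reproduces the 2GB quoted above; the real per-request limit
depends on the driver and kernel version.  max_inflight_bytes() is an
illustrative helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel API: upper bound on the memory that can
 * be tied up in queued I/O, given queue depth, request size and queue count.
 */
static uint64_t max_inflight_bytes(uint64_t nr_requests,
				   uint64_t bytes_per_request,
				   uint64_t nr_queues)
{
	assert(bytes_per_request > 0);
	return nr_requests * bytes_per_request * nr_queues;
}
```

Under that assumption, 8192 requests on one queue and 128 requests on each
of 64 queues both come to 2^31 bytes, which is why a small nr_requests does
not help once there are many queues.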

> With overcommit_memory set to 1, the program can be run again after the 
> OOM kill.. but the OOM killing remains.
> 
> With overcommit_memory set to 0 a second run fails. I 'think' it's 
> because somehow SwapCache is 500Kb after the OOM, so in effect my system 
> doesn't have 1Gb to spare anymore. Doing swapoff/swapon frees this and 
> then I can do the calloc(1Gb) again.
> 
> Another way to free the SwapCached is to generate lots of I/O doing 'dd 
> if=/dev/hda of=/dev/null' ... after a while SwapCached is < 1Mb again.
> 

urgh.  It sounds like the overcommit logic forgot to account swapcache as
reclaimable.  It's been a ton of trouble, that code.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 21:43               ` Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition Andrew Morton
@ 2004-08-28 21:54                 ` William Lee Irwin III
  2004-08-28 22:13                   ` Andrew Morton
  2004-08-28 21:59                 ` Karl Vogel
  2004-08-29 16:53                 ` Jens Axboe
  2 siblings, 1 reply; 35+ messages in thread
From: William Lee Irwin III @ 2004-08-28 21:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Karl Vogel, Jens Axboe, linux-mm

Karl Vogel <karl.vogel@pandora.be> wrote:
>> With overcommit_memory set to 1, the program can be run again after the 
>> OOM kill.. but the OOM killing remains.
>> With overcommit_memory set to 0 a second run fails. I 'think' it's 
>> because somehow SwapCache is 500Kb after the OOM, so in effect my system 
>> doesn't have 1Gb to spare anymore. Doing swapoff/swapon frees this and 
>> then I can do the calloc(1Gb) again.
>> Another way to free the SwapCached is to generate lots of I/O doing 'dd 
>> if=/dev/hda of=/dev/null' ... after a while SwapCached is < 1Mb again.

On Sat, Aug 28, 2004 at 02:43:03PM -0700, Andrew Morton wrote:
> urgh.  It sounds like the overcommit logic forgot to account swapcache as
> reclaimable.  It's been a ton of trouble, that code.

For overcommit purposes, swapcache still counts as committed AS; it
requires swap as backing store to evict. So AFAICT there isn't an issue
there. I was under the impression this had something to do with IO
schedulers.


-- wli

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 21:43               ` Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition Andrew Morton
  2004-08-28 21:54                 ` William Lee Irwin III
@ 2004-08-28 21:59                 ` Karl Vogel
  2004-08-29 16:53                 ` Jens Axboe
  2 siblings, 0 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-28 21:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-mm

Andrew Morton wrote:

>>So if I understand you correctly, CFQ shouldn't be using 8192 on 512Mb 
>>systems?!
> 
> 
> Yup.  It's asking for trouble to allow that much memory to be unreclaimably
> pinned.
> 
> Of course, you could have the same problem with just 128 requests per
> queue, and lots of queues.  I solved all these problems in the dirty memory
> writeback paths.  But I forgot about swapout!

Looks like the default 128 requests that AS is using, also gives trouble 
with 128Mb systems. I just tried booting with mem=128M and then doing:

  $ while true; do expunge 130; done

which causes the OOM to kick in too. Decreasing nr_requests resolves the 
issue.
So it's not only with lots of queues (AS only uses 1 queue, right?)

>>With overcommit_memory set to 1, the program can be run again after the 
>>OOM kill.. but the OOM killing remains.
>>
>>With overcommit_memory set to 0 a second run fails. I 'think' it's 
>>because somehow SwapCache is 500Kb after the OOM, so in effect my system 
>>doesn't have 1Gb to spare anymore. Doing swapoff/swapon frees this and 
>>then I can do the calloc(1Gb) again.
>>
>>Another way to free the SwapCached is to generate lots of I/O doing 'dd 
>>if=/dev/hda of=/dev/null' ... after a while SwapCached is < 1Mb again.
>>
> 
> 
> urgh.  It sounds like the overcommit logic forgot to account swapcache as
> reclaimable.  It's been a ton of trouble, that code.

NOTE: this happens both with overcommit_memory set to 0 or 1.


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 21:54                 ` William Lee Irwin III
@ 2004-08-28 22:13                   ` Andrew Morton
  2004-08-28 22:28                     ` William Lee Irwin III
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-28 22:13 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: karl.vogel, axboe, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> Karl Vogel <karl.vogel@pandora.be> wrote:
> >> With overcommit_memory set to 1, the program can be run again after the 
> >> OOM kill.. but the OOM killing remains.
> >> With overcommit_memory set to 0 a second run fails. I 'think' it's 
> >> because somehow SwapCache is 500Kb after the OOM, so in effect my system 
> >> doesn't have 1Gb to spare anymore. Doing swapoff/swapon frees this and 
> >> then I can do the calloc(1Gb) again.
> >> Another way to free the SwapCached is to generate lots of I/O doing 'dd 
> >> if=/dev/hda of=/dev/null' ... after a while SwapCached is < 1Mb again.
> 
> On Sat, Aug 28, 2004 at 02:43:03PM -0700, Andrew Morton wrote:
> > urgh.  It sounds like the overcommit logic forgot to account swapcache as
> > reclaimable.  It's been a ton of trouble, that code.
> 
> For overcommit purposes, swapcache still counts as committed AS; it
> requires swap as backing store to evict. So AFAICT there isn't an issue
> there.

But that backing store is allocated?

> I was under the impression this had something to do with IO
> schedulers.

Separate issue.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 22:13                   ` Andrew Morton
@ 2004-08-28 22:28                     ` William Lee Irwin III
  2004-08-29 10:30                       ` Andrew Morton
  2004-08-29 16:54                       ` Jens Axboe
  0 siblings, 2 replies; 35+ messages in thread
From: William Lee Irwin III @ 2004-08-28 22:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: karl.vogel, axboe, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>> For overcommit purposes, swapcache still counts as committed AS; it
>> requires swap as backing store to evict. So AFAICT there isn't an issue
>> there.

On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> But that backing store is allocated?

Committed AS is so regardless of whether backing store has been
allocated. If it has been allocated, the reservation is cashed and held.
If it hasn't been allocated, it is reserved and held, but not cashed.
In both those cases, the reservation is still held. For anonymous memory,
the reservations are not released until it's freed, as that's the only
way for an anonymous page to make a transition to not being swap-backed.
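
The accounting described above can be pictured with a toy model.  This is
only a sketch of the idea as stated in this message: the structure and
function names are made up for illustration and do not correspond to the
kernel's overcommit code.

```c
#include <assert.h>

/* Toy accounting model; all names are hypothetical, not kernel code. */
struct toy_commit {
	unsigned long reserved;	/* pages charged against the commit limit */
	unsigned long cashed;	/* reserved pages that also hold a swap slot */
};

/* Mapping anonymous memory reserves it up front. */
static void toy_map_anon(struct toy_commit *c, unsigned long pages)
{
	c->reserved += pages;
}

/* Swapping out allocates backing store but keeps the reservation held. */
static void toy_swap_out(struct toy_commit *c, unsigned long pages)
{
	assert(c->cashed + pages <= c->reserved);
	c->cashed += pages;
}

/* Only freeing the memory releases the reservation and its swap slots. */
static void toy_free_anon(struct toy_commit *c, unsigned long pages,
			  unsigned long swapped)
{
	assert(swapped <= pages);
	assert(pages <= c->reserved && swapped <= c->cashed);
	c->reserved -= pages;
	c->cashed -= swapped;
}
```

In this model, swapping pages out moves them from "reserved" to "reserved
and cashed", and the reserved total does not drop until toy_free_anon().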


William Lee Irwin III <wli@holomorphy.com> wrote:
>> I was under the impression this had something to do with IO
>> schedulers.

On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> Separate issue.

It certainly appears to be the deciding factor from the thread.


-- wli

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 22:28                     ` William Lee Irwin III
@ 2004-08-29 10:30                       ` Andrew Morton
  2004-08-29 14:15                         ` Jens Axboe
  2004-08-29 16:54                       ` Jens Axboe
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-29 10:30 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: karl.vogel, axboe, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>
>  On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
>  > Separate issue.
> 
>  It certainly appears to be the deciding factor from the thread.
> 

It's all very bizarre.

If you do a big `usemem -m 250' on a 256MB box, you end up with all memory
in swapcache _after_ usemem exits.  That's wrong: all the memory which
usemem allocated should now be free.

But all that swapcache is reclaimable under memory pressure.  It seems to
be floating about on the LRU still.

It only happens with the CFQ elevator, and this backout patch makes it go
away.

The main effect of this patch is to increase the elevator's nr_requests
from 128 to 8192.  Something to do with that, I guess.

Manyana.

--- 25/drivers/block/ll_rw_blk.c~a	2004-08-29 03:21:41.678895384 -0700
+++ 25-akpm/drivers/block/ll_rw_blk.c	2004-08-29 03:21:50.230595328 -0700
@@ -1534,6 +1534,9 @@ request_queue_t *blk_init_queue(request_
 		printk("Using %s io scheduler\n", chosen_elevator->elevator_name);
 	}
 
+	if (elevator_init(q, chosen_elevator))
+		goto out_elv;
+
 	q->request_fn		= rfn;
 	q->back_merge_fn       	= ll_back_merge_fn;
 	q->front_merge_fn      	= ll_front_merge_fn;
@@ -1551,12 +1554,8 @@ request_queue_t *blk_init_queue(request_
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 
-	/*
-	 * all done
-	 */
-	if (!elevator_init(q, chosen_elevator))
-		return q;
-
+	return q;
+out_elv:
 	blk_cleanup_queue(q);
 out_init:
 	kmem_cache_free(requestq_cachep, q);
_


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 10:30                       ` Andrew Morton
@ 2004-08-29 14:15                         ` Jens Axboe
  2004-08-29 14:17                           ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 14:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, karl.vogel, linux-mm

On Sun, Aug 29 2004, Andrew Morton wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:
> >
> >  On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> >  > Separate issue.
> > 
> >  It certainly appears to be the deciding factor from the thread.
> > 
> 
> It's all very bizarre.
> 
> If you do a big `usemem -m 250' on a 256MB box, you end up with all memory
> in swapcache _after_ usemem exits.  That's wrong: all the memory which
> usemem allocated should now be free.
> 
> But all that swapcache is reclaimable under memory pressure.  It seems to
> be floating about on the LRU still.
> 
> It only happens with the CFQ elevator, and this backout patch makes it go
> away.

It's not bizarre: if you back out that fix (it is a fix!), ->nr_requests
isn't initialized when cfq gets there. So it'll throttle incorrectly in
may_queue, not a good idea.

> The main effect of this patch is to increase the elevator's nr_requests
> from 128 to 8192.  Something to do with that, I guess.

How do you reach that conclusion?

I think the correct fix, for now, is to remove the 8192 in CFQ. It's not
the right thing to do anyways, it should just be set from user space.
But the patch below should definitely not be backed out.

> --- 25/drivers/block/ll_rw_blk.c~a	2004-08-29 03:21:41.678895384 -0700
> +++ 25-akpm/drivers/block/ll_rw_blk.c	2004-08-29 03:21:50.230595328 -0700
> @@ -1534,6 +1534,9 @@ request_queue_t *blk_init_queue(request_
>  		printk("Using %s io scheduler\n", chosen_elevator->elevator_name);
>  	}
>  
> +	if (elevator_init(q, chosen_elevator))
> +		goto out_elv;
> +
>  	q->request_fn		= rfn;
>  	q->back_merge_fn       	= ll_back_merge_fn;
>  	q->front_merge_fn      	= ll_front_merge_fn;
> @@ -1551,12 +1554,8 @@ request_queue_t *blk_init_queue(request_
>  	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
>  	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
>  
> -	/*
> -	 * all done
> -	 */
> -	if (!elevator_init(q, chosen_elevator))
> -		return q;
> -
> +	return q;
> +out_elv:
>  	blk_cleanup_queue(q);
>  out_init:
>  	kmem_cache_free(requestq_cachep, q);

-- 
Jens Axboe


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 14:15                         ` Jens Axboe
@ 2004-08-29 14:17                           ` Jens Axboe
  2004-08-29 14:45                             ` Rik van Riel
  2004-08-29 20:18                             ` Andrew Morton
  0 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 14:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, karl.vogel, linux-mm

On Sun, Aug 29 2004, Jens Axboe wrote:
> > It's all very bizarre.
> > 
> > If you do a big `usemem -m 250' on a 256MB box, you end up with all memory
> > in swapcache _after_ usemem exits.  That's wrong: all the memory which
> > usemem allocated should now be free.
> > 
> > But all that swapcache is reclaimable under memory pressure.  It seems to
> > be floating about on the LRU still.
> > 
> > It only happens with the CFQ elevator, and this backout patch makes it go
> > away.
> 
> It's not bizarre: if you back out that fix (it is a fix!), ->nr_requests
> isn't initialized when cfq gets there. So it'll throttle incorrectly in
> may_queue, not a good idea.

Oh, and I think the main issue is the vm. It should cope correctly no
matter how much pending memory can be in progress on the queue, else it
should not write out so much. CFQ is just exposing this bug because it
defaults to bigger nr_requests.

-- 
Jens Axboe


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 14:17                           ` Jens Axboe
@ 2004-08-29 14:45                             ` Rik van Riel
  2004-08-29 20:18                             ` Andrew Morton
  1 sibling, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2004-08-29 14:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, William Lee Irwin III, karl.vogel, linux-mm

On Sun, 29 Aug 2004, Jens Axboe wrote:

> Oh, and I think the main issue is the vm. It should cope correctly no
> matter how much pending memory can be in progress on the queue, else it
> should not write out so much. CFQ is just exposing this bug because it
> defaults to bigger nr_requests.

Agreed.  If the VM is short 10MB of free memory, it really
shouldn't start 200MB worth of writes.
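
One way to picture that policy is to cap new writeout at the free-memory
shortfall, less whatever is already under I/O.  The function below is a
hypothetical sketch of the idea, not VM code, and the page counts in the
note after it assume 4KB pages.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical policy sketch: how many more pages to submit for writeout,
 * given the free-page shortfall, the pages already under I/O, and the
 * number of pages that could be written.
 */
static size_t pages_to_write(size_t shortfall, size_t inflight,
			     size_t candidates)
{
	size_t needed;

	if (inflight >= shortfall)
		return 0;	/* pending I/O already covers the shortfall */

	needed = shortfall - inflight;
	assert(needed <= shortfall);	/* no underflow after the check above */
	return candidates < needed ? candidates : needed;
}
```

With 4KB pages, a 10MB shortfall is 2560 pages and 200MB of candidates is
51200 pages, so this sketch would start 2560 pages of writes, not 51200.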

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 21:43               ` Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition Andrew Morton
  2004-08-28 21:54                 ` William Lee Irwin III
  2004-08-28 21:59                 ` Karl Vogel
@ 2004-08-29 16:53                 ` Jens Axboe
  2 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 16:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Karl Vogel, linux-mm

On Sat, Aug 28 2004, Andrew Morton wrote:
> 
> (Added linux-mm)
> 
> Karl Vogel <karl.vogel@pandora.be> wrote:
> >
> > Andrew Morton wrote:
> > > Karl Vogel <karl.vogel@seagha.com> wrote:
> > > 
> > >>Further testing shows that all the schedulers exhibit this exact same
> > >> problem when run with a nr_requests size of 8192 on the drive hosting 
> > >> the swap partition.
> > >>
> > >> I tried noop, deadline, as and CFQ with:
> > >>
> > >> 	echo 8192 >/sys/block/hda/queue/nr_requests
> > > 
> > > 
> > > That allows up to 2GB of memory to be under writeout at the same time.  The
> > > VM cannot touch any of that memory.
> > 
> > Well I used that value as it is the default for CFQ.. and it was with 
> > CFQ that I had the problems. The patch Jens offered to track down the 
> > problem, commented out this 'q->nr_requests = 8192' in CFQ and it 
> > helped. Therefore I tried the other schedulers with this value to see if 
> > it made a difference.
> > 
> > So if I understand you correctly, CFQ shouldn't be using 8192 on 512Mb 
> > systems?!
> 
> Yup.  It's asking for trouble to allow that much memory to be unreclaimably
> pinned.

It's not pinned, it's in-progress. I think it's really bad behaviour to
_allow_ so much to be in-progress, if you can't handle it. It's silly to
expect the io scheduler to know this and limit it; that belongs at a
different level (the vm, where you have such knowledge).

> Of course, you could have the same problem with just 128 requests per
> queue, and lots of queues.  I solved all these problems in the dirty memory
> writeback paths.  But I forgot about swapout!

Precisely. Or 128 requests on a 16MB system. More proof that this is a
vm problem.

-- 
Jens Axboe


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 22:28                     ` William Lee Irwin III
  2004-08-29 10:30                       ` Andrew Morton
@ 2004-08-29 16:54                       ` Jens Axboe
  2004-08-29 17:52                         ` William Lee Irwin III
  1 sibling, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 16:54 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, karl.vogel, linux-mm

On Sat, Aug 28 2004, William Lee Irwin III wrote:
> >> I was under the impression this had something to do with IO
> >> schedulers.
> 
> On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> > Separate issue.
> 
> It certainly appears to be the deciding factor from the thread.

Has nothing to do with the io scheduler itself, apart from the fact that
CFQ exposes the problem by setting a larger q->nr_requests. And that is
the very deciding factor, not the io scheduler.

-- 
Jens Axboe


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 16:54                       ` Jens Axboe
@ 2004-08-29 17:52                         ` William Lee Irwin III
  0 siblings, 0 replies; 35+ messages in thread
From: William Lee Irwin III @ 2004-08-29 17:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, karl.vogel, linux-mm

On Sat, Aug 28 2004, William Lee Irwin III wrote:
>> It certainly appears to be the deciding factor from the thread.

On Sun, Aug 29, 2004 at 06:54:59PM +0200, Jens Axboe wrote:
> Has nothing to do with the io scheduler itself, apart from the fact that
> CFQ exposes the problem by setting a larger q->nr_requests. And that is
> the very deciding factor, not the io scheduler.

Then it's narrower still, q->nr_requests. What a priori reasons are
there for this to vomit? clear_queue_congested() seems to be called
only when a request is retired, so a large number of requests in flight
may be doing something unexpected, and I'd expect large q->nr_requests
to keep large numbers of requests around. Hmm...


-- wli

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 14:17                           ` Jens Axboe
  2004-08-29 14:45                             ` Rik van Riel
@ 2004-08-29 20:18                             ` Andrew Morton
  2004-08-29 20:30                               ` Jens Axboe
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-29 20:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: wli, karl.vogel, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>
>  > > It only happens with the CFQ elevator, and this backout patch makes it go
>  > > away.
>  > 
>  > It's not bizarre: if you back out that fix (it is a fix!), ->nr_requests
>  > isn't initialized when cfq gets there. So it'll throttle incorrectly in
>  > may_queue, not a good idea.
> 
>  Oh, and I think the main issue is the vm. It should cope correctly no
>  matter how much pending memory can be in progress on the queue, else it
>  should not write out so much. CFQ is just exposing this bug because it
>  defaults to bigger nr_requests.

That was my point.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 20:18                             ` Andrew Morton
@ 2004-08-29 20:30                               ` Jens Axboe
  2004-08-29 20:59                                 ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 20:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: wli, karl.vogel, linux-mm

On Sun, Aug 29 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> >  > > It only happens with the CFQ elevator, and this backout patch makes it go
> >  > > away.
> >  > 
> >  > It's not bizarre: if you back out that fix (it is a fix!), ->nr_requests
> >  > isn't initialized when cfq gets there. So it'll throttle incorrectly in
> >  > may_queue, not a good idea.
> > 
> >  Oh, and I think the main issue is the vm. It should cope correctly no
> >  matter how much pending memory can be in progress on the queue, else it
> >  should not write out so much. CFQ is just exposing this bug because it
> >  defaults to bigger nr_requests.
> 
> That was my point.

I didn't understand your message at all, maybe that wasn't clear enough
in my email :-). You state that the main effect of that particular patch
is to bump nr_requests to 8192, which is definitely not true. The main
effect of the patch is to make sure that ->nr_requests was valid, so
that cfqd->max_queued is valid. ->nr_requests was always overwritten
with 8192 for quite some time, regardless of that patch. So this
particular change has nothing to do with that, and other io schedulers
will experience exactly this very problem with 8192 requests.

The reason you see a difference is that when ->max_queued isn't valid, you
end up blocking a lot more in get_request_wait(), because cfq_may_queue will
refuse to let you queue far more often than with the patch. Since other io
schedulers don't have these sort of checks, they behave like CFQ does
with the bug in blk_init_queue() fixed.
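
The ordering problem can be shown with a toy model.  It assumes the queue
structure starts out zeroed and that the scheduler caches its limit from
nr_requests at init time, which is how this message reads.  Every name
prefixed with toy_ is invented for illustration; this is not the CFQ or
blk_init_queue() code.

```c
#include <assert.h>

/* Toy stand-ins; field names follow the thread, the rest is hypothetical. */
struct toy_queue {
	unsigned long nr_requests;	/* queue depth limit */
	unsigned long max_queued;	/* limit cached by the scheduler */
};

#define TOY_DEFAULT_NR_REQUESTS 128UL

/* Generic queue setup fills in the default depth. */
static void toy_queue_setup(struct toy_queue *q)
{
	assert(q);
	q->nr_requests = TOY_DEFAULT_NR_REQUESTS;
}

/* Scheduler init caches a limit derived from the current nr_requests. */
static void toy_elevator_init(struct toy_queue *q)
{
	assert(q);
	q->max_queued = q->nr_requests;
}

/* Nonzero if a caller that already has `queued` requests may add another. */
static int toy_may_queue(const struct toy_queue *q, unsigned long queued)
{
	return queued < q->max_queued;
}
```

If toy_elevator_init() runs before toy_queue_setup(), max_queued is cached
as 0 and toy_may_queue() refuses every caller; in the other order the limit
is 128 and callers are admitted as expected.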

-- 
Jens Axboe


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 20:30                               ` Jens Axboe
@ 2004-08-29 20:59                                 ` Andrew Morton
  2004-08-29 22:17                                   ` William Lee Irwin III
  2004-08-30 15:20                                   ` Marcelo Tosatti
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2004-08-29 20:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: wli, karl.vogel, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>
> > That was my point.
> 
>  I didn't understand your message at all, maybe that wasn't clear enough
>  in my email :-). You state that the main effect of that particular patch
>  is to bump nr_requests to 8192, which is definitely not true. The main
>  effect of the patch is to make sure that ->nr_requests was valid, so
>  that cfqd->max_queued is valid. ->nr_requests was always overwritten
>  with 8192 for quite some time, regardless of that patch. So this
>  particular change has nothing to do with that, and other io schedulers
>  will experience exactly this very problem with 8192 requests.
> 
>  The reason you see a difference is that when ->max_queued isn't valid, you
>  end up blocking a lot more in get_request_wait(), because cfq_may_queue will
>  refuse to let you queue far more often than with the patch. Since other io
>  schedulers don't have these sort of checks, they behave like CFQ does
>  with the bug in blk_init_queue() fixed.

The changelog wasn't that detailed ;)

But yes, it's the large nr_requests which is tripping up swapout.  I'm
assuming that when a process exits with its anonymous memory still under
swap I/O we're forgetting to actually free the pages when the I/O
completes.  So we end up with a ton of zero-ref swapcache pages on the LRU.

I assume.   Something odd's happening, that's for sure.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 20:59                                 ` Andrew Morton
@ 2004-08-29 22:17                                   ` William Lee Irwin III
  2004-08-29 22:28                                     ` Andrew Morton
  2004-08-30 15:20                                   ` Marcelo Tosatti
  1 sibling, 1 reply; 35+ messages in thread
From: William Lee Irwin III @ 2004-08-29 22:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, karl.vogel, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>>  The reason you see a difference is that when ->max_queued isn't valid, you
>>  end up blocking a lot more in get_request_wait(), because cfq_may_queue will
>>  refuse to let you queue far more often than with the patch. Since other io
>>  schedulers don't have these sort of checks, they behave like CFQ does
>>  with the bug in blk_init_queue() fixed.

On Sun, Aug 29, 2004 at 01:59:17PM -0700, Andrew Morton wrote:
> The changelog wasn't that detailed ;)
> But yes, it's the large nr_requests which is tripping up swapout.  I'm
> assuming that when a process exits with its anonymous memory still under
> swap I/O we're forgetting to actually free the pages when the I/O
> completes.  So we end up with a ton of zero-ref swapcache pages on the LRU.
> I assume.   Something odd's happening, that's for sure.

Maybe we need to be checking for this in end_swap_bio_write() or
rotate_reclaimable_page()?


-- wli

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 22:17                                   ` William Lee Irwin III
@ 2004-08-29 22:28                                     ` Andrew Morton
  2004-08-30  7:41                                       ` Hugh Dickins
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-29 22:28 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: axboe, karl.vogel, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>
>  On Sun, Aug 29, 2004 at 01:59:17PM -0700, Andrew Morton wrote:
>  > The changelog wasn't that detailed ;)
>  > But yes, it's the large nr_requests which is tripping up swapout.  I'm
>  > assuming that when a process exits with its anonymous memory still under
>  > swap I/O we're forgetting to actually free the pages when the I/O
>  > completes.  So we end up with a ton of zero-ref swapcache pages on the LRU.
>  > I assume.   Something odd's happening, that's for sure.
> 
>  Maybe we need to be checking for this in end_swap_bio_write() or
>  rotate_reclaimable_page()?

Maybe.  I thought a get_page() in swap_writepage() and a put_page() in
end_swap_bio_write() would cause the page to be freed.  But not.  It needs
some actual real work done on it.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 22:28                                     ` Andrew Morton
@ 2004-08-30  7:41                                       ` Hugh Dickins
  0 siblings, 0 replies; 35+ messages in thread
From: Hugh Dickins @ 2004-08-30  7:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, axboe, karl.vogel, linux-mm

On Sun, 29 Aug 2004, Andrew Morton wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:
> >  On Sun, Aug 29, 2004 at 01:59:17PM -0700, Andrew Morton wrote:
> >  > The changelog wasn't that detailed ;)
> >  > But yes, it's the large nr_requests which is tripping up swapout.  I'm
> >  > assuming that when a process exits with its anonymous memory still under
> >  > swap I/O we're forgetting to actually free the pages when the I/O
> >  > completes.  So we end up with a ton of zero-ref swapcache pages on the LRU.
> >  > I assume.   Something odd's happening, that's for sure.
> > 
> >  Maybe we need to be checking for this in end_swap_bio_write() or
> >  rotate_reclaimable_page()?
> 
> Maybe.  I thought a get_page() in swap_writepage() and a put_page() in
> end_swap_bio_write() would cause the page to be freed.  But not.  It needs
> some actual real work done on it.

There are quite a few limitations on when a page can be freed from SwapCache.
It involves locks you wouldn't want to take from just anywhere.  If the right
conditions don't happen to be met at the time a process exits, it's quite
normal for the SwapCache pages to hang around awhile, until eventually the
__delete_from_swap_cache towards the end of shrink_list removes them.

Hugh


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 20:59                                 ` Andrew Morton
  2004-08-29 22:17                                   ` William Lee Irwin III
@ 2004-08-30 15:20                                   ` Marcelo Tosatti
  2004-08-30 18:01                                     ` Karl Vogel
  1 sibling, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-30 15:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, wli, karl.vogel, linux-mm

On Sun, Aug 29, 2004 at 01:59:17PM -0700, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > > That was my point.
> > 
> >  I didn't understand your message at all, maybe that wasn't clear enough
> >  in my email :-). You state that the main effect of that particular patch
> >  is to bump nr_requests to 8192, which is definitely not true. The main
> >  effect of the patch is to make sure that ->nr_requests was valid, so
> >  that cfqd->max_queued is valid. ->nr_requests was always overwritten
> >  with 8192 for quite some time, regardless of that patch. So this
> >  particular change has nothing to do with that, and other io schedulers
> >  will experience exactly this very problem with 8192 requests.
> > 
> >  Why you do see a difference is that when ->max_queued isn't valid, you
> >  end up blocking a lot more in get_request_wait() because cfq_may_queue will
> >  disallow you to queue a lot more than with the patch. Since other io
> >  schedulers don't have these sort of checks, they behave like CFQ does
> >  with the bug in blk_init_queue() fixed.
> 
> The changelog wasn't that detailed ;)
> 
> But yes, it's the large nr_requests which is tripping up swapout.  I'm
> assuming that when a process exits with its anonymous memory still under
> swap I/O we're forgetting to actually free the pages when the I/O
> completes.  So we end up with a ton of zero-ref swapcache pages on the LRU.

What would nr_requests have to do with swapcache not being freed after
its owner exits?

I can't reproduce the behaviour in which swapcache is not freed after
the memory hog exits (I'm using fillmem, I don't think that matters). Where
can I find usemem?

The filesystem dirty writeback (page-writeback.c) code effectively throttles tasks
on the size of the queue. blk_congestion_wait() is not enough to keep the
queue from filling up.

Same with swap IO.

So, Andrew, you say you fixed that in the dirty writeback code. Where is that?

What Jens seems to argue is that the VM needs to limit IO in flight - it
doesn't do that at all, it relies on the IO scheduler to do such limiting.
That is how Linux has always worked.

Am I missing something?

> I assume.   Something odd's happening, that's for sure.

What is the problem Karl is seeing again? There seem to be several, let's
separate them:

- OOM killer triggering (if there's swap space available and
"enough" anonymous memory to be swapped out this should not happen).
One of his complaints in the initial report (about the OOM killer).

- Swap cache not freed after the test app exits. Should not be a
problem because such memory will be freed as soon as there's
pressure, I think.

How can you reproduce that?

I can't see any big difference between using cfq/as with either 8192 or 128.
Both make the box thrash completely (i.e. become very unresponsive) as soon as
intensive swap IO starts.

---

"I can bring down my box by running a program that does a calloc() of 512Mb
(which is the size of my RAM). The box starts to heavily swap and never
recovers from it. The process that calloc's the memory gets OOM killed (which
is also strange as I have 1Gb free swap).

After the OOM kill, the shell where I started the calloc() program is alive
but very slow. The box continues to swap and the other processes remain dead. "

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 18:01                                     ` Karl Vogel
@ 2004-08-30 17:16                                       ` Marcelo Tosatti
  2004-08-30 22:59                                         ` Karl Vogel
  2004-08-30 20:33                                       ` Marcelo Tosatti
  1 sibling, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-30 17:16 UTC (permalink / raw)
  To: Karl Vogel; +Cc: Andrew Morton, Jens Axboe, wli, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2489 bytes --]


Karl, 

Please apply the attached patch and rerun your tests. With it applied, 
the OOM killer output will print the number of available swap pages at
the time of killing.

In the meantime I'll be doing some more tests.

On Mon, Aug 30, 2004 at 08:01:19PM +0200, Karl Vogel wrote:
> Marcelo Tosatti wrote:
> 
> >What is the problem Karl is seeing again? There seem to be several, let's
> >separate them:
> >
> >- OOM killer triggering (if there's swap space available and
> >"enough" anonymous memory to be swapped out this should not happen).
> >One of his complaints in the initial report (about the OOM killer).
> 
> Correct. On my 512Mb RAM system with 1Gb swap partition, running a 
> calloc(1Gb) causes the process to get OOM killed when using CFQ.
> The problem is not CFQ as such.. the problem is when nr_requests is too 
> large (8192 being the default for CFQ).
> 
> The same will happen with the default nr_requests of 128 which AS uses,
> if you use a low memory system. e.g. I booted with mem=128M and then a 
> calloc(128Mb) can trigger the OOM.
> 
> >- Swap cache not freed after the test app exits. Should not be a
> >problem because such memory will be freed as soon as there's
> >pressure, I think.
> 
> After the OOM killer killed the calloc() task, the SwapCache still
> contains a large chunk of the original allocation. This gets cleared if
> there is a lot of I/O (example: dd if=/dev/hdX of=/dev/null).
> 
> However, without the I/O's it doesn't seem to get freed.. this also 
> causes a second run of calloc(1Gb) to fail as the SwapCache still 
> accounts for used memory.
> 
> >How can you reproduce that?
> 
> It should be reproducible as follows:
> - boot with mem=512M
> - have a 1Gb swap partition / swapfile (the size doesn't really matter)
> - use CFQ or set nr_requests to 8192 on the drive _hosting the swap_
> - run  'expunge 1024'   (might work the 1st time, if so, run it again)
> 
> 
> --- expunge.c program source ---
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main(int argc, char *argv[])
> {
>     char *p= calloc(1, atol(argv[1])*1024L*1024L);
>     if (!p) {
>         perror("calloc");
>         exit(1);
>     }
>     return 0;
> }
> --- expunge.c program source ---
> 
> 
> 
> Another thing that you can try:
> - boot with mem=128M
> - have enough swap
> - execute:  while true; do expunge 128; done
> 
> This will trigger an OOM even with AS (nr_requests = 128)
> 
> 
> 
> After the OOM, SwapCache still holds part of the allocation.

[-- Attachment #2: vm-reclaim2.patch --]
[-- Type: text/plain, Size: 1331 bytes --]

--- mm/page_alloc.c.orig	2004-08-24 20:37:53.000000000 -0300
+++ mm/page_alloc.c	2004-08-24 22:51:49.498375608 -0300
@@ -1021,11 +1021,12 @@
 void show_free_areas(void)
 {
 	struct page_state ps;
-	int cpu, temperature;
+	int cpu, temperature, i;
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
 	struct zone *zone;
+	unsigned int swap_pages = 0;
 
 	for_each_zone(zone) {
 		show_node(zone);
@@ -1086,6 +1087,8 @@
 			" active:%lukB"
 			" inactive:%lukB"
 			" present:%lukB"
+			" pages_scanned:%lu"
+			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
 			K(zone->free_pages),
@@ -1094,7 +1097,9 @@
 			K(zone->pages_high),
 			K(zone->nr_active),
 			K(zone->nr_inactive),
-			K(zone->present_pages)
+			K(zone->present_pages),
+			zone->pages_scanned,
+			(zone->all_unreclaimable ? "yes" : "no")
 			);
 		printk("protections[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
@@ -1125,6 +1130,18 @@
 		printk("= %lukB\n", K(total));
 	}
 
+	swap_list_lock();
+	for (i = 0; i < nr_swapfiles; i++) {
+		if (!(swap_info[i].flags & SWP_USED) ||
+		     (swap_info[i].flags & SWP_WRITEOK))
+			continue;
+		swap_pages += swap_info[i].inuse_pages;
+	}
+	swap_pages += nr_swap_pages;
+	swap_list_unlock();
+
+	printk("nr_free_swap_pages: %u\n", swap_pages);
+
 	show_swap_cache_info();
 }
 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 15:20                                   ` Marcelo Tosatti
@ 2004-08-30 18:01                                     ` Karl Vogel
  2004-08-30 17:16                                       ` Marcelo Tosatti
  2004-08-30 20:33                                       ` Marcelo Tosatti
  0 siblings, 2 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-30 18:01 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, Jens Axboe, wli, linux-mm

Marcelo Tosatti wrote:

> What is the problem Karl is seeing again? There seem to be several, let's
> separate them:
> 
> - OOM killer triggering (if there's swap space available and
> "enough" anonymous memory to be swapped out this should not happen).
> One of his complaints in the initial report (about the OOM killer).

Correct. On my 512Mb RAM system with 1Gb swap partition, running a 
calloc(1Gb) causes the process to get OOM killed when using CFQ.
The problem is not CFQ as such.. the problem is when nr_requests is too 
large (8192 being the default for CFQ).

The same will happen with the default nr_requests of 128 which AS uses,
if you use a low memory system. e.g. I booted with mem=128M and then a 
calloc(128Mb) can trigger the OOM.

> - Swap cache not freed after the test app exits. Should not be a
> problem because such memory will be freed as soon as there's
> pressure, I think.

After the OOM killer killed the calloc() task, the SwapCache still
contains a large chunk of the original allocation. This gets cleared if
there is a lot of I/O (example: dd if=/dev/hdX of=/dev/null).

However, without the I/O's it doesn't seem to get freed.. this also 
causes a second run of calloc(1Gb) to fail as the SwapCache still 
accounts for used memory.

> How can you reproduce that?

It should be reproducible as follows:
- boot with mem=512M
- have a 1Gb swap partition / swapfile (the size doesn't really matter)
- use CFQ or set nr_requests to 8192 on the drive _hosting the swap_
- run  'expunge 1024'   (might work the 1st time, if so, run it again)


--- expunge.c program source ---
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
     char *p= calloc(1, atol(argv[1])*1024L*1024L);
     if (!p) {
         perror("calloc");
         exit(1);
     }
     return 0;
}
--- expunge.c program source ---



Another thing that you can try:
- boot with mem=128M
- have enough swap
- execute:  while true; do expunge 128; done

This will trigger an OOM even with AS (nr_requests = 128)



After the OOM, SwapCache still holds part of the allocation.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 18:01                                     ` Karl Vogel
  2004-08-30 17:16                                       ` Marcelo Tosatti
@ 2004-08-30 20:33                                       ` Marcelo Tosatti
  2004-08-30 22:37                                         ` Andrew Morton
  2004-08-30 23:02                                         ` Karl Vogel
  1 sibling, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-30 20:33 UTC (permalink / raw)
  To: Karl Vogel; +Cc: Andrew Morton, Jens Axboe, wli, linux-mm

On Mon, Aug 30, 2004 at 08:01:19PM +0200, Karl Vogel wrote:
> Marcelo Tosatti wrote:
> 
> >What is the problem Karl is seeing again? There seem to be several, let's
> >separate them:
> >
> >- OOM killer triggering (if there's swap space available and
> >"enough" anonymous memory to be swapped out this should not happen).
> >One of his complaints in the initial report (about the OOM killer).
> 
> Correct. On my 512Mb RAM system with 1Gb swap partition, running a 
> calloc(1Gb) causes the process to get OOM killed when using CFQ.
> The problem is not CFQ as such.. the problem is when nr_requests is too 
> large (8192 being the default for CFQ).
> 
> The same will happen with the default nr_requests of 128 which AS uses,
> if you use a low memory system. e.g. I booted with mem=128M and then a 
> calloc(128Mb) can trigger the OOM.

Karl,

Can you please try the following - it limits the number of in-flight writeback 
pages to 25% of total RAM at the VM level. 

Does wonders for me with 8192 nr_requests. The hogs finish _much_ faster
and interactivity feels much better.

With nr_requests=128, this limit is not reached (probably never), but with 8192,
it certainly is.

--- a/mm/vmscan.c	2004-08-30 17:50:25.000000000 -0300
+++ b/mm/vmscan.c	2004-08-30 18:34:54.666423368 -0300
@@ -247,6 +247,12 @@
 
 static int may_write_to_queue(struct backing_dev_info *bdi)
 {
+	int nr_writeback = read_page_state(nr_writeback);
+
+	if (nr_writeback > (totalram_pages * 25 / 100)) { 
+		blk_congestion_wait(WRITE, HZ/5);
+		return 0;
+	}
 	if (current_is_kswapd())
 		return 1;
 	if (current_is_pdflush())	/* This is unlikely, but why not... */

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 22:37                                         ` Andrew Morton
@ 2004-08-30 22:17                                           ` Marcelo Tosatti
  2004-08-30 23:51                                             ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-30 22:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: karl.vogel, axboe, wli, linux-mm

On Mon, Aug 30, 2004 at 03:37:30PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> >  static int may_write_to_queue(struct backing_dev_info *bdi)
> >  {
> > +	int nr_writeback = read_page_state(nr_writeback);
> > +
> > +	if (nr_writeback > (totalram_pages * 25 / 100)) { 
> > +		blk_congestion_wait(WRITE, HZ/5);
> > +		return 0;
> > +	}
> 
> That's probably a good way of special-casing this special-place problem.
> 
> For a final patch I'd be inclined to take into account /proc/sys/vm/dirty_ratio
> and to avoid running the expensive read_page_state() once per writepage.

What do you think of this, which tries to address your comments?

We might want to make shrink_caches() bail out when the limit is reached.


--- include/linux/writeback.h.orig	2004-08-30 20:18:06.291153336 -0300
+++ include/linux/writeback.h	2004-08-30 20:17:47.284042856 -0300
@@ -86,6 +86,7 @@
 int wakeup_bdflush(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
+int vm_eviction_limits(int);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
--- mm/page-writeback.c.orig	2004-08-30 20:10:50.508402384 -0300
+++ mm/page-writeback.c	2004-08-30 20:16:26.583311232 -0300
@@ -279,6 +279,21 @@
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
 
 /*
+ * This function calculates the maximum pinned-for-IO memory 
+ * the page eviction threads can generate. 
+ *
+ * Returns true if we can't write out.
+ */
+int vm_eviction_limits(int inflight) 
+{
+	if (inflight > (totalram_pages * vm_dirty_ratio) / 100)  {
+		blk_congestion_wait(WRITE, HZ/10);
+		return 1;
+	} 
+	return 0;
+}
+
+/*
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
--- vmscan.c.orig	2004-08-30 20:19:05.501152048 -0300
+++ vmscan.c	2004-08-30 20:16:38.552491640 -0300
@@ -245,8 +245,11 @@
 	return page_count(page) - !!PagePrivate(page) == 2;
 }
 
-static int may_write_to_queue(struct backing_dev_info *bdi)
+static int may_write_to_queue(struct backing_dev_info *bdi, int inflight)
 {
+	if (vm_eviction_limits(inflight)) /* Check VM writeout limit */
+		return 0;
+
 	if (current_is_kswapd())
 		return 1;
 	if (current_is_pdflush())	/* This is unlikely, but why not... */
@@ -286,7 +289,8 @@
 /*
  * pageout is called by shrink_list() for each dirty page. Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping, int
+inflight)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -311,7 +315,7 @@
 		return PAGE_KEEP;
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info))
+	if (!may_write_to_queue(mapping->backing_dev_info, inflight))
 		return PAGE_KEEP;
 
 	if (clear_page_dirty_for_io(page)) {
@@ -351,6 +355,7 @@
 	struct pagevec freed_pvec;
 	int pgactivate = 0;
 	int reclaimed = 0;
+	int inflight = read_page_state(nr_writeback);
 
 	cond_resched();
 
@@ -421,7 +426,7 @@
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
-			switch(pageout(page, mapping)) {
+			switch(pageout(page, mapping, inflight)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 20:33                                       ` Marcelo Tosatti
@ 2004-08-30 22:37                                         ` Andrew Morton
  2004-08-30 22:17                                           ` Marcelo Tosatti
  2004-08-30 23:02                                         ` Karl Vogel
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-30 22:37 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: karl.vogel, axboe, wli, linux-mm

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
>  static int may_write_to_queue(struct backing_dev_info *bdi)
>  {
> +	int nr_writeback = read_page_state(nr_writeback);
> +
> +	if (nr_writeback > (totalram_pages * 25 / 100)) { 
> +		blk_congestion_wait(WRITE, HZ/5);
> +		return 0;
> +	}

That's probably a good way of special-casing this special-place problem.

For a final patch I'd be inclined to take into account /proc/sys/vm/dirty_ratio
and to avoid running the expensive read_page_state() once per writepage.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 17:16                                       ` Marcelo Tosatti
@ 2004-08-30 22:59                                         ` Karl Vogel
  0 siblings, 0 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-30 22:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Karl Vogel, Andrew Morton, Jens Axboe, wli, linux-mm

On Monday 30 August 2004 19:16, Marcelo Tosatti wrote:
> Karl,
>
> Please apply the attached patch and rerun your tests. With it applied,
> the OOM killer output will print the number of available swap pages at
> the time of killing.

[kvo@localhost sources]$ cat /proc/meminfo
MemTotal:       515728 kB
MemFree:        495772 kB
Buffers:           556 kB
Cached:           3384 kB
SwapCached:          0 kB
Active:           7736 kB
Inactive:         1948 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:        495772 kB
SwapTotal:     1044216 kB
SwapFree:      1044216 kB
Dirty:              40 kB
Writeback:           0 kB
Mapped:           7044 kB
Slab:             5412 kB
Committed_AS:     9544 kB
PageTables:        548 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB
[kvo@localhost sources]$ date;time ./expunge 1024;date;time cat /proc/meminfo;date
Tue Aug 31 00:45:25 CEST 2004
Killed

real	0m8.662s
user	0m0.636s
sys	0m1.015s
Tue Aug 31 00:45:42 CEST 2004
MemTotal:       515728 kB
MemFree:         10364 kB
Buffers:           140 kB
Cached:           2696 kB
SwapCached:     482928 kB
Active:           2308 kB
Inactive:       484124 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:         10364 kB
SwapTotal:     1044216 kB
SwapFree:       556868 kB
Dirty:               0 kB
Writeback:      219084 kB
Mapped:           1784 kB
Slab:            13948 kB
Committed_AS:     9544 kB
PageTables:        548 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

real	0m0.655s
user	0m0.000s
sys	0m0.001s
Tue Aug 31 00:45:43 CEST 2004


dmesg output:

kswapd0: page allocation failure. order:0, mode:0x20
 [<c013e9a8>] __alloc_pages+0x1c8/0x390
 [<c013eb8f>] __get_free_pages+0x1f/0x40
 [<c014205d>] kmem_getpages+0x1d/0xb0
 [<c0142d16>] cache_grow+0xb6/0x170
 [<c0142f36>] cache_alloc_refill+0x166/0x210
 [<c015d579>] bio_alloc+0xd9/0x1b0
 [<c01431d6>] kmem_cache_alloc+0x56/0x70
 [<c01b2d5f>] radix_tree_node_alloc+0x1f/0x60
 [<c01b3002>] radix_tree_insert+0xe2/0x100
 [<c0152c42>] __add_to_swap_cache+0x72/0xf0
 [<c0152e1b>] add_to_swap+0x5b/0xb0
 [<c014599c>] shrink_list+0x43c/0x470
 [<c014e319>] page_referenced_anon+0x49/0x90
 [<c0144718>] __pagevec_release+0x28/0x40
 [<c0145b1d>] shrink_cache+0x14d/0x340
 [<c014525f>] shrink_slab+0x7f/0x180
 [<c014627a>] shrink_zone+0x9a/0xc0
 [<c014665b>] balance_pgdat+0x1cb/0x230
 [<c0146787>] kswapd+0xc7/0xe0
 [<c011cbb0>] autoremove_wake_function+0x0/0x60
 [<c010605e>] ret_from_fork+0x6/0x14
 [<c011cbb0>] autoremove_wake_function+0x0/0x60
 [<c01466c0>] kswapd+0x0/0xe0
 [<c0104291>] kernel_thread_helper+0x5/0x14

>>> lots of these cut from mail

oom-killer: gfp_mask=0xd2
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
HighMem per-cpu: empty

Free pages:         660kB (0kB HighMem)
Active:596 inactive:120914 dirty:0 writeback:120868 unstable:0 free:165 slab:5896 mapped:598 pagetables:278
DMA free:20kB min:20kB low:40kB high:60kB active:32kB inactive:11040kB present:16384kB pages_scanned:8928 all_unreclaimable? yes
protections[]: 0 0 0
Normal free:640kB min:696kB low:1392kB high:2088kB active:2352kB inactive:472616kB present:507328kB pages_scanned:276672 all_unreclaimable? yes
protections[]: 0 0 0
HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20kB
Normal: 0*4kB 0*8kB 0*16kB 0*32kB 10*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB
HighMem: empty
nr_free_swap_pages: 116933
Swap cache: add 925862, delete 804994, find 990/1254, race 0+0
Out of Memory: Killed process 2513 (expunge).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 20:33                                       ` Marcelo Tosatti
  2004-08-30 22:37                                         ` Andrew Morton
@ 2004-08-30 23:02                                         ` Karl Vogel
  1 sibling, 0 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-30 23:02 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Karl Vogel, Andrew Morton, Jens Axboe, wli, linux-mm

On Monday 30 August 2004 22:33, Marcelo Tosatti wrote:
> Can you please try the following - it limits the number of in-flight
> writeback pages to 25% of total RAM at the VM level.
>
> Does wonders for me with 8192 nr_requests. The hogs finish _much_ faster
> and interactivity feels much better.
>
> With nr_requests=128, this limit is not reached (probably never), but with
> 8192, it certainly is.
>
> --- a/mm/vmscan.c	2004-08-30 17:50:25.000000000 -0300
> +++ b/mm/vmscan.c	2004-08-30 18:34:54.666423368 -0300
> @@ -247,6 +247,12 @@
>
>  static int may_write_to_queue(struct backing_dev_info *bdi)
>  {
> +	int nr_writeback = read_page_state(nr_writeback);
> +
> +	if (nr_writeback > (totalram_pages * 25 / 100)) {
> +		blk_congestion_wait(WRITE, HZ/5);
> +		return 0;
> +	}
>  	if (current_is_kswapd())
>  		return 1;
>  	if (current_is_pdflush())	/* This is unlikely, but why not... */

This fixes the OOM for me.. I can do some more testing if needed tomorrow..

[kvo@localhost sources]$ cat /proc/meminfo
MemTotal:       515728 kB
MemFree:        445084 kB
Buffers:          9492 kB
Cached:          33268 kB
SwapCached:          0 kB
Active:          19748 kB
Inactive:        28716 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:        445084 kB
SwapTotal:     1044216 kB
SwapFree:      1044216 kB
Dirty:              84 kB
Writeback:           0 kB
Mapped:           8960 kB
Slab:            17284 kB
Committed_AS:     9544 kB
PageTables:        548 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB
[kvo@localhost sources]$ date;./expunge 1024;date;time cat /proc/meminfo;date
Tue Aug 31 00:51:20 CEST 2004
Tue Aug 31 00:51:55 CEST 2004
MemTotal:       515728 kB
MemFree:        381036 kB
Buffers:           272 kB
Cached:           2844 kB
SwapCached:     120572 kB
Active:           2036 kB
Inactive:       121868 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:        381036 kB
SwapTotal:     1044216 kB
SwapFree:       919020 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:           1420 kB
Slab:             5932 kB
Committed_AS:     9764 kB
PageTables:        572 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

real	0m0.071s
user	0m0.000s
sys	0m0.000s
Tue Aug 31 00:51:55 CEST 2004
[kvo@localhost sources]$ date;./expunge 1024;date;time cat /proc/meminfo;date
Tue Aug 31 00:52:41 CEST 2004
Tue Aug 31 00:53:16 CEST 2004
MemTotal:       515728 kB
MemFree:        383832 kB
Buffers:           220 kB
Cached:           2792 kB
SwapCached:     117196 kB
Active:           1944 kB
Inactive:       118652 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:        383832 kB
SwapTotal:     1044216 kB
SwapFree:       922316 kB
Dirty:               0 kB
Writeback:       16432 kB
Mapped:           1484 kB
Slab:             6328 kB
Committed_AS:     9768 kB
PageTables:        572 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

real	0m0.328s
user	0m0.000s
sys	0m0.001s
Tue Aug 31 00:53:16 CEST 2004
[kvo@localhost sources]$ exit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 22:17                                           ` Marcelo Tosatti
@ 2004-08-30 23:51                                             ` Andrew Morton
  2004-08-31 10:23                                               ` Marcelo Tosatti
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-30 23:51 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: karl.vogel, axboe, wli, linux-mm

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> What you think of this, which tries to address your comments

Suggest you pass the scan_control structure down into pageout(), stick
`inflight' into struct scan_control and use some flag in scan_control to
ensure that we only throttle once per try_to_free_pages()/balance_pgdat()
pass.

See, page reclaim is now, as much as possible, "batched".  Think of it as
operating in units of 32 pages at a time.  We should only examine the dirty
memory thresholds and throttle once per "batch", not once per page.
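
A minimal userspace sketch of the per-batch check described above, not kernel
code. The struct reclaim_pass type, its fields, and both function names are
hypothetical; BATCH_PAGES stands in for the 32-page unit mentioned in the text.
The in-flight count is compared against the limit once per batch, and the pass
throttles at most once.

```c
#include <stddef.h>

#define BATCH_PAGES 32	/* the 32-page unit from the text; an assumption here */

/* Hypothetical stand-in for struct scan_control; only what this sketch needs. */
struct reclaim_pass {
	unsigned long inflight;	/* pages under writeback, sampled once per batch */
	unsigned long limit;	/* maximum pages allowed under writeback */
	int throttled;		/* set once this pass has throttled */
	unsigned long checks;	/* how many times the threshold was examined */
};

/*
 * Examine the threshold once for a whole batch and throttle at most once
 * per pass. Returns 1 if the batch may start writeout, 0 otherwise.
 */
static int batch_may_write(struct reclaim_pass *rp, unsigned long inflight_now)
{
	rp->inflight = inflight_now;
	rp->checks++;
	if (rp->inflight > rp->limit) {
		if (!rp->throttled)
			rp->throttled = 1;	/* the kernel would sleep here */
		return 0;
	}
	return 1;
}

/*
 * Walk nr_pages in batches. Pages allowed to start writeout are added to
 * the in-flight count seen by later batches. Returns the pages written.
 */
static unsigned long reclaim_pages(struct reclaim_pass *rp,
				   unsigned long nr_pages,
				   unsigned long inflight_start)
{
	unsigned long written = 0;

	while (nr_pages > 0) {
		unsigned long batch = nr_pages < BATCH_PAGES ? nr_pages : BATCH_PAGES;

		if (batch_may_write(rp, inflight_start + written))
			written += batch;
		nr_pages -= batch;
	}
	return written;
}
```

With a limit of 100 pages and 320 pages to scan, this examines the threshold
10 times rather than 320, and starts writeout for 128 pages before refusing
the rest.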

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 23:51                                             ` Andrew Morton
@ 2004-08-31 10:23                                               ` Marcelo Tosatti
  2004-08-31 16:02                                                 ` Marcelo Tosatti
  2004-08-31 17:50                                                 ` Karl Vogel
  0 siblings, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-31 10:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: karl.vogel, axboe, wli, linux-mm

On Mon, Aug 30, 2004 at 04:51:00PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > What you think of this, which tries to address your comments
> 
> Suggest you pass the scan_control structure down into pageout(), stick
> `inflight' into struct scan_control and use some flag in scan_control to

Done the scan_control modifications.

> ensure that we only throttle once per try_to_free_pages()/balance_pgdat()
> pass.

Throttling once is enough

I added a 

+		 if (sc->throttled < 5) {
+                       blk_congestion_wait(WRITE, HZ/5);
+                       sc->throttled++;
+               }

This throttles at most five times per try_to_free_pages()/balance_pgdat()
pass. With only one blk_congestion_wait(WRITE, HZ/5), my 64MB boot test case
with nr_requests set to 8192 fails: the OOM killer triggers prematurely.

> See, page reclaim is now, as much as possible, "batched".  Think of it as
> operating in units of 32 pages at a time.  We should only examine the dirty
> memory thresholds and throttle once per "batch", not once per page.

That should do it 

--- mm/vmscan.c.orig	2004-08-30 20:19:05.000000000 -0300
+++ mm/vmscan.c	2004-08-31 08:30:08.323989416 -0300
@@ -73,6 +73,10 @@
 	unsigned int gfp_mask;
 
 	int may_writepage;
+
+	int inflight;
+
+	int throttled; /* how many times have we throttled on VM inflight IO limit */
 };
 
 /*
@@ -245,8 +249,30 @@
 	return page_count(page) - !!PagePrivate(page) == 2;
 }
 
-static int may_write_to_queue(struct backing_dev_info *bdi)
+/*
+ * This function calculates the maximum pinned-for-IO memory
+ * the page eviction threads can generate. If we hit the max,
+ * we throttle taking a nap.
+ *
+ * Returns true if we cant writeout.
+ */
+int vm_eviction_limits(struct scan_control *sc)
+{
+        if (sc->inflight > (totalram_pages * vm_dirty_ratio) / 100)  {
+		if (sc->throttled < 5) {
+			blk_congestion_wait(WRITE, HZ/5);
+			sc->throttled++;
+		}
+                return 1;
+        }
+        return 0;
+}
+
+static int may_write_to_queue(struct backing_dev_info *bdi, struct scan_control *sc)
 {
+	if (vm_eviction_limits(sc)) /* Check VM writeout limit */
+		return 0;
+
 	if (current_is_kswapd())
 		return 1;
 	if (current_is_pdflush())	/* This is unlikely, but why not... */
@@ -286,7 +312,7 @@
 /*
  * pageout is called by shrink_list() for each dirty page. Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping, struct scan_control *sc)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -311,7 +337,7 @@
 		return PAGE_KEEP;
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info))
+	if (!may_write_to_queue(mapping->backing_dev_info, sc))
 		return PAGE_KEEP;
 
 	if (clear_page_dirty_for_io(page)) {
@@ -421,7 +447,7 @@
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
-			switch(pageout(page, mapping)) {
+			switch(pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
@@ -807,6 +833,7 @@
 		nr_inactive = 0;
 
 	sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
+	sc->throttled = 0;
 
 	while (nr_active || nr_inactive) {
 		if (nr_active) {
@@ -819,6 +846,7 @@
 		if (nr_inactive) {
 			sc->nr_to_scan = min(nr_inactive,
 					(unsigned long)SWAP_CLUSTER_MAX);
+			sc->inflight = read_page_state(nr_writeback);
 			nr_inactive -= sc->nr_to_scan;
 			shrink_cache(zone, sc);
 			if (sc->nr_to_reclaim <= 0)
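
A rough userspace model of the limit check in the patch above, useful for
plugging in numbers. It is a sketch only: struct scan_model, inflight_limit()
and eviction_limited() are hypothetical names that mirror the patch, and the
sleep in blk_congestion_wait() is replaced by a counter.

```c
#define MAX_THROTTLES 5	/* cap on naps per pass, as in the patch */

/* Hypothetical stand-in for the two fields the patch adds to scan_control. */
struct scan_model {
	unsigned long inflight;	/* pages under writeback, like sc->inflight */
	int throttled;		/* naps taken so far in this pass */
};

/* Pages allowed in flight: (totalram_pages * vm_dirty_ratio) / 100. */
static unsigned long inflight_limit(unsigned long totalram_pages,
				    unsigned int dirty_ratio)
{
	return (totalram_pages * dirty_ratio) / 100;
}

/*
 * Returns 1 when writeout must be refused. Counts a nap only while fewer
 * than MAX_THROTTLES have been taken; the patch sleeps at this point.
 */
static int eviction_limited(struct scan_model *sc, unsigned long limit)
{
	if (sc->inflight > limit) {
		if (sc->throttled < MAX_THROTTLES)
			sc->throttled++;
		return 1;
	}
	return 0;
}
```

For a machine with MemTotal of 515728 kB, assuming 4 kB pages (128932 pages)
and a vm_dirty_ratio of 40 (an assumed value, the thread does not state it),
inflight_limit(128932, 40) gives 51572 pages, roughly 201 MB. Since each nap
lasts at most HZ/5 (200 ms), five naps bound the added delay at about one
second per pass.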

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 10:23                                               ` Marcelo Tosatti
@ 2004-08-31 16:02                                                 ` Marcelo Tosatti
  2004-08-31 17:50                                                 ` Karl Vogel
  1 sibling, 0 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-31 16:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: karl.vogel, axboe, wli, linux-mm

On Tue, Aug 31, 2004 at 07:23:42AM -0300, Marcelo Tosatti wrote:
> On Mon, Aug 30, 2004 at 04:51:00PM -0700, Andrew Morton wrote:
> > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> > >
> > > What you think of this, which tries to address your comments
> > 
> > Suggest you pass the scan_control structure down into pageout(), stick
> > `inflight' into struct scan_control and use some flag in scan_control to
> 
> Done the scan_control modifications.
> 
> > ensure that we only throttle once per try_to_free_pages()/balance_pgdat()
> > pass.
> 
> Throttling once is enough

I meant "throttling once is not enough" 

Any comments?

> I added a 
> 
> +		 if (sc->throttled < 5) {
> +                       blk_congestion_wait(WRITE, HZ/5);
> +                       sc->throttled++;
> +               }
> 
> To loop five times max per try_to_free_pages()/balance_pgdat().
> 
> Because only one blk_congestion_wait(WRITE, HZ/5)
> makes my 64MB boot testcase with 8192 nr_requests fail. The OOM killer
> triggers prematurely.
> 
> > See, page reclaim is now, as much as possible, "batched".  Think of it as
> > operating in units of 32 pages at a time.  We should only examine the dirty
> > memory thresholds and throttle once per "batch", not once per page.
> 
> That should do it 
> 
> --- mm/vmscan.c.orig	2004-08-30 20:19:05.000000000 -0300
> +++ mm/vmscan.c	2004-08-31 08:30:08.323989416 -0300
> @@ -73,6 +73,10 @@
>  	unsigned int gfp_mask;
>  
>  	int may_writepage;
> +
> +	int inflight;
> +
> +	int throttled; /* how many times have we throttled on VM inflight IO limit */
>  };
>  
>  /*
> @@ -245,8 +249,30 @@
>  	return page_count(page) - !!PagePrivate(page) == 2;
>  }
>  
> -static int may_write_to_queue(struct backing_dev_info *bdi)
> +/*
> + * This function calculates the maximum pinned-for-IO memory
> + * the page eviction threads can generate. If we hit the max,
> + * we throttle taking a nap.
> + *
> + * Returns true if we cant writeout.
> + */
> +int vm_eviction_limits(struct scan_control *sc)
> +{
> +        if (sc->inflight > (totalram_pages * vm_dirty_ratio) / 100)  {
> +		if (sc->throttled < 5) {
> +			blk_congestion_wait(WRITE, HZ/5);
> +			sc->throttled++;
> +		}
> +                return 1;
> +        }
> +        return 0;
> +}
> +
> +static int may_write_to_queue(struct backing_dev_info *bdi, struct scan_control *sc)
>  {
> +	if (vm_eviction_limits(sc)) /* Check VM writeout limit */
> +		return 0;
> +
>  	if (current_is_kswapd())
>  		return 1;
>  	if (current_is_pdflush())	/* This is unlikely, but why not... */
> @@ -286,7 +312,7 @@
>  /*
>   * pageout is called by shrink_list() for each dirty page. Calls ->writepage().
>   */
> -static pageout_t pageout(struct page *page, struct address_space *mapping)
> +static pageout_t pageout(struct page *page, struct address_space *mapping, struct scan_control *sc)
>  {
>  	/*
>  	 * If the page is dirty, only perform writeback if that write
> @@ -311,7 +337,7 @@
>  		return PAGE_KEEP;
>  	if (mapping->a_ops->writepage == NULL)
>  		return PAGE_ACTIVATE;
> -	if (!may_write_to_queue(mapping->backing_dev_info))
> +	if (!may_write_to_queue(mapping->backing_dev_info, sc))
>  		return PAGE_KEEP;
>  
>  	if (clear_page_dirty_for_io(page)) {
> @@ -421,7 +447,7 @@
>  				goto keep_locked;
>  
>  			/* Page is dirty, try to write it out here */
> -			switch(pageout(page, mapping)) {
> +			switch(pageout(page, mapping, sc)) {
>  			case PAGE_KEEP:
>  				goto keep_locked;
>  			case PAGE_ACTIVATE:
> @@ -807,6 +833,7 @@
>  		nr_inactive = 0;
>  
>  	sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
> +	sc->throttled = 0;
>  
>  	while (nr_active || nr_inactive) {
>  		if (nr_active) {
> @@ -819,6 +846,7 @@
>  		if (nr_inactive) {
>  			sc->nr_to_scan = min(nr_inactive,
>  					(unsigned long)SWAP_CLUSTER_MAX);
> +			sc->inflight = read_page_state(nr_writeback);
>  			nr_inactive -= sc->nr_to_scan;
>  			shrink_cache(zone, sc);
>  			if (sc->nr_to_reclaim <= 0)

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 17:50                                                 ` Karl Vogel
@ 2004-08-31 16:52                                                   ` Marcelo Tosatti
  2004-08-31 18:24                                                     ` Karl Vogel
  0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-31 16:52 UTC (permalink / raw)
  To: Karl Vogel; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tue, Aug 31, 2004 at 07:50:07PM +0200, Karl Vogel wrote:
> On Tuesday 31 August 2004 12:23, Marcelo Tosatti wrote:
> > On Mon, Aug 30, 2004 at 04:51:00PM -0700, Andrew Morton wrote:
> > > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> > > > What you think of this, which tries to address your comments
> > >
> > > Suggest you pass the scan_control structure down into pageout(), stick
> > > `inflight' into struct scan_control and use some flag in scan_control to
> >
> > Done the scan_control modifications.
> 
> Took the patch for a spin.. it seems to behave ok here! No more OOMs.
> 
> Quick question: is it to be expected that when I run a calloc(500Mb) on my 
> system, when X is up and amarok is streaming live audio, that everything 
> (apps) freezes for a few seconds until the calloc task exits?!
> The apps probably get pushed out to swap, but I would think that since these 
> applications are running, that their pages are kept on the active list?! 
> Setting swappiness to 0 doesn't make a difference.
> 
> Is there a concept of a minimum working set size of an application? (kind of 
> the reverse of an RSS limit)

Not really. A memory-hungry app can starve the rest of the system.

One thing: what kernel version are you using?

I've seen extreme decreases in performance (interactivity) with memory-hungry
apps when Rik's swap token code is in use.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 18:24                                                     ` Karl Vogel
@ 2004-08-31 17:25                                                       ` Marcelo Tosatti
  2004-08-31 19:36                                                         ` Karl Vogel
  2004-09-02  9:05                                                         ` Rik van Riel
  0 siblings, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-31 17:25 UTC (permalink / raw)
  To: Karl Vogel; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tue, Aug 31, 2004 at 08:24:31PM +0200, Karl Vogel wrote:
> On Tuesday 31 August 2004 18:52, Marcelo Tosatti wrote:
> > > Is there a concept of a minimum working set size of an application? (kind
> > > of the reverse of an RSS limit)
> >
> > Not really. A hungry memory app can starve the rest of the system.
> 
> I noticed that a few times on our spamassassin box :-)
> 
> > One thing: what kernel version are you using?
> 
> 2.6.9-rc1-bk3

Can you try the same tests with 2.6.8.1 and check the difference, pretty please?

> > I've seen extreme decreases in performance (interactivity) with hungry
> > memory apps with Rik's swap token code.
> 
> Decrease?!

Yep, it's odd. Rik knows the exact reason.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 10:23                                               ` Marcelo Tosatti
  2004-08-31 16:02                                                 ` Marcelo Tosatti
@ 2004-08-31 17:50                                                 ` Karl Vogel
  2004-08-31 16:52                                                   ` Marcelo Tosatti
  1 sibling, 1 reply; 35+ messages in thread
From: Karl Vogel @ 2004-08-31 17:50 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tuesday 31 August 2004 12:23, Marcelo Tosatti wrote:
> On Mon, Aug 30, 2004 at 04:51:00PM -0700, Andrew Morton wrote:
> > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> > > What you think of this, which tries to address your comments
> >
> > Suggest you pass the scan_control structure down into pageout(), stick
> > `inflight' into struct scan_control and use some flag in scan_control to
>
> Done the scan_control modifications.

Took the patch for a spin. It seems to behave OK here: no more OOMs.

Quick question: when I run a calloc() of 500MB on my system while X is up and
amarok is streaming live audio, all apps freeze for a few seconds until the
calloc task exits. Is that to be expected?!
The apps probably get pushed out to swap, but since these applications are
running, I would have thought their pages are kept on the active list.
Setting swappiness to 0 doesn't make a difference.

Is there a concept of a minimum working set size for an application (kind of
the reverse of an RSS limit)?


* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 16:52                                                   ` Marcelo Tosatti
@ 2004-08-31 18:24                                                     ` Karl Vogel
  2004-08-31 17:25                                                       ` Marcelo Tosatti
  0 siblings, 1 reply; 35+ messages in thread
From: Karl Vogel @ 2004-08-31 18:24 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tuesday 31 August 2004 18:52, Marcelo Tosatti wrote:
> > Is there a concept of a minimum working set size of an application? (kind
> > of the reverse of an RSS limit)
>
> Not really. A hungry memory app can starve the rest of the system.

I noticed that a few times on our spamassassin box :-)

> One thing: what kernel version are you using?

2.6.9-rc1-bk3

> I've seen extreme decreases in performance (interactivity) with hungry
> memory apps with Rik's swap token code.

Decrease?!

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 17:25                                                       ` Marcelo Tosatti
@ 2004-08-31 19:36                                                         ` Karl Vogel
  2004-09-02  9:05                                                         ` Rik van Riel
  1 sibling, 0 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-31 19:36 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tuesday 31 August 2004 19:25, Marcelo Tosatti wrote:
> Can you try the same tests with 2.6.8.1 and check the difference, pretty
> please?

You forgot the sugar on top :)  Anyway, 2.6.8.1 also seems to behave now. I do
get a few 'kswapd0: page allocation failure. order:0, mode:0x20' messages, but
the system doesn't OOM kill and it recovers after the expunge, although I think
it recovers a tad slower than 2.6.9-rc1-bk3.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 17:25                                                       ` Marcelo Tosatti
  2004-08-31 19:36                                                         ` Karl Vogel
@ 2004-09-02  9:05                                                         ` Rik van Riel
  1 sibling, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2004-09-02  9:05 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Karl Vogel, Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tue, 31 Aug 2004, Marcelo Tosatti wrote:
> On Tue, Aug 31, 2004 at 08:24:31PM +0200, Karl Vogel wrote:
> > On Tuesday 31 August 2004 18:52, Marcelo Tosatti wrote:
> > > I've seen extreme decreases in performance (interactivity) with hungry
> > > memory apps with Rik's swap token code.
> > 
> > Decrease?!
> 
> Yep, it's odd. Rik knows the exact reason.

Yes, it appears that the swap token patch works great on
systems where the workload consists of similar applications.
If you have a desktop, the swap token makes switching between
apps faster.  If you have a server, the swap token helps
increase throughput.

However, if you have one app that needs more memory than the
system has and the rest of the apps are all "friendly", then
the swap token can help the system hog steal resources from
the other processes.

This needs to be fixed somehow, but I'm at a conference now
so I don't think I'll get around to it this week ;)

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

