* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
       [not found] ` <4130F55A.90705@pandora.be>
@ 2004-08-28 21:43   ` Andrew Morton
  2004-08-28 21:54     ` William Lee Irwin III
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Andrew Morton @ 2004-08-28 21:43 UTC
  To: Karl Vogel; +Cc: Jens Axboe, linux-mm

(Added linux-mm)

Karl Vogel <karl.vogel@pandora.be> wrote:
>
> Andrew Morton wrote:
> > Karl Vogel <karl.vogel@seagha.com> wrote:
> >
> >>Further testing shows that all the schedulers exhibit this exact same
> >> problem when run with a nr_requests size of 8192 on the drive hosting
> >> the swap partition.
> >>
> >> I tried noop, deadline, as and CFQ with:
> >>
> >> echo 8192 >/sys/block/hda/queue/nr_requests
> >
> >
> > That allows up to 2GB of memory to be under writeout at the same time. The
> > VM cannot touch any of that memory.
>
> Well I used that value as it is the default for CFQ.. and it was with
> CFQ that I had the problems. The patch Jens offered to track down the
> problem, commented out this 'q->nr_requests = 8192' in CFQ and it
> helped. Therefore I tried the other schedulers with this value to see if
> it made a difference.
>
> So if I understand you correctly, CFQ shouldn't be using 8192 on 512Mb
> systems?!

Yup. It's asking for trouble to allow that much memory to be unreclaimably
pinned.

Of course, you could have the same problem with just 128 requests per
queue, and lots of queues. I solved all these problems in the dirty memory
writeback paths. But I forgot about swapout!

> With overcommit_memory set to 1, the program can be run again after the
> OOM kill.. but the OOM killing remains.
>
> With overcommit_memory set to 0 a second run fails. I 'think' it's
> because somehow SwapCache is 500Kb after the OOM, so in effect my system
> doesn't have 1Gb to spare anymore. Doing swapoff/swapon frees this and
> then I can do the calloc(1Gb) again.
>
> Another way to free the SwapCached is to generate lots of I/O doing 'dd
> if=/dev/hda of=/dev/null' ... after a while SwapCached is < 1Mb again.
>

urgh. It sounds like the overcommit logic forgot to account swapcache as
reclaimable. It's been a ton of trouble, that code.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 21:43 ` Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition Andrew Morton
@ 2004-08-28 21:54   ` William Lee Irwin III
  2004-08-28 22:13     ` Andrew Morton
  2004-08-28 21:59   ` Karl Vogel
  2004-08-29 16:53   ` Jens Axboe
  2 siblings, 1 reply; 35+ messages in thread
From: William Lee Irwin III @ 2004-08-28 21:54 UTC
  To: Andrew Morton; +Cc: Karl Vogel, Jens Axboe, linux-mm

Karl Vogel <karl.vogel@pandora.be> wrote:
>> With overcommit_memory set to 1, the program can be run again after the
>> OOM kill.. but the OOM killing remains.
>> With overcommit_memory set to 0 a second run fails. I 'think' it's
>> because somehow SwapCache is 500Kb after the OOM, so in effect my system
>> doesn't have 1Gb to spare anymore. Doing swapoff/swapon frees this and
>> then I can do the calloc(1Gb) again.
>> Another way to free the SwapCached is to generate lots of I/O doing 'dd
>> if=/dev/hda of=/dev/null' ... after a while SwapCached is < 1Mb again.

On Sat, Aug 28, 2004 at 02:43:03PM -0700, Andrew Morton wrote:
> urgh. It sounds like the overcommit logic forgot to account swapcache as
> reclaimable. It's been a ton of trouble, that code.

For overcommit purposes, swapcache still counts as committed AS; it
requires swap as backing store to evict. So AFAICT there isn't an issue
there.

I was under the impression this had something to do with IO schedulers.


-- wli

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 21:54   ` William Lee Irwin III
@ 2004-08-28 22:13     ` Andrew Morton
  2004-08-28 22:28       ` William Lee Irwin III
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-28 22:13 UTC
  To: William Lee Irwin III; +Cc: karl.vogel, axboe, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> Karl Vogel <karl.vogel@pandora.be> wrote:
> >> With overcommit_memory set to 1, the program can be run again after the
> >> OOM kill.. but the OOM killing remains.
> >> With overcommit_memory set to 0 a second run fails. I 'think' it's
> >> because somehow SwapCache is 500Kb after the OOM, so in effect my system
> >> doesn't have 1Gb to spare anymore. Doing swapoff/swapon frees this and
> >> then I can do the calloc(1Gb) again.
> >> Another way to free the SwapCached is to generate lots of I/O doing 'dd
> >> if=/dev/hda of=/dev/null' ... after a while SwapCached is < 1Mb again.
>
> On Sat, Aug 28, 2004 at 02:43:03PM -0700, Andrew Morton wrote:
> > urgh. It sounds like the overcommit logic forgot to account swapcache as
> > reclaimable. It's been a ton of trouble, that code.
>
> For overcommit purposes, swapcache still counts as committed AS; it
> requires swap as backing store to evict. So AFAICT there isn't an issue
> there.

But that backing store is allocated?

> I was under the impression this had something to do with IO
> schedulers.

Separate issue.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 22:13     ` Andrew Morton
@ 2004-08-28 22:28       ` William Lee Irwin III
  2004-08-29 10:30         ` Andrew Morton
  2004-08-29 16:54         ` Jens Axboe
  0 siblings, 2 replies; 35+ messages in thread
From: William Lee Irwin III @ 2004-08-28 22:28 UTC
  To: Andrew Morton; +Cc: karl.vogel, axboe, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>> For overcommit purposes, swapcache still counts as committed AS; it
>> requires swap as backing store to evict. So AFAICT there isn't an issue
>> there.

On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> But that backing store is allocated?

Committed AS is so regardless of whether backing store has been
allocated. If it has been allocated, the reservation is cashed and held.
If it hasn't been allocated, it is reserved and held, but not cashed. In
both those cases, the reservation is still held. For anonymous memory,
the reservations are not released until it's freed, as that's the only
way for an anonymous page to make a transition to not being swap-backed.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> I was under the impression this had something to do with IO
>> schedulers.

On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> Separate issue.

It certainly appears to be the deciding factor from the thread.


-- wli

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 22:28       ` William Lee Irwin III
@ 2004-08-29 10:30         ` Andrew Morton
  2004-08-29 14:15           ` Jens Axboe
  2004-08-29 16:54         ` Jens Axboe
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-29 10:30 UTC
  To: William Lee Irwin III; +Cc: karl.vogel, axboe, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> > Separate issue.
>
> It certainly appears to be the deciding factor from the thread.
>

It's all very bizarre.

If you do a big `usemem -m 250' on a 256MB box, you end up with all memory
in swapcache _after_ usemem exits. That's wrong: all the memory which
usemem allocated should now be free.

But all that swapcache is reclaimable under memory pressure. It seems to
be floating about on the LRU still.

It only happens with the CFQ elevator, and this backout patch makes it go
away.

The main effect of this patch is to increase the elevator's nr_requests
from 128 to 8192. Something to do with that, I guess. Manyana.

--- 25/drivers/block/ll_rw_blk.c~a	2004-08-29 03:21:41.678895384 -0700
+++ 25-akpm/drivers/block/ll_rw_blk.c	2004-08-29 03:21:50.230595328 -0700
@@ -1534,6 +1534,9 @@ request_queue_t *blk_init_queue(request_
 		printk("Using %s io scheduler\n", chosen_elevator->elevator_name);
 	}
 
+	if (elevator_init(q, chosen_elevator))
+		goto out_elv;
+
 	q->request_fn = rfn;
 	q->back_merge_fn = ll_back_merge_fn;
 	q->front_merge_fn = ll_front_merge_fn;
@@ -1551,12 +1554,8 @@ request_queue_t *blk_init_queue(request_
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 
-	/*
-	 * all done
-	 */
-	if (!elevator_init(q, chosen_elevator))
-		return q;
-
+	return q;
+out_elv:
 	blk_cleanup_queue(q);
 out_init:
 	kmem_cache_free(requestq_cachep, q);
_

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 10:30         ` Andrew Morton
@ 2004-08-29 14:15           ` Jens Axboe
  2004-08-29 14:17             ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 14:15 UTC
  To: Andrew Morton; +Cc: William Lee Irwin III, karl.vogel, linux-mm

On Sun, Aug 29 2004, Andrew Morton wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:
> >
> > On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> > > Separate issue.
> >
> > It certainly appears to be the deciding factor from the thread.
> >
>
> It's all very bizarre.
>
> If you do a big `usemem -m 250' on a 256MB box, you end up with all memory
> in swapcache _after_ usemem exits. That's wrong: all the memory which
> usemem allocated should now be free.
>
> But all that swapcache is reclaimable under memory pressure. It seems to
> be floating about on the LRU still.
>
> It only happens with the CFQ elevator, and this backout patch makes it go
> away.

It's not bizarre, if you backout that fix (it is a fix!), ->nr_requests
isn't initialized when cfq gets there. So it'll throttle incorrectly in
may_queue, not a good idea.

> The main effect of this patch is to increase the elevator's nr_requests
> from 128 to 8192. Something to do with that, I guess.

How do you reach that conclusion? I think the correct fix, for now, is
to remove the 8192 in CFQ. It's not the right thing to do anyways, it
should just be set from user space. But the patch below should
definitely not be backed out.

> --- 25/drivers/block/ll_rw_blk.c~a	2004-08-29 03:21:41.678895384 -0700
> +++ 25-akpm/drivers/block/ll_rw_blk.c	2004-08-29 03:21:50.230595328 -0700
> @@ -1534,6 +1534,9 @@ request_queue_t *blk_init_queue(request_
>  		printk("Using %s io scheduler\n", chosen_elevator->elevator_name);
>  	}
>  
> +	if (elevator_init(q, chosen_elevator))
> +		goto out_elv;
> +
>  	q->request_fn = rfn;
>  	q->back_merge_fn = ll_back_merge_fn;
>  	q->front_merge_fn = ll_front_merge_fn;
> @@ -1551,12 +1554,8 @@ request_queue_t *blk_init_queue(request_
>  	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
>  	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
>  
> -	/*
> -	 * all done
> -	 */
> -	if (!elevator_init(q, chosen_elevator))
> -		return q;
> -
> +	return q;
> +out_elv:
>  	blk_cleanup_queue(q);
>  out_init:
>  	kmem_cache_free(requestq_cachep, q);

-- 
Jens Axboe

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 14:15           ` Jens Axboe
@ 2004-08-29 14:17             ` Jens Axboe
  2004-08-29 14:45               ` Rik van Riel
  2004-08-29 20:18               ` Andrew Morton
  0 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 14:17 UTC
  To: Andrew Morton; +Cc: William Lee Irwin III, karl.vogel, linux-mm

On Sun, Aug 29 2004, Jens Axboe wrote:
> > It's all very bizarre.
> >
> > If you do a big `usemem -m 250' on a 256MB box, you end up with all memory
> > in swapcache _after_ usemem exits. That's wrong: all the memory which
> > usemem allocated should now be free.
> >
> > But all that swapcache is reclaimable under memory pressure. It seems to
> > be floating about on the LRU still.
> >
> > It only happens with the CFQ elevator, and this backout patch makes it go
> > away.
>
> It's not bizarre, if you backout that fix (it is a fix!), ->nr_requests
> isn't initialized when cfq gets there. So it'll throttle incorrectly in
> may_queue, not a good idea.

Oh, and I think the main issue is the vm. It should cope correctly no
matter how much pending memory can be in progress on the queue, else it
should not write out so much. CFQ is just exposing this bug because it
defaults to bigger nr_requests.

-- 
Jens Axboe

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 14:17             ` Jens Axboe
@ 2004-08-29 14:45               ` Rik van Riel
  2004-08-29 20:18               ` Andrew Morton
  1 sibling, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2004-08-29 14:45 UTC
  To: Jens Axboe; +Cc: Andrew Morton, William Lee Irwin III, karl.vogel, linux-mm

On Sun, 29 Aug 2004, Jens Axboe wrote:

> Oh, and I think the main issue is the vm. It should cope correctly no
> matter how much pending memory can be in progress on the queue, else it
> should not write out so much. CFQ is just exposing this bug because it
> defaults to bigger nr_requests.

Agreed. If the VM is short 10MB of free memory, it really
shouldn't start 200MB worth of writes.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 14:17             ` Jens Axboe
  2004-08-29 14:45               ` Rik van Riel
@ 2004-08-29 20:18               ` Andrew Morton
  2004-08-29 20:30                 ` Jens Axboe
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-29 20:18 UTC
  To: Jens Axboe; +Cc: wli, karl.vogel, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>
> > > It only happens with the CFQ elevator, and this backout patch makes it go
> > > away.
> >
> > It's not bizarre, if you backout that fix (it is a fix!), ->nr_requests
> > isn't initialized when cfq gets there. So it'll throttle incorrectly in
> > may_queue, not a good idea.
>
> Oh, and I think the main issue is the vm. It should cope correctly no
> matter how much pending memory can be in progress on the queue, else it
> should not write out so much. CFQ is just exposing this bug because it
> defaults to bigger nr_requests.

That was my point.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 20:18               ` Andrew Morton
@ 2004-08-29 20:30                 ` Jens Axboe
  2004-08-29 20:59                   ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 20:30 UTC
  To: Andrew Morton; +Cc: wli, karl.vogel, linux-mm

On Sun, Aug 29 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > > > It only happens with the CFQ elevator, and this backout patch makes it go
> > > > away.
> > >
> > > It's not bizarre, if you backout that fix (it is a fix!), ->nr_requests
> > > isn't initialized when cfq gets there. So it'll throttle incorrectly in
> > > may_queue, not a good idea.
> >
> > Oh, and I think the main issue is the vm. It should cope correctly no
> > matter how much pending memory can be in progress on the queue, else it
> > should not write out so much. CFQ is just exposing this bug because it
> > defaults to bigger nr_requests.
>
> That was my point.

I didn't understand your message at all, maybe that wasn't clear enough
in my email :-). You state that the main effect of that particular patch
is to bump nr_requests to 8192, which is definitely not true. The main
effect of the patch is to make sure that ->nr_requests was valid, so
that cfqd->max_queued is valid. ->nr_requests was always overwritten
with 8192 for quite some time, regardless of that patch. So this
particular change has nothing to do with that, and other io schedulers
will experience exactly this very problem with 8192 requests.

Why you do see a difference is that when ->max_queued isn't valid, you
end up blocking a lot more in get_request_wait() because cfq_may_queue will
disallow you to queue a lot more than with the patch. Since other io
schedulers don't have these sort of checks, they behave like CFQ does
with the bug in blk_init_queue() fixed.

-- 
Jens Axboe

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 20:30                 ` Jens Axboe
@ 2004-08-29 20:59                   ` Andrew Morton
  2004-08-29 22:17                     ` William Lee Irwin III
  2004-08-30 15:20                     ` Marcelo Tosatti
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2004-08-29 20:59 UTC
  To: Jens Axboe; +Cc: wli, karl.vogel, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>
> > That was my point.
>
> I didn't understand your message at all, maybe that wasn't clear enough
> in my email :-). You state that the main effect of that particular patch
> is to bump nr_requests to 8192, which is definitely not true. The main
> effect of the patch is to make sure that ->nr_requests was valid, so
> that cfqd->max_queued is valid. ->nr_requests was always overwritten
> with 8192 for quite some time, regardless of that patch. So this
> particular change has nothing to do with that, and other io schedulers
> will experience exactly this very problem with 8192 requests.
>
> Why you do see a difference is that when ->max_queued isn't valid, you
> end up blocking a lot more in get_request_wait() because cfq_may_queue will
> disallow you to queue a lot more than with the patch. Since other io
> schedulers don't have these sort of checks, they behave like CFQ does
> with the bug in blk_init_queue() fixed.

The changelog wasn't that detailed ;)

But yes, it's the large nr_requests which is tripping up swapout. I'm
assuming that when a process exits with its anonymous memory still under
swap I/O we're forgetting to actually free the pages when the I/O
completes. So we end up with a ton of zero-ref swapcache pages on the LRU.

I assume. Something odd's happening, that's for sure.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 20:59                   ` Andrew Morton
@ 2004-08-29 22:17                     ` William Lee Irwin III
  2004-08-29 22:28                       ` Andrew Morton
  2004-08-30 15:20                     ` Marcelo Tosatti
  1 sibling, 1 reply; 35+ messages in thread
From: William Lee Irwin III @ 2004-08-29 22:17 UTC
  To: Andrew Morton; +Cc: Jens Axboe, karl.vogel, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>> Why you do see a difference is that when ->max_queued isn't valid, you
>> end up blocking a lot more in get_request_wait() because cfq_may_queue will
>> disallow you to queue a lot more than with the patch. Since other io
>> schedulers don't have these sort of checks, they behave like CFQ does
>> with the bug in blk_init_queue() fixed.

On Sun, Aug 29, 2004 at 01:59:17PM -0700, Andrew Morton wrote:
> The changelog wasn't that detailed ;)
> But yes, it's the large nr_requests which is tripping up swapout. I'm
> assuming that when a process exits with its anonymous memory still under
> swap I/O we're forgetting to actually free the pages when the I/O
> completes. So we end up with a ton of zero-ref swapcache pages on the LRU.
> I assume. Something odd's happening, that's for sure.

Maybe we need to be checking for this in end_swap_bio_write() or
rotate_reclaimable_page()?


-- wli

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 22:17                     ` William Lee Irwin III
@ 2004-08-29 22:28                       ` Andrew Morton
  2004-08-30  7:41                         ` Hugh Dickins
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-29 22:28 UTC
  To: William Lee Irwin III; +Cc: axboe, karl.vogel, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> On Sun, Aug 29, 2004 at 01:59:17PM -0700, Andrew Morton wrote:
> > The changelog wasn't that detailed ;)
> > But yes, it's the large nr_requests which is tripping up swapout. I'm
> > assuming that when a process exits with its anonymous memory still under
> > swap I/O we're forgetting to actually free the pages when the I/O
> > completes. So we end up with a ton of zero-ref swapcache pages on the LRU.
> > I assume. Something odd's happening, that's for sure.
>
> Maybe we need to be checking for this in end_swap_bio_write() or
> rotate_reclaimable_page()?

Maybe. I thought a get_page() in swap_writepage() and a put_page() in
end_swap_bio_write() would cause the page to be freed. But not. It needs
some actual real work done on it.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 22:28                       ` Andrew Morton
@ 2004-08-30  7:41                         ` Hugh Dickins
  0 siblings, 0 replies; 35+ messages in thread
From: Hugh Dickins @ 2004-08-30 7:41 UTC
  To: Andrew Morton; +Cc: William Lee Irwin III, axboe, karl.vogel, linux-mm

On Sun, 29 Aug 2004, Andrew Morton wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:
> > On Sun, Aug 29, 2004 at 01:59:17PM -0700, Andrew Morton wrote:
> > > The changelog wasn't that detailed ;)
> > > But yes, it's the large nr_requests which is tripping up swapout. I'm
> > > assuming that when a process exits with its anonymous memory still under
> > > swap I/O we're forgetting to actually free the pages when the I/O
> > > completes. So we end up with a ton of zero-ref swapcache pages on the LRU.
> > > I assume. Something odd's happening, that's for sure.
> >
> > Maybe we need to be checking for this in end_swap_bio_write() or
> > rotate_reclaimable_page()?
>
> Maybe. I thought a get_page() in swap_writepage() and a put_page() in
> end_swap_bio_write() would cause the page to be freed. But not. It needs
> some actual real work done on it.

There are quite a few limitations on when a page can be freed from SwapCache.
Involves locks you wouldn't want to take from just anywhere.

If the right conditions don't happen to be met at the time a process exits,
it's quite normal for the SwapCache pages to hang around awhile, until
eventually the __delete_from_swap_cache towards the end of shrink_list
removes them.

Hugh

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 20:59                   ` Andrew Morton
  2004-08-29 22:17                     ` William Lee Irwin III
@ 2004-08-30 15:20                     ` Marcelo Tosatti
  2004-08-30 18:01                       ` Karl Vogel
  1 sibling, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-30 15:20 UTC
  To: Andrew Morton; +Cc: Jens Axboe, wli, karl.vogel, linux-mm

On Sun, Aug 29, 2004 at 01:59:17PM -0700, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > > That was my point.
> >
> > I didn't understand your message at all, maybe that wasn't clear enough
> > in my email :-). You state that the main effect of that particular patch
> > is to bump nr_requests to 8192, which is definitely not true. The main
> > effect of the patch is to make sure that ->nr_requests was valid, so
> > that cfqd->max_queued is valid. ->nr_requests was always overwritten
> > with 8192 for quite some time, regardless of that patch. So this
> > particular change has nothing to do with that, and other io schedulers
> > will experience exactly this very problem with 8192 requests.
> >
> > Why you do see a difference is that when ->max_queued isn't valid, you
> > end up blocking a lot more in get_request_wait() because cfq_may_queue will
> > disallow you to queue a lot more than with the patch. Since other io
> > schedulers don't have these sort of checks, they behave like CFQ does
> > with the bug in blk_init_queue() fixed.
>
> The changelog wasn't that detailed ;)
>
> But yes, it's the large nr_requests which is tripping up swapout. I'm
> assuming that when a process exits with its anonymous memory still under
> swap I/O we're forgetting to actually free the pages when the I/O
> completes. So we end up with a ton of zero-ref swapcache pages on the LRU.

What would nr_requests have to do with swapcache not being freed after
its owner exits?

I can't reproduce the behaviour where swapcache is not freed
after the memory hog has exited (I'm using fillmem, don't think that matters).

Where can I find usemem?

The filesystem dirty writeback (page-writeback.c) code effectively throttles
tasks on the size of the queue. blk_congestion_wait() is not enough to
keep the queue from getting full.

Same with swap IO.

So, Andrew, you say you fixed that in the dirty writeback code. Where is that?

What Jens seems to argue is that the VM needs to limit IO in flight - it doesn't
do that at all, it relies on the IO scheduler to do such limiting. That is how
Linux has always worked.

Am I missing something?

> I assume. Something odd's happening, that's for sure.

What is the problem Karl is seeing again? There seem to be several, let's
separate them

- OOM killer triggering (if there's swap space available and
"enough" anonymous memory to be swapped out this should not happen).
One of his complaints in the initial report (about the OOM killer).

- Swap cache not freed after test app exits. Should not be a
problem because such memory will be freed as soon as there's
pressure, I think.

How can you reproduce that?

I can't see any big difference between using cfq/as with either
8192 or 128. Both make the box thrash completely (i.e. very unresponsive) as
soon as intensive swap IO starts.

---

"I can bring down my box by running a program that does a calloc() of 512Mb
(which is the size of my RAM). The box starts to heavily swap and never
recovers from it. The process that calloc's the memory gets OOM killed (which
is also strange as I have 1Gb free swap).

After the OOM kill, the shell where I started the calloc() program is alive
but very slow. The box continues to swap and the other processes remain dead. "

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 15:20                     ` Marcelo Tosatti
@ 2004-08-30 18:01                       ` Karl Vogel
  2004-08-30 17:16                         ` Marcelo Tosatti
  2004-08-30 20:33                         ` Marcelo Tosatti
  0 siblings, 2 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-30 18:01 UTC
  To: Marcelo Tosatti; +Cc: Andrew Morton, Jens Axboe, wli, linux-mm

Marcelo Tosatti wrote:

> What is the problem Karl is seeing again? There seem to be several, let's
> separate them
>
> - OOM killer triggering (if there's swap space available and
> "enough" anonymous memory to be swapped out this should not happen).
> One of his complaints in the initial report (about the OOM killer).

Correct. On my 512Mb RAM system with 1Gb swap partition, running a
calloc(1Gb) causes the process to get OOM killed when using CFQ.
The problem is not CFQ as such.. the problem is when nr_requests is too
large (8192 being the default for CFQ).

The same will happen with the default nr_requests of 128 which AS uses,
if you use a low memory system. e.g. I booted with mem=128M and then a
calloc(128Mb) can trigger the OOM.

> - Swap cache not freed after test app exits. Should not be a
> problem because such memory will be freed as soon as there's
> pressure, I think.

After the OOM killer killed the calloc() task, the SwapCache still
contains a large chunk of the original allocation. This gets cleared if
there is a lot of I/O (example: dd if=/dev/hdX of=/dev/null).

However, without the I/O's it doesn't seem to get freed.. this also
causes a second run of calloc(1Gb) to fail as the SwapCache still
accounts for used memory.

> How can you reproduce that?

It should be reproducible as follows:
- boot with mem=512M
- have a 1Gb swap partition / swapfile (the size doesn't really matter)
- use CFQ or set nr_requests to 8192 on the drive _hosting the swap_
- run 'expunge 1024' (might work the 1st time, if so, run it again)


--- expunge.c program source ---
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	char *p= calloc(1, atol(argv[1])*1024L*1024L);
	if (!p) {
		perror("calloc");
		exit(1);
	}
	return 0;
}
--- expunge.c program source ---



Another thing that you can try:
- boot with mem=128M
- have enough swap
- execute: while true; do expunge 128; done

This will trigger an OOM even with AS (nr_requests = 128)



After the OOM, SwapCache still holds part of the allocation.

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 18:01 ` Karl Vogel
@ 2004-08-30 17:16 ` Marcelo Tosatti
  2004-08-30 22:59 ` Karl Vogel
  2004-08-30 20:33 ` Marcelo Tosatti
  1 sibling, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-30 17:16 UTC
To: Karl Vogel; +Cc: Andrew Morton, Jens Axboe, wli, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2489 bytes --]

Karl,

Please apply the attached patch and rerun your tests. With it applied,
the OOM killer output will print the number of available swap pages at
the time of killing.

In the meantime I'll be doing some more tests.

On Mon, Aug 30, 2004 at 08:01:19PM +0200, Karl Vogel wrote:
> Marcelo Tosatti wrote:
>
> >What is the problem Karl is seeing again? There seem to be several, let's
> >separate them
> >
> >- OOM killer triggering (if there's swap space available and
> >"enough" anonymous memory to be swapped out this should not happen).
> >One of his complaints in the initial report (about the OOM killer).
>
> Correct. On my 512MB RAM system with a 1GB swap partition, running a
> calloc(1GB) causes the process to get OOM killed when using CFQ.
> The problem is not CFQ as such; the problem is that nr_requests is too
> large (8192 being the default for CFQ).
>
> The same will happen with the default nr_requests of 128 which AS uses,
> if you use a low-memory system. E.g. I booted with mem=128M and then a
> calloc(128MB) can trigger the OOM.
>
> >- Swap cache not freed after test app exits. Should not be a
> >problem because such memory will be freed as soon as there's
> >pressure, I think.
>
> After the OOM killer killed the calloc() task, the SwapCache still
> contains a large chunk of the original allocation. This gets cleared if
> there is a lot of I/O (example: dd if=/dev/hdX of=/dev/null).
>
> However, without the I/O it doesn't seem to get freed. This also
> causes a second run of calloc(1GB) to fail, as the SwapCache still
> counts as used memory.
>
> >How can you reproduce that?
>
> It should be reproducible as follows:
> - boot with mem=512M
> - have a 1GB swap partition / swapfile (the size doesn't really matter)
> - use CFQ or set nr_requests to 8192 on the drive _hosting the swap_
> - run 'expunge 1024' (might work the 1st time; if so, run it again)
>
>
> --- expunge.c program source ---
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>         char *p= calloc(1, atol(argv[1])*1024L*1024L);
>         if (!p) {
>                 perror("calloc");
>                 exit(1);
>         }
>         return 0;
> }
> --- expunge.c program source ---
>
>
>
> Another thing that you can try:
> - boot with mem=128M
> - have enough swap
> - execute: while true; do expunge 128; done
>
> This will trigger an OOM even with AS (nr_requests = 128)
>
>
>
> After the OOM, SwapCache still holds part of the allocation.

[-- Attachment #2: vm-reclaim2.patch --]
[-- Type: text/plain, Size: 1331 bytes --]

--- mm/page_alloc.c.orig	2004-08-24 20:37:53.000000000 -0300
+++ mm/page_alloc.c	2004-08-24 22:51:49.498375608 -0300
@@ -1021,11 +1021,12 @@
 void show_free_areas(void)
 {
 	struct page_state ps;
-	int cpu, temperature;
+	int cpu, temperature, i;
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
 	struct zone *zone;
+	unsigned int swap_pages = 0;

 	for_each_zone(zone) {
 		show_node(zone);
@@ -1086,6 +1087,8 @@
 			" active:%lukB"
 			" inactive:%lukB"
 			" present:%lukB"
+			" pages_scanned:%lu"
+			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
 			K(zone->free_pages),
@@ -1094,7 +1097,9 @@
 			K(zone->pages_high),
 			K(zone->nr_active),
 			K(zone->nr_inactive),
-			K(zone->present_pages)
+			K(zone->present_pages),
+			zone->pages_scanned,
+			(zone->all_unreclaimable ? "yes" : "no")
 			);
 		printk("protections[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
@@ -1125,6 +1130,18 @@
 		printk("= %lukB\n", K(total));
 	}

+	swap_list_lock();
+	for (i = 0; i < nr_swapfiles; i++) {
+		if (!(swap_info[i].flags & SWP_USED) ||
+		     (swap_info[i].flags & SWP_WRITEOK))
+			continue;
+		swap_pages += swap_info[i].inuse_pages;
+	}
+	swap_pages += nr_swap_pages;
+	swap_list_unlock();
+
+	printk("nr_free_swap_pages: %u\n", swap_pages);
+
 	show_swap_cache_info();
 }
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 17:16 ` Marcelo Tosatti
@ 2004-08-30 22:59 ` Karl Vogel
  0 siblings, 0 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-30 22:59 UTC
To: Marcelo Tosatti; +Cc: Karl Vogel, Andrew Morton, Jens Axboe, wli, linux-mm

On Monday 30 August 2004 19:16, Marcelo Tosatti wrote:
> Karl,
>
> Please apply the attached patch and rerun your tests. With it applied,
> the OOM killer output will print the number of available swap pages at
> the time of killing.

[kvo@localhost sources]$ cat /proc/meminfo
MemTotal:       515728 kB
MemFree:        495772 kB
Buffers:           556 kB
Cached:           3384 kB
SwapCached:          0 kB
Active:           7736 kB
Inactive:         1948 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:        495772 kB
SwapTotal:     1044216 kB
SwapFree:      1044216 kB
Dirty:              40 kB
Writeback:           0 kB
Mapped:           7044 kB
Slab:             5412 kB
Committed_AS:     9544 kB
PageTables:        548 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

[kvo@localhost sources]$ date;time ./expunge 1024;date;time cat /proc/meminfo;date
Tue Aug 31 00:45:25 CEST 2004
Killed

real    0m8.662s
user    0m0.636s
sys     0m1.015s
Tue Aug 31 00:45:42 CEST 2004
MemTotal:       515728 kB
MemFree:         10364 kB
Buffers:           140 kB
Cached:           2696 kB
SwapCached:     482928 kB
Active:           2308 kB
Inactive:       484124 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:         10364 kB
SwapTotal:     1044216 kB
SwapFree:       556868 kB
Dirty:               0 kB
Writeback:      219084 kB
Mapped:           1784 kB
Slab:            13948 kB
Committed_AS:     9544 kB
PageTables:        548 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

real    0m0.655s
user    0m0.000s
sys     0m0.001s
Tue Aug 31 00:45:43 CEST 2004


dmesg output:

kswapd0: page allocation failure. order:0, mode:0x20
 [<c013e9a8>] __alloc_pages+0x1c8/0x390
 [<c013eb8f>] __get_free_pages+0x1f/0x40
 [<c014205d>] kmem_getpages+0x1d/0xb0
 [<c0142d16>] cache_grow+0xb6/0x170
 [<c0142f36>] cache_alloc_refill+0x166/0x210
 [<c015d579>] bio_alloc+0xd9/0x1b0
 [<c01431d6>] kmem_cache_alloc+0x56/0x70
 [<c01b2d5f>] radix_tree_node_alloc+0x1f/0x60
 [<c01b3002>] radix_tree_insert+0xe2/0x100
 [<c0152c42>] __add_to_swap_cache+0x72/0xf0
 [<c0152e1b>] add_to_swap+0x5b/0xb0
 [<c014599c>] shrink_list+0x43c/0x470
 [<c014e319>] page_referenced_anon+0x49/0x90
 [<c0144718>] __pagevec_release+0x28/0x40
 [<c0145b1d>] shrink_cache+0x14d/0x340
 [<c014525f>] shrink_slab+0x7f/0x180
 [<c014627a>] shrink_zone+0x9a/0xc0
 [<c014665b>] balance_pgdat+0x1cb/0x230
 [<c0146787>] kswapd+0xc7/0xe0
 [<c011cbb0>] autoremove_wake_function+0x0/0x60
 [<c010605e>] ret_from_fork+0x6/0x14
 [<c011cbb0>] autoremove_wake_function+0x0/0x60
 [<c01466c0>] kswapd+0x0/0xe0
 [<c0104291>] kernel_thread_helper+0x5/0x14

>>> lots of these cut from mail

oom-killer: gfp_mask=0xd2
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
HighMem per-cpu: empty

Free pages:         660kB (0kB HighMem)
Active:596 inactive:120914 dirty:0 writeback:120868 unstable:0 free:165 slab:5896 mapped:598 pagetables:278
DMA free:20kB min:20kB low:40kB high:60kB active:32kB inactive:11040kB present:16384kB pages_scanned:8928 all_unreclaimable? yes
protections[]: 0 0 0
Normal free:640kB min:696kB low:1392kB high:2088kB active:2352kB inactive:472616kB present:507328kB pages_scanned:276672 all_unreclaimable? yes
protections[]: 0 0 0
HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20kB
Normal: 0*4kB 0*8kB 0*16kB 0*32kB 10*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB
HighMem: empty
nr_free_swap_pages: 116933
Swap cache: add 925862, delete 804994, find 990/1254, race 0+0
Out of Memory: Killed process 2513 (expunge).
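The OOM report mixes page counts (the "writeback:" and "nr_free_swap_pages:" fields) with kB values, so a small conversion helps when reading it. The sketch below assumes 4 KiB pages, which fits the 4kB buddy entries in the report but is still an assumption. The helper names are hypothetical, and the shift mirrors the usual pages-to-kB conversion rather than any code posted in this thread.

```c
/* Assumed page size of 4 KiB (2^12 bytes); hypothetical constant. */
#define ASSUMED_PAGE_SHIFT 12

/* Convert a page count to kB. */
static unsigned long pages_to_kb(unsigned long pages)
{
	return pages << (ASSUMED_PAGE_SHIFT - 10);
}

/* Integer percentage of part relative to total, rounded down. */
static unsigned long percent_of(unsigned long part, unsigned long total)
{
	return part * 100 / total;
}
```

With the figures from the report, writeback:120868 works out to 483472 kB, roughly 93% of the 515728 kB MemTotal, while nr_free_swap_pages: 116933 is 467732 kB of swap still free at the time of the kill.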
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 18:01 ` Karl Vogel
  2004-08-30 17:16 ` Marcelo Tosatti
@ 2004-08-30 20:33 ` Marcelo Tosatti
  2004-08-30 22:37 ` Andrew Morton
  2004-08-30 23:02 ` Karl Vogel
  1 sibling, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-30 20:33 UTC
To: Karl Vogel; +Cc: Andrew Morton, Jens Axboe, wli, linux-mm

On Mon, Aug 30, 2004 at 08:01:19PM +0200, Karl Vogel wrote:
> Marcelo Tosatti wrote:
>
> >What is the problem Karl is seeing again? There seem to be several, let's
> >separate them
> >
> >- OOM killer triggering (if there's swap space available and
> >"enough" anonymous memory to be swapped out this should not happen).
> >One of his complaints in the initial report (about the OOM killer).
>
> Correct. On my 512MB RAM system with a 1GB swap partition, running a
> calloc(1GB) causes the process to get OOM killed when using CFQ.
> The problem is not CFQ as such; the problem is that nr_requests is too
> large (8192 being the default for CFQ).
>
> The same will happen with the default nr_requests of 128 which AS uses,
> if you use a low-memory system. E.g. I booted with mem=128M and then a
> calloc(128MB) can trigger the OOM.

Karl,

Can you please try the following? It limits the number of in-flight
writeback pages to 25% of total RAM at the VM level.

Does wonders for me with 8192 nr_requests. The hogs finish _much_ faster
and interactivity feels much better.

With nr_requests=128, this limit is not reached (probably never), but with
8192, it certainly is.

--- a/mm/vmscan.c	2004-08-30 17:50:25.000000000 -0300
+++ b/mm/vmscan.c	2004-08-30 18:34:54.666423368 -0300
@@ -247,6 +247,12 @@

 static int may_write_to_queue(struct backing_dev_info *bdi)
 {
+	int nr_writeback = read_page_state(nr_writeback);
+
+	if (nr_writeback > (totalram_pages * 25 / 100)) {
+		blk_congestion_wait(WRITE, HZ/5);
+		return 0;
+	}
 	if (current_is_kswapd())
 		return 1;
 	if (current_is_pdflush())	/* This is unlikely, but why not... */
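The check in this patch can be tried out in user space with the numbers from the earlier OOM report. The sketch below is only a model: writeback_over_limit() is a hypothetical name, and the page counts are passed in as plain arguments instead of being read from kernel state.

```c
#include <stdbool.h>

/*
 * Model of the patch's test: true when more than `percent` of RAM
 * (both values counted in pages) is under writeback.
 */
static bool writeback_over_limit(unsigned long nr_writeback,
				 unsigned long totalram_pages,
				 unsigned int percent)
{
	return nr_writeback > totalram_pages * percent / 100;
}
```

Assuming 4 KiB pages, the 515728 kB MemTotal reported earlier is 128932 pages, so the 25% limit would be 32233 pages. The 120868 pages under writeback in that OOM report are far beyond it, so a check like this would have refused further writeout well before that point.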
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 20:33 ` Marcelo Tosatti
@ 2004-08-30 22:37 ` Andrew Morton
  2004-08-30 22:17 ` Marcelo Tosatti
  2004-08-30 23:02 ` Karl Vogel
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-30 22:37 UTC
To: Marcelo Tosatti; +Cc: karl.vogel, axboe, wli, linux-mm

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
>  static int may_write_to_queue(struct backing_dev_info *bdi)
>  {
> +	int nr_writeback = read_page_state(nr_writeback);
> +
> +	if (nr_writeback > (totalram_pages * 25 / 100)) {
> +		blk_congestion_wait(WRITE, HZ/5);
> +		return 0;
> +	}

That's probably a good way of special-casing this special-place problem.

For a final patch I'd be inclined to take into account /proc/sys/vm/dirty_ratio
and to avoid running the expensive read_page_state() once per writepage.
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 22:37 ` Andrew Morton
@ 2004-08-30 22:17 ` Marcelo Tosatti
  2004-08-30 23:51 ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-30 22:17 UTC
To: Andrew Morton; +Cc: karl.vogel, axboe, wli, linux-mm

On Mon, Aug 30, 2004 at 03:37:30PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> >  static int may_write_to_queue(struct backing_dev_info *bdi)
> >  {
> > +	int nr_writeback = read_page_state(nr_writeback);
> > +
> > +	if (nr_writeback > (totalram_pages * 25 / 100)) {
> > +		blk_congestion_wait(WRITE, HZ/5);
> > +		return 0;
> > +	}
>
> That's probably a good way of special-casing this special-place problem.
>
> For a final patch I'd be inclined to take into account /proc/sys/vm/dirty_ratio
> and to avoid running the expensive read_page_state() once per writepage.

What do you think of this, which tries to address your comments?

We might want to make shrink_caches() bail out when the limit is reached.

--- include/linux/writeback.h.orig	2004-08-30 20:18:06.291153336 -0300
+++ include/linux/writeback.h	2004-08-30 20:17:47.284042856 -0300
@@ -86,6 +86,7 @@
 int wakeup_bdflush(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
+int vm_eviction_limits(int);

 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
--- mm/page-writeback.c.orig	2004-08-30 20:10:50.508402384 -0300
+++ mm/page-writeback.c	2004-08-30 20:16:26.583311232 -0300
@@ -279,6 +279,21 @@
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);

 /*
+ * This function calculates the maximum pinned-for-IO memory
+ * the page eviction threads can generate.
+ *
+ * Returns true if we cant writeout.
+ */
+int vm_eviction_limits(int inflight)
+{
+	if (inflight > (totalram_pages * vm_dirty_ratio) / 100) {
+		blk_congestion_wait(WRITE, HZ/10);
+		return 1;
+	}
+	return 0;
+}
+
+/*
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
--- vmscan.c.orig	2004-08-30 20:19:05.501152048 -0300
+++ vmscan.c	2004-08-30 20:16:38.552491640 -0300
@@ -245,8 +245,11 @@
 	return page_count(page) - !!PagePrivate(page) == 2;
 }

-static int may_write_to_queue(struct backing_dev_info *bdi)
+static int may_write_to_queue(struct backing_dev_info *bdi, int inflight)
 {
+	if (vm_eviction_limits(inflight)) /* Check VM writeout limit */
+		return 0;
+
 	if (current_is_kswapd())
 		return 1;
 	if (current_is_pdflush())	/* This is unlikely, but why not... */
@@ -286,7 +289,8 @@
 /*
  * pageout is called by shrink_list() for each dirty page. Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping, int
+inflight)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -311,7 +315,7 @@
 		return PAGE_KEEP;
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info))
+	if (!may_write_to_queue(mapping->backing_dev_info, inflight))
 		return PAGE_KEEP;

 	if (clear_page_dirty_for_io(page)) {
@@ -351,6 +355,7 @@
 	struct pagevec freed_pvec;
 	int pgactivate = 0;
 	int reclaimed = 0;
+	int inflight = read_page_state(nr_writeback);

 	cond_resched();

@@ -421,7 +426,7 @@
 				goto keep_locked;

 			/* Page is dirty, try to write it out here */
-			switch(pageout(page, mapping)) {
+			switch(pageout(page, mapping, inflight)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 22:17 ` Marcelo Tosatti
@ 2004-08-30 23:51 ` Andrew Morton
  2004-08-31 10:23 ` Marcelo Tosatti
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2004-08-30 23:51 UTC
To: Marcelo Tosatti; +Cc: karl.vogel, axboe, wli, linux-mm

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> What you think of this, which tries to address your comments

Suggest you pass the scan_control structure down into pageout(), stick
`inflight' into struct scan_control and use some flag in scan_control to
ensure that we only throttle once per try_to_free_pages()/balance_pgdat()
pass.

See, page reclaim is now, as much as possible, "batched". Think of it as
operating in units of 32 pages at a time. We should only examine the dirty
memory thresholds and throttle once per "batch", not once per page.
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 23:51 ` Andrew Morton
@ 2004-08-31 10:23 ` Marcelo Tosatti
  2004-08-31 16:02 ` Marcelo Tosatti
  2004-08-31 17:50 ` Karl Vogel
  0 siblings, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-31 10:23 UTC
To: Andrew Morton; +Cc: karl.vogel, axboe, wli, linux-mm

On Mon, Aug 30, 2004 at 04:51:00PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > What you think of this, which tries to address your comments
>
> Suggest you pass the scan_control structure down into pageout(), stick
> `inflight' into struct scan_control and use some flag in scan_control to

Done the scan_control modifications.

> ensure that we only throttle once per try_to_free_pages()/balance_pgdat()
> pass.

Throttling once is enough

I added a

+		if (sc->throttled < 5) {
+			blk_congestion_wait(WRITE, HZ/5);
+			sc->throttled++;
+		}

to loop five times max per try_to_free_pages()/balance_pgdat(),

because only one blk_congestion_wait(WRITE, HZ/5)
makes my 64MB boot testcase with 8192 nr_requests fail. The OOM killer
triggers prematurely.

> See, page reclaim is now, as much as possible, "batched". Think of it as
> operating in units of 32 pages at a time. We should only examine the dirty
> memory thresholds and throttle once per "batch", not once per page.

That should do it

--- mm/vmscan.c.orig	2004-08-30 20:19:05.000000000 -0300
+++ mm/vmscan.c	2004-08-31 08:30:08.323989416 -0300
@@ -73,6 +73,10 @@
 	unsigned int gfp_mask;

 	int may_writepage;
+
+	int inflight;
+
+	int throttled; /* how many times have we throttled on VM inflight IO limit */
 };

 /*
@@ -245,8 +249,30 @@
 	return page_count(page) - !!PagePrivate(page) == 2;
 }

-static int may_write_to_queue(struct backing_dev_info *bdi)
+/*
+ * This function calculates the maximum pinned-for-IO memory
+ * the page eviction threads can generate. If we hit the max,
+ * we throttle taking a nap.
+ *
+ * Returns true if we cant writeout.
+ */
+int vm_eviction_limits(struct scan_control *sc)
+{
+	if (sc->inflight > (totalram_pages * vm_dirty_ratio) / 100) {
+		if (sc->throttled < 5) {
+			blk_congestion_wait(WRITE, HZ/5);
+			sc->throttled++;
+		}
+		return 1;
+	}
+	return 0;
+}
+
+static int may_write_to_queue(struct backing_dev_info *bdi, struct scan_control *sc)
 {
+	if (vm_eviction_limits(sc)) /* Check VM writeout limit */
+		return 0;
+
 	if (current_is_kswapd())
 		return 1;
 	if (current_is_pdflush())	/* This is unlikely, but why not... */
@@ -286,7 +312,7 @@
 /*
  * pageout is called by shrink_list() for each dirty page. Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping, struct scan_control *sc)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -311,7 +337,7 @@
 		return PAGE_KEEP;
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info))
+	if (!may_write_to_queue(mapping->backing_dev_info, sc))
 		return PAGE_KEEP;

 	if (clear_page_dirty_for_io(page)) {
@@ -421,7 +447,7 @@
 				goto keep_locked;

 			/* Page is dirty, try to write it out here */
-			switch(pageout(page, mapping)) {
+			switch(pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
@@ -807,6 +833,7 @@
 	nr_inactive = 0;

 	sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
+	sc->throttled = 0;

 	while (nr_active || nr_inactive) {
 		if (nr_active) {
@@ -819,6 +846,7 @@
 		if (nr_inactive) {
 			sc->nr_to_scan = min(nr_inactive,
 					(unsigned long)SWAP_CLUSTER_MAX);
+			sc->inflight = read_page_state(nr_writeback);
 			nr_inactive -= sc->nr_to_scan;
 			shrink_cache(zone, sc);
 			if (sc->nr_to_reclaim <= 0)
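The capped-throttle logic in this version of the patch can be modelled in user space to see how the nap budget behaves. The sketch below is an illustration only: struct scan_model, MAX_THROTTLES and eviction_limited() are hypothetical names, the nap is replaced by a counter, and the dirty ratio is passed in as a plain argument (40 is used in the example as an assumed value).

```c
/* Hypothetical stand-in for the two fields the patch adds to scan_control. */
struct scan_model {
	unsigned long inflight;	/* pages under writeback, sampled per batch */
	int throttled;		/* naps taken so far in this reclaim pass */
	int naps;		/* counts simulated naps, for inspection only */
};

#define MAX_THROTTLES 5		/* same cap as in the patch */

/*
 * Returns 1 when writeout should be refused because too much memory is
 * already in flight. Naps at most MAX_THROTTLES times per pass.
 */
static int eviction_limited(struct scan_model *sc,
			    unsigned long totalram_pages,
			    unsigned int dirty_ratio)
{
	if (sc->inflight > totalram_pages * dirty_ratio / 100) {
		if (sc->throttled < MAX_THROTTLES) {
			sc->naps++;	/* stands in for blk_congestion_wait() */
			sc->throttled++;
		}
		return 1;
	}
	return 0;
}
```

Calling eviction_limited() repeatedly with the in-flight count above the limit keeps returning 1 but stops napping after the fifth call, which matches the stated intent of throttling at most five times per try_to_free_pages()/balance_pgdat() pass.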
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition 2004-08-31 10:23 ` Marcelo Tosatti @ 2004-08-31 16:02 ` Marcelo Tosatti 2004-08-31 17:50 ` Karl Vogel 1 sibling, 0 replies; 35+ messages in thread From: Marcelo Tosatti @ 2004-08-31 16:02 UTC (permalink / raw) To: Andrew Morton; +Cc: karl.vogel, axboe, wli, linux-mm On Tue, Aug 31, 2004 at 07:23:42AM -0300, Marcelo Tosatti wrote: > On Mon, Aug 30, 2004 at 04:51:00PM -0700, Andrew Morton wrote: > > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote: > > > > > > What you think of this, which tries to address your comments > > > > Suggest you pass the scan_control structure down into pageout(), stick > > `inflight' into struct scan_control and use some flag in scan_control to > > Done the scan_control modifications. > > > ensure that we only throttle once per try_to_free_pages()/blaance_pgdat() > > pass. > > Throttling once is enough I meant "throttling once is not enough" Any comments? > I added a > > + if (sc->throttled < 5) { > + blk_congestion_wait(WRITE, HZ/5); > + sc->throttled++; > + } > > To loop five times max per try_to_free_pages()/balance_pgdat(). > > Because only one blk_congestion_wait(WRITE, HZ/5) > makes my 64MB boot testcase with 8192 nr_requests fail. The OOM killer > triggers prematurely. > > > See, page reclaim is now, as much as possible, "batched". Think of it as > > operating in units of 32 pages at a time. We should only examine the dirty > > memory thresholds and throttle once per "batch", not once per page. 
> > That should do it > > --- mm/vmscan.c.orig 2004-08-30 20:19:05.000000000 -0300 > +++ mm/vmscan.c 2004-08-31 08:30:08.323989416 -0300 > @@ -73,6 +73,10 @@ > unsigned int gfp_mask; > > int may_writepage; > + > + int inflight; > + > + int throttled; /* how many times have we throttled on VM inflight IO limit */ > }; > > /* > @@ -245,8 +249,30 @@ > return page_count(page) - !!PagePrivate(page) == 2; > } > > -static int may_write_to_queue(struct backing_dev_info *bdi) > +/* > + * This function calculates the maximum pinned-for-IO memory > + * the page eviction threads can generate. If we hit the max, > + * we throttle taking a nap. > + * > + * Returns true if we cant writeout. > + */ > +int vm_eviction_limits(struct scan_control *sc) > +{ > + if (sc->inflight > (totalram_pages * vm_dirty_ratio) / 100) { > + if (sc->throttled < 5) { > + blk_congestion_wait(WRITE, HZ/5); > + sc->throttled++; > + } > + return 1; > + } > + return 0; > +} > + > +static int may_write_to_queue(struct backing_dev_info *bdi, struct scan_control *sc) > { > + if (vm_eviction_limits(sc)) /* Check VM writeout limit */ > + return 0; > + > if (current_is_kswapd()) > return 1; > if (current_is_pdflush()) /* This is unlikely, but why not... */ > @@ -286,7 +312,7 @@ > /* > * pageout is called by shrink_list() for each dirty page. Calls ->writepage(). 
> */ > -static pageout_t pageout(struct page *page, struct address_space *mapping) > +static pageout_t pageout(struct page *page, struct address_space *mapping, struct scan_control *sc) > { > /* > * If the page is dirty, only perform writeback if that write > @@ -311,7 +337,7 @@ > return PAGE_KEEP; > if (mapping->a_ops->writepage == NULL) > return PAGE_ACTIVATE; > - if (!may_write_to_queue(mapping->backing_dev_info)) > + if (!may_write_to_queue(mapping->backing_dev_info, sc)) > return PAGE_KEEP; > > if (clear_page_dirty_for_io(page)) { > @@ -421,7 +447,7 @@ > goto keep_locked; > > /* Page is dirty, try to write it out here */ > - switch(pageout(page, mapping)) { > + switch(pageout(page, mapping, sc)) { > case PAGE_KEEP: > goto keep_locked; > case PAGE_ACTIVATE: > @@ -807,6 +833,7 @@ > nr_inactive = 0; > > sc->nr_to_reclaim = SWAP_CLUSTER_MAX; > + sc->throttled = 0; > > while (nr_active || nr_inactive) { > if (nr_active) { > @@ -819,6 +846,7 @@ > if (nr_inactive) { > sc->nr_to_scan = min(nr_inactive, > (unsigned long)SWAP_CLUSTER_MAX); > + sc->inflight = read_page_state(nr_writeback); > nr_inactive -= sc->nr_to_scan; > shrink_cache(zone, sc); > if (sc->nr_to_reclaim <= 0) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 10:23 ` Marcelo Tosatti
  2004-08-31 16:02 ` Marcelo Tosatti
@ 2004-08-31 17:50 ` Karl Vogel
  2004-08-31 16:52 ` Marcelo Tosatti
  1 sibling, 1 reply; 35+ messages in thread
From: Karl Vogel @ 2004-08-31 17:50 UTC
To: Marcelo Tosatti; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tuesday 31 August 2004 12:23, Marcelo Tosatti wrote:
> On Mon, Aug 30, 2004 at 04:51:00PM -0700, Andrew Morton wrote:
> > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> > > What you think of this, which tries to address your comments
> >
> > Suggest you pass the scan_control structure down into pageout(), stick
> > `inflight' into struct scan_control and use some flag in scan_control to
>
> Done the scan_control modifications.

Took the patch for a spin.. it seems to behave ok here! No more OOMs.

Quick question: is it to be expected that when I run a calloc(500MB) on my
system, when X is up and amarok is streaming live audio, that everything
(apps) freezes for a few seconds until the calloc task exits?!
The apps probably get pushed out to swap, but I would think that since these
applications are running, their pages are kept on the active list?!
Setting swappiness to 0 doesn't make a difference.

Is there a concept of a minimum working set size of an application? (kind of
the reverse of an RSS limit)
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition 2004-08-31 17:50 ` Karl Vogel @ 2004-08-31 16:52 ` Marcelo Tosatti 2004-08-31 18:24 ` Karl Vogel 0 siblings, 1 reply; 35+ messages in thread From: Marcelo Tosatti @ 2004-08-31 16:52 UTC (permalink / raw) To: Karl Vogel; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm On Tue, Aug 31, 2004 at 07:50:07PM +0200, Karl Vogel wrote: > On Tuesday 31 August 2004 12:23, Marcelo Tosatti wrote: > > On Mon, Aug 30, 2004 at 04:51:00PM -0700, Andrew Morton wrote: > > > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote: > > > > What you think of this, which tries to address your comments > > > > > > Suggest you pass the scan_control structure down into pageout(), stick > > > `inflight' into struct scan_control and use some flag in scan_control to > > > > Done the scan_control modifications. > > Took the patch for a spin.. it seems to behave ok here! No more OOMs. > > Quick question: is it to be expected that when I run a calloc(500Mb) on my > system, when X is up and amarok is streaming live audio, that everything > (apps) freezes for a few seconds until the calloc task exits?! > The apps probably get pushed out to swap, but I would think that since these > applications are running, that their pages are kept on the active list?! > Setting swappiness to 0 doesn't make a difference. > > Is there a concept of a minimum working set size of an application? (kind of > the reverse of an RSS limit) Not really. A hungry memory app can starve the rest of the system. One thing: what kernel version are you using? I've seen extreme decreases in performance (interactivity) with hungry memory apps with Rik's swap token code. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition 2004-08-31 16:52 ` Marcelo Tosatti @ 2004-08-31 18:24 ` Karl Vogel 2004-08-31 17:25 ` Marcelo Tosatti 0 siblings, 1 reply; 35+ messages in thread From: Karl Vogel @ 2004-08-31 18:24 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm On Tuesday 31 August 2004 18:52, Marcelo Tosatti wrote: > > Is there a concept of a minimum working set size of an application? (kind > > of the reverse of an RSS limit) > > Not really. A hungry memory app can starve the rest of the system. I noticed that a few times on our spamassassin box :-) > One thing: what kernel version are you using? 2.6.9-rc1-bk3 > I've seen extreme decreases in performance (interactivity) with hungry > memory apps with Rik's swap token code. Decrease?! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 18:24 ` Karl Vogel
@ 2004-08-31 17:25 ` Marcelo Tosatti
  2004-08-31 19:36 ` Karl Vogel
  2004-09-02 9:05 ` Rik van Riel
  0 siblings, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2004-08-31 17:25 UTC
To: Karl Vogel; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tue, Aug 31, 2004 at 08:24:31PM +0200, Karl Vogel wrote:
> On Tuesday 31 August 2004 18:52, Marcelo Tosatti wrote:
> > > Is there a concept of a minimum working set size of an application? (kind
> > > of the reverse of an RSS limit)
> >
> > Not really. A hungry memory app can starve the rest of the system.
>
> I noticed that a few times on our spamassassin box :-)
>
> > One thing: what kernel version are you using?
>
> 2.6.9-rc1-bk3

Can you try the same tests with 2.6.8.1 and check the difference, pretty
please?

> > I've seen extreme decreases in performance (interactivity) with hungry
> > memory apps with Rik's swap token code.
>
> Decrease?!

Yep, it's odd. Rik knows the exact reason.
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition 2004-08-31 17:25 ` Marcelo Tosatti @ 2004-08-31 19:36 ` Karl Vogel 2004-09-02 9:05 ` Rik van Riel 1 sibling, 0 replies; 35+ messages in thread From: Karl Vogel @ 2004-08-31 19:36 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Morton, karl.vogel, axboe, wli, linux-mm On Tuesday 31 August 2004 19:25, Marcelo Tosatti wrote: > Can you try the same tests with 2.6.8.1 and check the difference, pretty > please? You forgot the sugar on top :) Anyway 2.6.8.1 also seems to behave now.. I do get a few 'kswapd0: page allocation failure. order:0, mode:0x20' but the system doesn't OOM kill and it recovers after the expunge. Although I think it recovers a tad slower than 2.6.9-rc1-bk3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-31 17:25 ` Marcelo Tosatti
  2004-08-31 19:36 ` Karl Vogel
@ 2004-09-02 9:05 ` Rik van Riel
  1 sibling, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2004-09-02 9:05 UTC
  To: Marcelo Tosatti
  Cc: Karl Vogel, Andrew Morton, karl.vogel, axboe, wli, linux-mm

On Tue, 31 Aug 2004, Marcelo Tosatti wrote:
> On Tue, Aug 31, 2004 at 08:24:31PM +0200, Karl Vogel wrote:
> > On Tuesday 31 August 2004 18:52, Marcelo Tosatti wrote:

> > > I've seen extreme decreases in performance (interactivity) with hungry
> > > memory apps with Rik's swap token code.
> >
> > Decrease?!
>
> Yep, it's odd. Rik knows the exact reason.

Yes, it appears that the swap token patch works great on
systems where the workload consists of similar applications.

If you have a desktop, the swap token makes switching
between apps faster. If you have a server, the swap token
helps increase throughput.

However, if you have one app that needs more memory than
the system has and the rest of the apps are all "friendly",
then the swap token can help the system hog steal resources
from the other processes.

This needs to be fixed somehow, but I'm at a conference now
so I don't think I'll get around to it this week ;)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-30 20:33 ` Marcelo Tosatti
  2004-08-30 22:37 ` Andrew Morton
@ 2004-08-30 23:02 ` Karl Vogel
  1 sibling, 0 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-30 23:02 UTC
  To: Marcelo Tosatti; +Cc: Karl Vogel, Andrew Morton, Jens Axboe, wli, linux-mm

On Monday 30 August 2004 22:33, Marcelo Tosatti wrote:
> Can you please try the following - it limits the number of in-flight
> writeback pages to 25% of total RAM at the VM level.
>
> Does wonders for me with 8192 nr_requests. The hogs finish _much_ faster
> and interactivity feels much better.
>
> With nr_requests=128, this limit is not reached (probably never), but with
> 8192, it certainly does.
>
> --- a/mm/vmscan.c       2004-08-30 17:50:25.000000000 -0300
> +++ b/mm/vmscan.c       2004-08-30 18:34:54.666423368 -0300
> @@ -247,6 +247,12 @@
>
>  static int may_write_to_queue(struct backing_dev_info *bdi)
>  {
> +       int nr_writeback = read_page_state(nr_writeback);
> +
> +       if (nr_writeback > (totalram_pages * 25 / 100)) {
> +               blk_congestion_wait(WRITE, HZ/5);
> +               return 0;
> +       }
>         if (current_is_kswapd())
>                 return 1;
>         if (current_is_pdflush())       /* This is unlikely, but why not... */

This fixes the OOM for me.. I can do some more testing if needed tomorrow..

[kvo@localhost sources]$ cat /proc/meminfo
MemTotal:       515728 kB
MemFree:        445084 kB
Buffers:          9492 kB
Cached:          33268 kB
SwapCached:          0 kB
Active:          19748 kB
Inactive:        28716 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:        445084 kB
SwapTotal:     1044216 kB
SwapFree:      1044216 kB
Dirty:              84 kB
Writeback:           0 kB
Mapped:           8960 kB
Slab:            17284 kB
Committed_AS:     9544 kB
PageTables:        548 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

[kvo@localhost sources]$ date;./expunge 1024;date;time cat /proc/meminfo;date
Tue Aug 31 00:51:20 CEST 2004
Tue Aug 31 00:51:55 CEST 2004
MemTotal:       515728 kB
MemFree:        381036 kB
Buffers:           272 kB
Cached:           2844 kB
SwapCached:     120572 kB
Active:           2036 kB
Inactive:       121868 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:        381036 kB
SwapTotal:     1044216 kB
SwapFree:       919020 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:           1420 kB
Slab:             5932 kB
Committed_AS:     9764 kB
PageTables:        572 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

real    0m0.071s
user    0m0.000s
sys     0m0.000s
Tue Aug 31 00:51:55 CEST 2004

[kvo@localhost sources]$ date;./expunge 1024;date;time cat /proc/meminfo;date
Tue Aug 31 00:52:41 CEST 2004
Tue Aug 31 00:53:16 CEST 2004
MemTotal:       515728 kB
MemFree:        383832 kB
Buffers:           220 kB
Cached:           2792 kB
SwapCached:     117196 kB
Active:           1944 kB
Inactive:       118652 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       515728 kB
LowFree:        383832 kB
SwapTotal:     1044216 kB
SwapFree:       922316 kB
Dirty:               0 kB
Writeback:       16432 kB
Mapped:           1484 kB
Slab:             6328 kB
Committed_AS:     9768 kB
PageTables:        572 kB
VmallocTotal:   516020 kB
VmallocUsed:      2372 kB
VmallocChunk:   512624 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

real    0m0.328s
user    0m0.000s
sys     0m0.001s
Tue Aug 31 00:53:16 CEST 2004
[kvo@localhost sources]$ exit

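The 25% cap in the patch quoted above can be checked against the /proc/meminfo numbers from this message with a small user-space sketch. This is not kernel code: `kb_to_pages` and `writeback_over_limit` are hypothetical helper names, and the 4 kB page size is an assumption (the thread does not state it).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed page size in kB; the thread does not state it. */
#define PAGE_KB 4UL

/* Convert a /proc/meminfo value (kB) to a page count. */
unsigned long kb_to_pages(unsigned long kb)
{
    return kb / PAGE_KB;
}

/*
 * Same comparison as the quoted patch: true when the number of pages
 * under writeback exceeds `percent` of total RAM, which is the case
 * where the patched may_write_to_queue() waits and returns 0.
 */
bool writeback_over_limit(unsigned long nr_writeback_pages,
                          unsigned long totalram_pages,
                          unsigned long percent)
{
    assert(percent <= 100);
    return nr_writeback_pages > totalram_pages * percent / 100;
}
```

Under these assumptions, MemTotal of 515728 kB is 128932 pages, so the cap sits at 32233 pages (about 126 MB). The 16432 kB of Writeback in the last snapshot is 4108 pages, well under the cap.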
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 22:28 ` William Lee Irwin III
  2004-08-29 10:30 ` Andrew Morton
@ 2004-08-29 16:54 ` Jens Axboe
  2004-08-29 17:52 ` William Lee Irwin III
  1 sibling, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 16:54 UTC
  To: William Lee Irwin III, Andrew Morton, karl.vogel, linux-mm

On Sat, Aug 28 2004, William Lee Irwin III wrote:
> >> I was under the impression this had something to do with IO
> >> schedulers.
>
> On Sat, Aug 28, 2004 at 03:13:49PM -0700, Andrew Morton wrote:
> > Separate issue.
>
> It certainly appears to be the deciding factor from the thread.

Has nothing to do with the io scheduler itself, apart from the fact that
CFQ exposes the problem by setting a larger q->nr_requests. And that is
the very deciding factor, not the io scheduler.

--
Jens Axboe

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-29 16:54 ` Jens Axboe
@ 2004-08-29 17:52 ` William Lee Irwin III
  0 siblings, 0 replies; 35+ messages in thread
From: William Lee Irwin III @ 2004-08-29 17:52 UTC
  To: Jens Axboe; +Cc: Andrew Morton, karl.vogel, linux-mm

On Sat, Aug 28 2004, William Lee Irwin III wrote:
>> It certainly appears to be the deciding factor from the thread.

On Sun, Aug 29, 2004 at 06:54:59PM +0200, Jens Axboe wrote:
> Has nothing to do with the io scheduler itself, apart from the fact that
> CFQ exposes the problem by setting a larger q->nr_requests. And that is
> the very deciding factor, not the io scheduler.

Then it's narrower still, q->nr_requests. What a priori reasons are
there for this to vomit? clear_queue_congested() seems to be called
only when a request is retired, so a large number of requests in flight
may be doing something unexpected, and I'd expect large q->nr_requests
to keep large numbers of requests around. Hmm...

-- wli

* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 21:43 ` Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition Andrew Morton
  2004-08-28 21:54 ` William Lee Irwin III
@ 2004-08-28 21:59 ` Karl Vogel
  2004-08-29 16:53 ` Jens Axboe
  2 siblings, 0 replies; 35+ messages in thread
From: Karl Vogel @ 2004-08-28 21:59 UTC
  To: Andrew Morton; +Cc: Jens Axboe, linux-mm

Andrew Morton wrote:
>>So if I understand you correctly, CFQ shouldn't be using 8192 on 512MB
>>systems?!
>
>
> Yup. It's asking for trouble to allow that much memory to be unreclaimably
> pinned.
>
> Of course, you could have the same problem with just 128 requests per
> queue, and lots of queues. I solved all these problems in the dirty memory
> writeback paths. But I forgot about swapout!

Looks like the default 128 requests that AS is using also gives trouble
with 128MB systems.

I just tried booting with mem=128M and then doing:

  $ while true; do expunge 130; done

which causes the OOM to kick in too. Decreasing nr_requests resolves the
issue. So it's not only with lots of queues (AS only uses 1 queue, right?)

>>With overcommit_memory set to 1, the program can be run again after the
>>OOM kill.. but the OOM killing remains.
>>
>>With overcommit_memory set to 0 a second run fails. I 'think' it's
>>because somehow SwapCache is 500KB after the OOM, so in effect my system
>>doesn't have 1GB to spare anymore. Doing swapoff/swapon frees this and
>>then I can do the calloc(1GB) again.
>>
>>Another way to free the SwapCached is to generate lots of I/O doing 'dd
>>if=/dev/hda of=/dev/null' ... after a while SwapCached is < 1MB again.
>>
>
>
> urgh. It sounds like the overcommit logic forgot to account swapcache as
> reclaimable. It's been a ton of trouble, that code.

NOTE: this happens with overcommit_memory set to either 0 or 1.

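For readers who want to reproduce the test loop above: the source of `expunge` is not included in this part of the thread, so the following is only a hypothetical stand-in, based on the earlier mention of calloc(1GB). The function name `touch_megabytes` and the 4096-byte page size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for 'expunge <MB>': allocate `mb` megabytes with
 * calloc() and write one byte per page so each page really gets backed by
 * memory (or swap). Returns the number of pages touched, 0 on failure.
 */
size_t touch_megabytes(size_t mb)
{
    const size_t page = 4096;              /* assumed page size */
    size_t bytes, off, touched = 0;
    volatile unsigned char *buf;

    assert(mb <= SIZE_MAX / (1024 * 1024)); /* guard the multiplication */
    bytes = mb * 1024 * 1024;

    buf = calloc(1, bytes);
    if (buf == NULL)
        return 0;

    /* volatile keeps the compiler from optimising the writes away */
    for (off = 0; off < bytes; off += page) {
        buf[off] = 1;
        touched++;
    }

    free((void *)buf);
    return touched;
}
```

Calling `touch_megabytes(130)` on a machine booted with mem=128M would roughly correspond to one iteration of the loop in the message, under the assumptions stated.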
* Re: Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition
  2004-08-28 21:43 ` Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition Andrew Morton
  2004-08-28 21:54 ` William Lee Irwin III
  2004-08-28 21:59 ` Karl Vogel
@ 2004-08-29 16:53 ` Jens Axboe
  2 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2004-08-29 16:53 UTC
  To: Andrew Morton; +Cc: Karl Vogel, linux-mm

On Sat, Aug 28 2004, Andrew Morton wrote:
>
> (Added linux-mm)
>
> Karl Vogel <karl.vogel@pandora.be> wrote:
> >
> > Andrew Morton wrote:
> > > Karl Vogel <karl.vogel@seagha.com> wrote:
> > >
> > >>Further testing shows that all the schedulers exhibit this exact same
> > >> problem when run with a nr_requests size of 8192 on the drive hosting
> > >> the swap partition.
> > >>
> > >> I tried noop, deadline, as and CFQ with:
> > >>
> > >> echo 8192 >/sys/block/hda/queue/nr_requests
> > >
> > >
> > > That allows up to 2GB of memory to be under writeout at the same time. The
> > > VM cannot touch any of that memory.
> >
> > Well I used that value as it is the default for CFQ.. and it was with
> > CFQ that I had the problems. The patch Jens offered to track down the
> > problem, commented out this 'q->nr_requests = 8192' in CFQ and it
> > helped. Therefore I tried the other schedulers with this value to see if
> > it made a difference.
> >
> > So if I understand you correctly, CFQ shouldn't be using 8192 on 512MB
> > systems?!
>
> Yup. It's asking for trouble to allow that much memory to be unreclaimably
> pinned.

It's not pinned, it's in-progress. I think it's really bad behaviour to
_allow_ so much to be in-progress, if you can't handle it. It's silly to
expect the io scheduler to know this and limit it; it belongs at a
different level (the vm, where you have such knowledge).

> Of course, you could have the same problem with just 128 requests per
> queue, and lots of queues. I solved all these problems in the dirty memory
> writeback paths. But I forgot about swapout!

Precisely. Or 128 requests on a 16MB system. More proof that this is a
vm problem.

--
Jens Axboe

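The arithmetic behind "8192 requests allows up to 2GB under writeout" and "128 requests on a 16MB system" can be sketched as below. The per-request size is not given in the thread; 256 KiB is an assumption chosen only because it reproduces the quoted 2GB figure, and the real value depends on the device and driver limits. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed maximum bytes per request (256 KiB); not stated in the thread. */
#define ASSUMED_REQUEST_BYTES (256ULL * 1024)

/* Upper bound on memory that can be in flight on one queue. */
uint64_t max_inflight_bytes(uint64_t nr_requests, uint64_t request_bytes)
{
    return nr_requests * request_bytes;
}

/* That bound as an integer percentage of RAM; may exceed 100. */
uint64_t inflight_percent_of_ram(uint64_t nr_requests,
                                 uint64_t request_bytes,
                                 uint64_t ram_bytes)
{
    assert(ram_bytes > 0);
    return max_inflight_bytes(nr_requests, request_bytes) * 100 / ram_bytes;
}
```

With that assumed request size, 8192 requests come to 2 GiB (400% of a 512 MiB machine), while the default 128 requests come to 32 MiB, which is 25% of a 128 MiB machine and 200% of a 16 MiB one. That is consistent with the point made above that the limit has to come from the VM rather than from a fixed queue depth.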
end of thread, other threads:[~2004-09-02 9:05 UTC | newest]
Thread overview: 35+ messages
[not found] <20040824124356.GW2355@suse.de>
[not found] ` <412CDE7E.9060307@seagha.com>
[not found] ` <20040826144155.GH2912@suse.de>
[not found] ` <412E13DB.6040102@seagha.com>
[not found] ` <412E31EE.3090102@pandora.be>
[not found] ` <41308C62.7030904@seagha.com>
[not found] ` <20040828125028.2fa2a12b.akpm@osdl.org>
[not found] ` <4130F55A.90705@pandora.be>
2004-08-28 21:43 ` Kernel 2.6.8.1: swap storm of death - nr_requests > 1024 on swap partition Andrew Morton
2004-08-28 21:54 ` William Lee Irwin III
2004-08-28 22:13 ` Andrew Morton
2004-08-28 22:28 ` William Lee Irwin III
2004-08-29 10:30 ` Andrew Morton
2004-08-29 14:15 ` Jens Axboe
2004-08-29 14:17 ` Jens Axboe
2004-08-29 14:45 ` Rik van Riel
2004-08-29 20:18 ` Andrew Morton
2004-08-29 20:30 ` Jens Axboe
2004-08-29 20:59 ` Andrew Morton
2004-08-29 22:17 ` William Lee Irwin III
2004-08-29 22:28 ` Andrew Morton
2004-08-30 7:41 ` Hugh Dickins
2004-08-30 15:20 ` Marcelo Tosatti
2004-08-30 18:01 ` Karl Vogel
2004-08-30 17:16 ` Marcelo Tosatti
2004-08-30 22:59 ` Karl Vogel
2004-08-30 20:33 ` Marcelo Tosatti
2004-08-30 22:37 ` Andrew Morton
2004-08-30 22:17 ` Marcelo Tosatti
2004-08-30 23:51 ` Andrew Morton
2004-08-31 10:23 ` Marcelo Tosatti
2004-08-31 16:02 ` Marcelo Tosatti
2004-08-31 17:50 ` Karl Vogel
2004-08-31 16:52 ` Marcelo Tosatti
2004-08-31 18:24 ` Karl Vogel
2004-08-31 17:25 ` Marcelo Tosatti
2004-08-31 19:36 ` Karl Vogel
2004-09-02 9:05 ` Rik van Riel
2004-08-30 23:02 ` Karl Vogel
2004-08-29 16:54 ` Jens Axboe
2004-08-29 17:52 ` William Lee Irwin III
2004-08-28 21:59 ` Karl Vogel
2004-08-29 16:53 ` Jens Axboe