linux-mm.kvack.org archive mirror
* Memory pressure handling with iSCSI
@ 2005-07-26 17:35 Badari Pulavarty
  2005-07-26 18:04 ` Roland Dreier
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 17:35 UTC (permalink / raw)
  To: lkml, linux-mm; +Cc: akpm

[-- Attachment #1: Type: text/plain, Size: 419 bytes --]

Hi Andrew,

After KS & OLS discussions about memory pressure, I wanted to re-do
iSCSI testing with "dd"s to see if we are throttling writes.  

I created 50 10-GB ext3 filesystems on iSCSI luns. Test is simple
50 dds (one per filesystem). System seems to throttle memory properly
and making progress. (Machine doesn't respond very well for anything
else, but my vmstat keeps running - 100% sys time).
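The setup above can be sketched as a small generator script (the mount points are illustrative names, not the actual LUN mounts):

```shell
# Emit the 50 parallel dd commands (one per filesystem) for review;
# pipe run-dd-test.sh to sh to actually run them. /mnt/iscsi$i is an
# illustrative stand-in for the real iSCSI mount points.
for i in $(seq 1 50); do
    printf 'dd if=/dev/zero of=/mnt/iscsi%d/bigfile bs=1M count=10000 &\n' "$i"
done > run-dd-test.sh
echo wait >> run-dd-test.sh
wc -l run-dd-test.sh
```

(50 dd lines plus a final wait, so wc reports 51 lines.)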

Thanks,
Badari



[-- Attachment #2: vmstat.out --]
[-- Type: text/plain, Size: 1461 bytes --]

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
38 96  30500  43360  16612 6671064    2    0   103 11079 9860  2960  0 100  0  0
43 94  30500  43872  16704 6670460    0    0   124 11232 10993  3624  0 100  0  0
41 95  30500  44756  16780 6670304   22    0    41 11615 10864  3702  0 100  0  0
43 91  30500  43392  16580 6672096    6    0    11 10885 9736  2528  0 100  0  0
44 88  30500  43268  16468 6672204    6    0    14 12084 10361  1971  0 100  0  0
42 90  30500  43640  16556 6672116    0    0    26 12094 10447  3550  0 100  0  0
45 90  30500  46120  16584 6670016    6    0    22 11546 10690  3815  0 100  0  0
42 89  30500  43516  16560 6672564   11    0    48 12902 9368  3464  0 100  0  0
40 91  30500  43640  16572 6671540    6    0    87 10866 9253  2943  0 100  0  0
37 90  30500  43516  16608 6672040    6    0    25 14411 9374  2595  0 100  0  0
36 99  30500  43268  16568 6672080    0    0    23 14071 9524  2401  0 100  0  0
36 93  30500  43268  16596 6671504    6    0    16 11502 9403  3185  0 100  0  0
33 91  30500  43392  16588 6671540    0    0    11 10191 9837  3374  0 100  0  0
33 91  30500  43392  16552 6672092    0    0    15 11762 9703  2915  0 100  0  0
33 90  30500  43268  16648 6671480    0    0   131 11692 9784  3154  0 100  0  0
33 97  30500  43640  16640 6672004    0    0    18  9253 9491  1998  0 100  0  0



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-26 17:35 Memory pressure handling with iSCSI Badari Pulavarty
@ 2005-07-26 18:04 ` Roland Dreier
  2005-07-26 18:11 ` Andrew Morton
  2005-07-26 20:59 ` Rik van Riel
  2 siblings, 0 replies; 28+ messages in thread
From: Roland Dreier @ 2005-07-26 18:04 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: lkml, linux-mm, akpm

Thanks, this is a good test.  It would be interesting to know if the
system does eventually deadlock with less system memory or with even
more filesystems.

 - R.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>


* Re: Memory pressure handling with iSCSI
  2005-07-26 17:35 Memory pressure handling with iSCSI Badari Pulavarty
  2005-07-26 18:04 ` Roland Dreier
@ 2005-07-26 18:11 ` Andrew Morton
  2005-07-26 18:39   ` Badari Pulavarty
  2005-07-26 20:59 ` Rik van Riel
  2 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2005-07-26 18:11 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-kernel, linux-mm

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> After KS & OLS discussions about memory pressure, I wanted to re-do
>  iSCSI testing with "dd"s to see if we are throttling writes.  
> 
>  I created 50 10-GB ext3 filesystems on iSCSI luns. Test is simple
>  50 dds (one per filesystem). System seems to throttle memory properly
>  and making progress. (Machine doesn't respond very well for anything
>  else, but my vmstat keeps running - 100% sys time).

It's important to monitor /proc/meminfo too - the amount of dirty/writeback
pages, etc.

btw, 100% system time is quite appalling.  Are you sure vmstat is telling
the truth?  If so, where's it all being spent?
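A minimal watcher for the fields in question, run next to vmstat (field names as printed by 2.6-era kernels; interval and iteration count are arbitrary):

```shell
# Sample the dirty/writeback state from /proc/meminfo a few times.
for i in 1 2 3; do
    grep -E '^(MemFree|Dirty|Writeback):' /proc/meminfo
    echo ---
    sleep 1
done
```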


* Re: Memory pressure handling with iSCSI
  2005-07-26 18:11 ` Andrew Morton
@ 2005-07-26 18:39   ` Badari Pulavarty
  2005-07-26 18:48     ` Andrew Morton
  2005-07-26 19:31     ` Sonny Rao
  0 siblings, 2 replies; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 18:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm

On Tue, 2005-07-26 at 11:11 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > After KS & OLS discussions about memory pressure, I wanted to re-do
> >  iSCSI testing with "dd"s to see if we are throttling writes.  
> > 
> >  I created 50 10-GB ext3 filesystems on iSCSI luns. Test is simple
> >  50 dds (one per filesystem). System seems to throttle memory properly
> >  and making progress. (Machine doesn't respond very well for anything
> >  else, but my vmstat keeps running - 100% sys time).
> 
> It's important to monitor /proc/meminfo too - the amount of dirty/writeback
> pages, etc.
> 
> btw, 100% system time is quite appalling.  Are you sure vmstat is telling
> the truth?  If so, where's it all being spent?
> 
> 

Well, profile doesn't show any time in "default_idle". So
I believe, vmstat is telling the truth.

# cat /proc/meminfo
MemTotal:      7143628 kB
MemFree:         43252 kB
Buffers:         16736 kB
Cached:        6683348 kB
SwapCached:       5336 kB
Active:          14460 kB
Inactive:      6686928 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      7143628 kB
LowFree:         43252 kB
SwapTotal:     1048784 kB
SwapFree:      1017920 kB
Dirty:         6225664 kB
Writeback:      447272 kB
Mapped:          10460 kB
Slab:           362136 kB
CommitLimit:   4620596 kB
Committed_AS:   168616 kB
PageTables:       2452 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      9888 kB
VmallocChunk: 34359728447 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB



# echo 2 > /proc/profile; sleep 5; readprofile -m /usr/src/*12.3/System.map | sort -nr
1634737 total                                      0.5464
1468569 shrink_zone                              390.5769
 21203 unlock_page                              331.2969
 19497 release_pages                             46.8678
 19061 __wake_up_bit                            397.1042
 17936 page_referenced                           53.3810
 10679 lru_add_drain                            133.4875
  7348 page_waitqueue                            76.5417
  5877 tg3_poll                                   2.4007
  4650 cond_resched                              41.5179
  4476 copy_user_generic                         15.0201
  1973 do_get_write_access                        1.2583
  1858 __mod_page_state                          38.7083
  1754 tg3_start_xmit                             0.9876
  1348 journal_dirty_metadata                     2.1063
  1250 __find_get_block                           2.7902
  1224 journal_add_journal_head                   2.6379
  1082 kmem_cache_free                           11.2708
  1077 tcp_sendpage                               0.3580
  1076 tcp_ack                                    0.1431
  1075 __make_request                             0.7999
  1035 tg3_interrupt_tagged                       2.5875
  1022 __pagevec_lru_add                          4.5625
   928 tcp_transmit_skb                           0.4677
   924 kmem_cache_alloc                          14.4375
   900 thread_return                              3.5294
   819 __ext3_get_inode_loc                       0.9307
   754 established_get_next                       2.2440
   711 journal_cancel_revoke                      1.4335
   684 file_send_actor                            7.1250


Thanks,
Badari


* Re: Memory pressure handling with iSCSI
  2005-07-26 18:39   ` Badari Pulavarty
@ 2005-07-26 18:48     ` Andrew Morton
  2005-07-26 19:12       ` Andrew Morton
  2005-07-26 19:31     ` Sonny Rao
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2005-07-26 18:48 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-kernel, linux-mm

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> On Tue, 2005-07-26 at 11:11 -0700, Andrew Morton wrote:
> > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> > >
> > > After KS & OLS discussions about memory pressure, I wanted to re-do
> > >  iSCSI testing with "dd"s to see if we are throttling writes.  
> > > 
> > >  I created 50 10-GB ext3 filesystems on iSCSI luns. Test is simple
> > >  50 dds (one per filesystem). System seems to throttle memory properly
> > >  and making progress. (Machine doesn't respond very well for anything
> > >  else, but my vmstat keeps running - 100% sys time).
> > 
> > It's important to monitor /proc/meminfo too - the amount of dirty/writeback
> > pages, etc.
> > 
> > btw, 100% system time is quite appalling.  Are you sure vmstat is telling
> > the truth?  If so, where's it all being spent?
> > 
> > 
> 
> Well, profile doesn't show any time in "default_idle". So
> I believe, vmstat is telling the truth.
> 
> # cat /proc/meminfo
> MemTotal:      7143628 kB
> MemFree:         43252 kB
> Buffers:         16736 kB
> Cached:        6683348 kB
> SwapCached:       5336 kB
> Active:          14460 kB
> Inactive:      6686928 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:      7143628 kB
> LowFree:         43252 kB
> SwapTotal:     1048784 kB
> SwapFree:      1017920 kB
> Dirty:         6225664 kB
> Writeback:      447272 kB
> Mapped:          10460 kB
> Slab:           362136 kB
> CommitLimit:   4620596 kB
> Committed_AS:   168616 kB
> PageTables:       2452 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      9888 kB
> VmallocChunk: 34359728447 kB
> HugePages_Total:     0
> HugePages_Free:      0
> Hugepagesize:     2048 kB
> 

That is extremely wrong.  dirty memory is *way* too high.

> 
> # echo 2 > /proc/profile; sleep 5; readprofile -m /usr/src/*12.3/System.map | sort -nr
> 1634737 total                                      0.5464
> 1468569 shrink_zone                              390.5769
>  21203 unlock_page                              331.2969
>  19497 release_pages                             46.8678
>  19061 __wake_up_bit                            397.1042
>  17936 page_referenced                           53.3810
>  10679 lru_add_drain                            133.4875

And so page reclaim has gone crazy.

We need to work out why the dirty memory levels are so high.

Can you please reduce the number of filesystems, see if that reduces the
dirty levels?
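For scale, a back-of-envelope check on the meminfo numbers quoted above (the 40% figure is the default vm.dirty_ratio on kernels of this vintage, from memory -- treat it as an assumption):

```shell
# Dirty pages as a percentage of total RAM, using the values posted
# above; a working clamp should keep this near vm.dirty_ratio.
awk '/^(MemTotal|Dirty):/ { v[$1] = $2 }
     END { printf "%.1f%% of RAM dirty\n", 100 * v["Dirty:"] / v["MemTotal:"] }' <<'EOF'
MemTotal:      7143628 kB
Dirty:         6225664 kB
EOF
```

That works out to roughly 87% of RAM dirty, more than double the expected ceiling.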


* Re: Memory pressure handling with iSCSI
  2005-07-26 18:48     ` Andrew Morton
@ 2005-07-26 19:12       ` Andrew Morton
  2005-07-26 20:36         ` Badari Pulavarty
  2005-07-26 21:11         ` Badari Pulavarty
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2005-07-26 19:12 UTC (permalink / raw)
  To: pbadari, linux-kernel, linux-mm

Andrew Morton <akpm@osdl.org> wrote:
>
> Can you please reduce the number of filesystems, see if that reduces the
>  dirty levels?

Also, it's conceivable that ext3 is implicated here, so it might be saner
to perform initial investigation on ext2.

(when kjournald writes back a page via its buffers, the page remains
"dirty" as far as the VFS is concerned.  Later, someone tries to do a
writepage() on it and we'll discover the buffers' cleanness and the page
will be cleaned without any I/O being performed.  All the throttling
_should_ work OK in this case.  But ext2 is more straightforward.)

* Re: Memory pressure handling with iSCSI
  2005-07-26 18:39   ` Badari Pulavarty
  2005-07-26 18:48     ` Andrew Morton
@ 2005-07-26 19:31     ` Sonny Rao
  2005-07-26 20:37       ` Badari Pulavarty
  1 sibling, 1 reply; 28+ messages in thread
From: Sonny Rao @ 2005-07-26 19:31 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Andrew Morton, lkml, linux-mm

On Tue, Jul 26, 2005 at 11:39:11AM -0700, Badari Pulavarty wrote:
> On Tue, 2005-07-26 at 11:11 -0700, Andrew Morton wrote:
> > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> > >
> > > After KS & OLS discussions about memory pressure, I wanted to re-do
> > >  iSCSI testing with "dd"s to see if we are throttling writes.  
> > > 
> > >  I created 50 10-GB ext3 filesystems on iSCSI luns. Test is simple
> > >  50 dds (one per filesystem). System seems to throttle memory properly
> > >  and making progress. (Machine doesn't respond very well for anything
> > >  else, but my vmstat keeps running - 100% sys time).
> > 
> > It's important to monitor /proc/meminfo too - the amount of dirty/writeback
> > pages, etc.
> > 
> > btw, 100% system time is quite appalling.  Are you sure vmstat is telling
> > the truth?  If so, where's it all being spent?
> > 
> > 
> 
> Well, profile doesn't show any time in "default_idle". So
> I believe, vmstat is telling the truth.

Badari,

You probably covered this, but just to make sure, if you're on a
pentium4 machine, I usually boot w/ "idle=poll" to see proper idle
reporting because otherwise the chip will throttle itself back and
idle time will be skewed -- at least on oprofile.

Sonny

* Re: Memory pressure handling with iSCSI
  2005-07-26 19:12       ` Andrew Morton
@ 2005-07-26 20:36         ` Badari Pulavarty
  2005-07-26 21:11         ` Badari Pulavarty
  1 sibling, 0 replies; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 20:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm

On Tue, 2005-07-26 at 12:12 -0700, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > Can you please reduce the number of filesystems, see if that reduces the
> >  dirty levels?
> 
> Also, it's conceivable that ext3 is implicated here, so it might be saner
> to perform initial investigation on ext2.
> 
> (when kjournald writes back a page via its buffers, the page remains
> "dirty" as far as the VFS is concerned.  Later, someone tries to do a
> writepage() on it and we'll discover the buffers' cleanness and the page
> will be cleaned without any I/O being performed.  All the throttling
> _should_ work OK in this case.  But ext2 is more straightforward.)

I will try ext2 next.

- Badari


* Re: Memory pressure handling with iSCSI
  2005-07-26 19:31     ` Sonny Rao
@ 2005-07-26 20:37       ` Badari Pulavarty
  2005-07-26 21:21         ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 20:37 UTC (permalink / raw)
  To: Sonny Rao; +Cc: Andrew Morton, lkml, linux-mm

On Tue, 2005-07-26 at 15:31 -0400, Sonny Rao wrote:
> On Tue, Jul 26, 2005 at 11:39:11AM -0700, Badari Pulavarty wrote:
> > On Tue, 2005-07-26 at 11:11 -0700, Andrew Morton wrote:
> > > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> > > >
> > > > After KS & OLS discussions about memory pressure, I wanted to re-do
> > > >  iSCSI testing with "dd"s to see if we are throttling writes.  
> > > > 
> > > >  I created 50 10-GB ext3 filesystems on iSCSI luns. Test is simple
> > > >  50 dds (one per filesystem). System seems to throttle memory properly
> > > >  and making progress. (Machine doesn't respond very well for anything
> > > >  else, but my vmstat keeps running - 100% sys time).
> > > 
> > > It's important to monitor /proc/meminfo too - the amount of dirty/writeback
> > > pages, etc.
> > > 
> > > btw, 100% system time is quite appalling.  Are you sure vmstat is telling
> > > the truth?  If so, where's it all being spent?
> > > 
> > > 
> > 
> > Well, profile doesn't show any time in "default_idle". So
> > I believe, vmstat is telling the truth.
> 
> Badari,
> 
> You probably covered this, but just to make sure, if you're on a
> pentium4 machine, I usually boot w/ "idle=poll" to see proper idle
> reporting because otherwise the chip will throttle itself back and
> idle time will be skewed -- at least on oprofile.
> 

My machine is AMD64.

- Badari


* Re: Memory pressure handling with iSCSI
  2005-07-26 17:35 Memory pressure handling with iSCSI Badari Pulavarty
  2005-07-26 18:04 ` Roland Dreier
  2005-07-26 18:11 ` Andrew Morton
@ 2005-07-26 20:59 ` Rik van Riel
  2005-07-26 21:05   ` Badari Pulavarty
  2005-07-26 21:12   ` Andrew Morton
  2 siblings, 2 replies; 28+ messages in thread
From: Rik van Riel @ 2005-07-26 20:59 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: lkml, linux-mm, Andrew Morton

On Tue, 26 Jul 2005, Badari Pulavarty wrote:

> After KS & OLS discussions about memory pressure, I wanted to re-do
> iSCSI testing with "dd"s to see if we are throttling writes.  

Could you also try with shared writable mmap, to see if that
works ok or triggers a deadlock ?

-- 
The Theory of Escalating Commitment: "The cost of continuing mistakes is
borne by others, while the cost of admitting mistakes is borne by yourself."
  -- Joseph Stiglitz, Nobel Laureate in Economics

* Re: Memory pressure handling with iSCSI
  2005-07-26 20:59 ` Rik van Riel
@ 2005-07-26 21:05   ` Badari Pulavarty
  2005-07-26 21:33     ` Martin J. Bligh
  2005-07-26 21:12   ` Andrew Morton
  1 sibling, 1 reply; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 21:05 UTC (permalink / raw)
  To: Rik van Riel; +Cc: lkml, linux-mm, Andrew Morton

On Tue, 2005-07-26 at 16:59 -0400, Rik van Riel wrote:
> On Tue, 26 Jul 2005, Badari Pulavarty wrote:
> 
> > After KS & OLS discussions about memory pressure, I wanted to re-do
> > iSCSI testing with "dd"s to see if we are throttling writes.  
> 
> Could you also try with shared writable mmap, to see if that
> works ok or triggers a deadlock ?


I can, but let's finish addressing one issue at a time. Last time,
I changed too many things at the same time and got nowhere :(

Thanks,
Badari


* Re: Memory pressure handling with iSCSI
  2005-07-26 19:12       ` Andrew Morton
  2005-07-26 20:36         ` Badari Pulavarty
@ 2005-07-26 21:11         ` Badari Pulavarty
  2005-07-26 21:24           ` Andrew Morton
  1 sibling, 1 reply; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 21:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2668 bytes --]

On Tue, 2005-07-26 at 12:12 -0700, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > Can you please reduce the number of filesystems, see if that reduces the
> >  dirty levels?
> 
> Also, it's conceivable that ext3 is implicated here, so it might be saner
> to perform initial investigation on ext2.
> 
> (when kjournald writes back a page via its buffers, the page remains
> "dirty" as far as the VFS is concerned.  Later, someone tries to do a
> writepage() on it and we'll discover the buffers' cleanness and the page
> will be cleaned without any I/O being performed.  All the throttling
> _should_ work OK in this case.  But ext2 is more straightforward.)

ext2 is incredibly better. Machine is very responsive. 


# echo 2 > /proc/profile; sleep 5; readprofile -m /usr/src/*12.3/System.map | sort -nr
 28671 total                                      0.0096
 25024 default_idle                             521.3333
  1987 shrink_zone                                0.5285
   163 tg3_poll                                   0.0666
   154 unlock_page                                2.4062
   113 page_referenced                            0.3363
   106 copy_user_generic                          0.3557
    98 __wake_up_bit                              2.0417
    74 release_pages                              0.1779
    71 page_waitqueue                             0.7396
    51 tg3_start_xmit                             0.0287
    39 __make_request                             0.0290
    36 tcp_ack                                    0.0048
    30 tcp_sendpage                               0.0100
    30 scsi_request_fn                            0.0260
    28 tg3_interrupt_tagged                       0.0700
    27 kmem_cache_alloc                           0.4219
    23 kmem_cache_free                            0.2396
    22 rotate_reclaimable_page                    0.0859
    20 established_get_next                       0.0595
    20 cond_resched                               0.1786
    20 __mod_page_state                           0.4167
    16 tcp_transmit_skb                           0.0081
    15 memset                                     0.0781
    15 __kfree_skb                                0.0521
    14 tcp_write_xmit                             0.0194
    14 handle_IRQ_event                           0.1458
    12 skb_clone                                  0.0214
    12 kfree                                      0.0500
    12 end_buffer_async_write                     0.0469
    11 tcp_v4_rcv                                 0.0041
    10 test_set_page_writeback                    0.0329


Thanks,
Badari


[-- Attachment #2: vmstat-ext2.out --]
[-- Type: text/plain, Size: 1375 bytes --]

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1 56      4  33372  12512 6794560    0    0   142  1451 10283  1632  0  7  0 93
 0 56      4  35488  12496 6791996    0    0   131  1762 10335  1583  0  3  0 96
 0 56      4  33132  12540 6794532    0    0     1  1320 10228  2082  0  4  0 96
 0 56      4  33132  12684 6794388    0    0    35  2054 10414  1973  0  7  0 93
 0 56      4  33380  12712 6794876    0    0     0  2676 10635  2739  0  6  0 94
 0 56      4  33132  12672 6793368    0    0     2  6799 10240  2617  0 10  0 90
 0 56      4  33132  12608 6793948    0    0     0 10525 10249  2945  0 10  0 90
 2 56      4  33380  12528 6792996    0    0     1 12566 11081  2813  0 12  0 88
 1 55      4  33380  12368 6793672    0    0     1  9206 10237  2608  0 13  0 87
 0 56      4  33132  12176 6793348    0    0     0 10939 10156  2744  0 17  0 83
 2 59      4  33256  12060 6794496    0    0     5 11706 10464  2746  0 15  0 85
 0 56      4  33504  11844 6794196    0    0     0 12196 10525  2835  0 17  0 83
 0 56      4  33504  11592 6795480    0    0     0  8656 10463  2692  0 10  0 90
 0 56      4  33132  11492 6796612    0    0     1  9022 10222  2496  0 11  0 89
 2 55      4  33256  11384 6796720    0    0     0  9661 10830  2813  0  9  0 91




* Re: Memory pressure handling with iSCSI
  2005-07-26 20:59 ` Rik van Riel
  2005-07-26 21:05   ` Badari Pulavarty
@ 2005-07-26 21:12   ` Andrew Morton
  1 sibling, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2005-07-26 21:12 UTC (permalink / raw)
  To: Rik van Riel; +Cc: pbadari, linux-kernel, linux-mm

Rik van Riel <riel@redhat.com> wrote:
>
> On Tue, 26 Jul 2005, Badari Pulavarty wrote:
> 
> > After KS & OLS discussions about memory pressure, I wanted to re-do
> > iSCSI testing with "dd"s to see if we are throttling writes.  
> 
> Could you also try with shared writable mmap, to see if that
> works ok or triggers a deadlock ?
> 

That'll cause problems for sure, but we need to get `dd' right first :(

* Re: Memory pressure handling with iSCSI
  2005-07-26 20:37       ` Badari Pulavarty
@ 2005-07-26 21:21         ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2005-07-26 21:21 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: sonny, linux-kernel, linux-mm

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> > You probably covered this, but just to make sure, if you're on a
> > pentium4 machine, I usually boot w/ "idle=poll" to see proper idle
> > reporting because otherwise the chip will throttle itself back and
> > idle time will be skewed -- at least on oprofile.
> > 
> 
> My machine is AMD64.

I'd expect the problem to which Sonny refers will occur on many
architectures.

IIRC, the problem is that many (or all) of the counters which oprofile uses
are turned off when the CPU does a halt.  So the profiler ends up thinking
that zero time is spent in the idle handler.  The net effect is that if
your workload spends 90% of its time idle then all the other profiler hits
are exaggerated by a factor of ten.  Making the CPU busywait in idle()
fixes this.

But you're using the old /proc/profile profiler which uses a free-running
timer which doesn't get stopped by halt, so it is unaffected by this.

* Re: Memory pressure handling with iSCSI
  2005-07-26 21:11         ` Badari Pulavarty
@ 2005-07-26 21:24           ` Andrew Morton
  2005-07-26 21:45             ` Badari Pulavarty
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2005-07-26 21:24 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-kernel, linux-mm

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> ext2 is incredibly better. Machine is very responsive. 
> 

OK.  Please, always monitor and send /proc/meminfo.  I assume that the
dirty-memory clamping is working OK with ext2 and that perhaps it'll work
OK with ext3/data=writeback.

All very odd.  I wonder how to reproduce this.  Maybe 50 ext3 filesystems
on regular old scsi will do it?
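One cheap stand-in, if real scsi spindles are scarce: loop-mounted sparse image files (loopback I/O behaves differently from real disks, so this is only an approximation; paths and sizes are illustrative, and the mkfs/mount steps need root). The sketch below only emits the setup commands for review:

```shell
# Generate setup commands for 50 small loop-mounted ext3 filesystems.
for i in $(seq 1 50); do
    cat <<EOF
dd if=/dev/zero of=/var/tmp/img$i bs=1M count=1 seek=1023
mkfs.ext3 -q -F /var/tmp/img$i
mkdir -p /mnt/loop$i
mount -o loop /var/tmp/img$i /mnt/loop$i
EOF
done > setup-loop-fs.sh
wc -l setup-loop-fs.sh
```

(Four commands per filesystem, so 200 lines; run the dd workload against /mnt/loop1..50 afterwards.)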

* Re: Memory pressure handling with iSCSI
  2005-07-26 21:05   ` Badari Pulavarty
@ 2005-07-26 21:33     ` Martin J. Bligh
  2005-07-26 22:05       ` Adam Litke
  0 siblings, 1 reply; 28+ messages in thread
From: Martin J. Bligh @ 2005-07-26 21:33 UTC (permalink / raw)
  To: Badari Pulavarty, Rik van Riel, agl; +Cc: lkml, linux-mm, Andrew Morton

>> > After KS & OLS discussions about memory pressure, I wanted to re-do
>> > iSCSI testing with "dd"s to see if we are throttling writes.  
>> 
>> Could you also try with shared writable mmap, to see if that
>> works ok or triggers a deadlock ?
> 
> 
> I can, but let's finish addressing one issue at a time. Last time,
> I changed too many things at the same time and got nowhere :(

Adam is working that one, but not over iSCSI.

M.


* Re: Memory pressure handling with iSCSI
  2005-07-26 21:24           ` Andrew Morton
@ 2005-07-26 21:45             ` Badari Pulavarty
  2005-07-26 22:10               ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 21:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm

On Tue, 2005-07-26 at 14:24 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > ext2 is incredibly better. Machine is very responsive. 
> > 
> 
> OK.  Please, always monitor and send /proc/meminfo.  I assume that the
> dirty-memory clamping is working OK with ext2 and that perhaps it'll work
> OK with ext3/data=writeback.

Nope. Dirty is still very high..

# cat /proc/meminfo
MemTotal:      7143628 kB
MemFree:         33248 kB
Buffers:          8368 kB
Cached:        6789932 kB
SwapCached:          0 kB
Active:          51316 kB
Inactive:      6769144 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      7143628 kB
LowFree:         33248 kB
SwapTotal:     1048784 kB
SwapFree:      1048780 kB
Dirty:         6605704 kB
Writeback:      168452 kB
Mapped:          49724 kB
Slab:           252200 kB
CommitLimit:   4620596 kB
Committed_AS:   163524 kB
PageTables:       2284 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      9888 kB
VmallocChunk: 34359728447 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

Thanks,
Badari


* Re: Memory pressure handling with iSCSI
  2005-07-26 21:33     ` Martin J. Bligh
@ 2005-07-26 22:05       ` Adam Litke
  0 siblings, 0 replies; 28+ messages in thread
From: Adam Litke @ 2005-07-26 22:05 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Badari Pulavarty [imap], Rik van Riel, lkml, linux-mm, Andrew Morton

On Tue, 2005-07-26 at 16:33, Martin J. Bligh wrote:
> >> > After KS & OLS discussions about memory pressure, I wanted to re-do
> >> > iSCSI testing with "dd"s to see if we are throttling writes.  
> >> 
> >> Could you also try with shared writable mmap, to see if that
> >> works ok or triggers a deadlock ?
> > 
> > 
> > I can, but let's finish addressing one issue at a time. Last time,
> > I changed too many things at the same time and got nowhere :(
> 
> Adam is working that one, but not over iSCSI.

I wrote a simple/ugly C program to demonstrate the MAP_SHARED,PROT_WRITE
case.  I was able to saturate the system with 75% of all memory in dirty
pages before I got bored.

To reproduce:
- Create a 3GB file with dd
- ./map-shared-dirty bigfile <number of chunks>

I break up the mmap & dirty operation into chunks in case the system is
tight on memory.  Choose a large enough number of chunks so that the
individual mmaps will be small enough for your system to accommodate.

-- 

MemTotal:      4092492 kB
MemFree:        786988 kB
Buffers:          6372 kB
Cached:        3211388 kB
SwapCached:          0 kB
Active:        3197428 kB
Inactive:        36696 kB
HighTotal:     3211264 kB
HighFree:         1024 kB
LowTotal:       881228 kB
LowFree:        785964 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:         3117300 kB
Writeback:        3568 kB
Mapped:          24780 kB
Slab:            59316 kB
Committed_AS:    49760 kB
PageTables:        780 kB
VmallocTotal:   114680 kB
VmallocUsed:        32 kB
VmallocChunk:   114648 kB

/*
 * map-shared-dirty.c - Demonstrate a loophole in dirty-ratio when 
 * heavily dirtying MAP_SHARED memory.
 *
 * Usage: (I know it's ugly)
 * ./map-shared-dirty <large file> <number of chunks>
 */

#include <string.h>
#include <stdlib.h>	/* exit(), atoi() */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>

size_t page_size;

void dirty_file(int fd, size_t bytes, off_t map_offset) {
	char *addr;
	
	addr = mmap(NULL, bytes, PROT_READ|PROT_WRITE, MAP_SHARED, fd, map_offset);
	if (addr == MAP_FAILED) {
		fprintf(stderr, "Failed to map file\n");
		fprintf(stderr, "bytes: %lu offset: %lu\n",
			(unsigned long)bytes, (unsigned long)map_offset);
		exit(1);
	}
	
	/* Dirty the pages */
	memset(addr, map_offset % 255, bytes);

	munmap(addr, bytes);
}

int main(int argc, char **argv)
{
	char *filename;
	int chunks;
	int fd;
	unsigned long i, chunk_size, bytes;
	struct stat file_info;

	if (argc != 3) {
		fprintf(stderr, "Usage: %s <large file> <number of chunks>\n",
			argv[0]);
		exit(1);
	}
	filename = argv[1];
	chunks = atoi(argv[2]);

	fd = open(filename, O_RDWR|0100000); /* 0100000 == O_LARGEFILE */
	if (fd < 0) {
		fprintf(stderr, "Failed to open file\n");
		exit(1);
	}
	fstat(fd, &file_info);
	bytes = file_info.st_size;
	
	page_size = getpagesize();
	chunk_size = (bytes / chunks) & ~(page_size - 1); /* page-align */
	if (chunk_size == 0)	/* avoid an infinite loop for huge chunk counts */
		chunk_size = page_size;
	printf("Chunk size = %lu\n", chunk_size);
	for (i = 0; i < bytes; i += chunk_size)
		dirty_file(fd, chunk_size, i);
	
	exit(0);
}


-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-26 21:45             ` Badari Pulavarty
@ 2005-07-26 22:10               ` Andrew Morton
  2005-07-26 22:48                 ` Badari Pulavarty
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2005-07-26 22:10 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-kernel, linux-mm

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> On Tue, 2005-07-26 at 14:24 -0700, Andrew Morton wrote:
> > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> > >
> > > ext2 is incredibly better. Machine is very responsive. 
> > > 
> > 
> > OK.  Please, always monitor and send /proc/meminfo.  I assume that the
> > dirty-memory clamping is working OK with ext2 and that perhaps it'll work
> > OK with ext3/data=writeback.
> 
> Nope. Dirty is still very high..

That's a relief in a way.  Can you please try decreasing the number of
filesystems now?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-26 22:10               ` Andrew Morton
@ 2005-07-26 22:48                 ` Badari Pulavarty
  2005-07-26 23:07                   ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 22:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm

[-- Attachment #1: Type: text/plain, Size: 887 bytes --]

On Tue, 2005-07-26 at 15:10 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > On Tue, 2005-07-26 at 14:24 -0700, Andrew Morton wrote:
> > > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> > > >
> > > > ext2 is incredibly better. Machine is very responsive. 
> > > > 
> > > 
> > > OK.  Please, always monitor and send /proc/meminfo.  I assume that the
> > > dirty-memory clamping is working OK with ext2 and that perhaps it'll work
> > > OK with ext3/data=writeback.
> > 
> > Nope. Dirty is still very high..
> 
> That's a relief in a way.  Can you please try decreasing the number of
> filesystems now?

Here is the data with 5 ext2 filesystems. I also collected /proc/meminfo
every 5 seconds. As you can see, we seem to dirty 6GB of data within 20
seconds of starting the test. I am not sure if it's bad, since we have
lots of free memory..

Thanks,
Badari



[-- Attachment #2: vmstat-5-ext2.out --]
[-- Type: text/plain, Size: 969 bytes --]

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2 11    120  32912  10624 6813476    0    0     0  6766 10364   485  0  4  0 96
 0 11    120  33036  10652 6813964    0    0     2  8889 10079   475  0  4  0 96
 0 11    120  33036  10712 6813904    0    0     0  8077 9984   469  0  4  0 96
 0 11    120  32912  10752 6814380    0    0     0 15576 10226   514  0  4  0 95
 0 11    120  33036  10668 6813432    0    0     0 11334 10112   488  0  4  0 96
 0 11    120  33656  10600 6813500    0    0     0 11811 10238   497  0  4  0 96
 0 11    120  33036  10596 6814020    0    0     0 12713 10191   489  0  4  0 96
 0 11    120  33036  10648 6813968    0    0     1 15775 10195   508  0  4  0 96
 0 10    120  33780  10656 6812928    0    0     2  5390 10265   503  0  3  5 92
 0 11    120  33036  10660 6813440    0    0     0  9700 10217   518  0  4  2 94



[-- Attachment #3: meminfo.out --]
[-- Type: text/plain, Size: 3384 bytes --]

MemTotal:      7143628 kB
MemFree:       7001860 kB
Buffers:          5080 kB
Cached:          23300 kB
SwapCached:          0 kB
Active:          48600 kB
Inactive:         5872 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      7143628 kB
LowFree:       7001860 kB
SwapTotal:     1048784 kB
SwapFree:      1048780 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:          45948 kB
Slab:            56348 kB
CommitLimit:   4620596 kB
Committed_AS:   148436 kB
PageTables:       1544 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      9888 kB
VmallocChunk: 34359728447 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

MemTotal:      7143628 kB
MemFree:       4871864 kB
Buffers:         14564 kB
Cached:        2091232 kB
SwapCached:          0 kB
Active:          51380 kB
Inactive:      2081780 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      7143628 kB
LowFree:       4871864 kB
SwapTotal:     1048784 kB
SwapFree:      1048780 kB
Dirty:         2070752 kB
Writeback:           0 kB
Mapped:          46368 kB
Slab:           107912 kB
CommitLimit:   4620596 kB
Committed_AS:   148524 kB
PageTables:       1608 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      9888 kB
VmallocChunk: 34359728447 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

MemTotal:      7143628 kB
MemFree:        406384 kB
Buffers:         18940 kB
Cached:        6443960 kB
SwapCached:          0 kB
Active:          55688 kB
Inactive:      6435048 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      7143628 kB
LowFree:        406384 kB
SwapTotal:     1048784 kB
SwapFree:      1048780 kB
Dirty:         6144652 kB
Writeback:      252152 kB
Mapped:          46380 kB
Slab:           216580 kB
CommitLimit:   4620596 kB
Committed_AS:   148756 kB
PageTables:       1608 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      9888 kB
VmallocChunk: 34359728447 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

MemTotal:      7143628 kB
MemFree:         32772 kB
Buffers:         10028 kB
Cached:        6817680 kB
SwapCached:          4 kB
Active:          48180 kB
Inactive:      6804552 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      7143628 kB
LowFree:         32772 kB
SwapTotal:     1048784 kB
SwapFree:      1048664 kB
Dirty:         6489496 kB
Writeback:      285264 kB
Mapped:          46000 kB
Slab:           228172 kB
CommitLimit:   4620596 kB
Committed_AS:   148756 kB
PageTables:       1608 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      9888 kB
VmallocChunk: 34359728447 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

MemTotal:      7143628 kB
MemFree:         32524 kB
Buffers:         10056 kB
Cached:        6816620 kB
SwapCached:          4 kB
Active:          48672 kB
Inactive:      6803212 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      7143628 kB
LowFree:         32524 kB
SwapTotal:     1048784 kB
SwapFree:      1048664 kB
Dirty:         6465124 kB
Writeback:      268876 kB
Mapped:          46008 kB
Slab:           229580 kB
CommitLimit:   4620596 kB
Committed_AS:   148996 kB
PageTables:       1608 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      9888 kB
VmallocChunk: 34359728447 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-26 22:48                 ` Badari Pulavarty
@ 2005-07-26 23:07                   ` Andrew Morton
  2005-07-26 23:26                     ` Badari Pulavarty
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2005-07-26 23:07 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-kernel, linux-mm

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Here is the data with 5 ext2 filesystems. I also collected /proc/meminfo
> every 5 seconds. As you can see, we seem to dirty 6GB of data in 20
> seconds of starting the test. I am not sure if its bad, since we have
> lots of free memory..

It's bad.  The logic in balance_dirty_pages() should block those write()
callers as soon as we hit 40% dirty memory or whatever is in
/proc/sys/vm/dirty_ratio.  So something is horridly busted.

Can you try reducing the number of filesystems even further?

Either the underlying block driver is doing something most bizarre to the
VFS or something has gone wrong with the arithmetic in page-writeback.c. 
If total_pages or ratelimit_pages are totally wrong or if
get_dirty_limits() is returning junk then we'd be seeing something like
this.

It'll be something simple - if you have time, stick some printks in
balance_dirty_pages() and work out why it is not remaining in that `for' loop
until dirty memory has fallen below the 40% limit.

I'll take a shot at reproducing this on my 4G x86_64 box, but this is so
grossly wrong that I'm sure it would have been noted before now if it was
commonly happening (famous last words).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-26 23:07                   ` Andrew Morton
@ 2005-07-26 23:26                     ` Badari Pulavarty
  2005-07-27  0:31                       ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-26 23:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm

On Tue, 2005-07-26 at 16:07 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > Here is the data with 5 ext2 filesystems. I also collected /proc/meminfo
> > every 5 seconds. As you can see, we seem to dirty 6GB of data in 20
> > seconds of starting the test. I am not sure if its bad, since we have
> > lots of free memory..
> 
> It's bad.  The logic in balance_dirty_pages() should block those write()
> callers as soon as we hit 40% dirty memory or whatever is in
> /proc/sys/vm/dirty_ratio.  So something is horridly busted.
> 
> Can you try reducing the number of filesystems even further?

Single ext2 filesystem. We still dirty pretty quickly (data collected
every 5 seconds).

 # grep Dirty OUT
Dirty:             312 kB
Dirty:         1121852 kB
Dirty:         2896952 kB
Dirty:         4344564 kB
Dirty:         5310856 kB
Dirty:         5507812 kB
Dirty:         5714884 kB
Dirty:         5865132 kB
Dirty:         6004276 kB
Dirty:         6206544 kB
Dirty:         6380524 kB
Dirty:         6583200 kB
Dirty:         6727296 kB
Dirty:         6708564 kB
Dirty:         6733768 kB
Dirty:         6737868 kB

Thanks,
Badari


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-26 23:26                     ` Badari Pulavarty
@ 2005-07-27  0:31                       ` Andrew Morton
  2005-07-27  1:20                         ` Martin J. Bligh
  2005-07-27  1:31                         ` Badari Pulavarty
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2005-07-27  0:31 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-kernel, linux-mm

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> On Tue, 2005-07-26 at 16:07 -0700, Andrew Morton wrote:
>  > Badari Pulavarty <pbadari@us.ibm.com> wrote:
>  > >
>  > > Here is the data with 5 ext2 filesystems. I also collected /proc/meminfo
>  > > every 5 seconds. As you can see, we seem to dirty 6GB of data in 20
>  > > seconds of starting the test. I am not sure if its bad, since we have
>  > > lots of free memory..
>  > 
>  > It's bad.  The logic in balance_dirty_pages() should block those write()
>  > callers as soon as we hit 40% dirty memory or whatever is in
>  > /proc/sys/vm/dirty_ratio.  So something is horridly busted.
>  > 
>  > Can you try reducing the number of filesystems even further?
> 
>  Single ext2 filesystem. We still dirty pretty quickly (data collected
>  every 5 seconds).

It happens here, a bit.  My machine goes up to 60% dirty when it should be
clamping at 40%.

The variable `total_pages' in page-writeback.c (from
nr_free_pagecache_pages()) is too high.  I trace it back to here:

On node 0 totalpages: 1572864
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 1568768 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:1

This machine only has 4G of memory, so the platform code is overestimating
the number of pages by 50%.  Can you please check your dmesg, see if your
system is also getting this wrong?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-27  0:31                       ` Andrew Morton
@ 2005-07-27  1:20                         ` Martin J. Bligh
  2005-07-27  1:26                           ` Andrew Morton
  2005-07-27  1:31                         ` Badari Pulavarty
  1 sibling, 1 reply; 28+ messages in thread
From: Martin J. Bligh @ 2005-07-27  1:20 UTC (permalink / raw)
  To: Andrew Morton, Badari Pulavarty; +Cc: linux-kernel, linux-mm

> It happens here, a bit.  My machine goes up to 60% dirty when it should be
> clamping at 40%.
> 
> The variable `total_pages' in page-writeback.c (from
> nr_free_pagecache_pages()) is too high.  I trace it back to here:
> 
> On node 0 totalpages: 1572864
>   DMA zone: 4096 pages, LIFO batch:1
>   Normal zone: 1568768 pages, LIFO batch:31
>   HighMem zone: 0 pages, LIFO batch:1
> 
> This machine only has 4G of memory, so the platform code is overestimating
> the number of pages by 50%.  Can you please check your dmesg, see if your
> system is also getting this wrong?

I think we're repeatedly iterating over the same zones by walking the 
zonelists:

static unsigned int nr_free_zone_pages(int offset)
{
        pg_data_t *pgdat;
        unsigned int sum = 0;
        int i;

        for_each_pgdat(pgdat) {
                struct zone *zone;

                for (i = 0; i < MAX_NR_ZONES; i++) {
                        unsigned long size, high;

                        zone = &pgdat->node_zones[i];
                        size = zone->present_pages;
                        high = zone->pages_high;

                        if (size > high)
                                sum += size - high;
                }
        }

        return sum;
}

Does that look more sensible? I'd send you a real patch, except the
box just crashed ;-)

M.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-27  1:20                         ` Martin J. Bligh
@ 2005-07-27  1:26                           ` Andrew Morton
  2005-07-27  1:47                             ` Martin J. Bligh
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2005-07-27  1:26 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: pbadari, linux-kernel, linux-mm

"Martin J. Bligh" <mbligh@mbligh.org> wrote:
>
> 
> > It happens here, a bit.  My machine goes up to 60% dirty when it should be
> > clamping at 40%.
> > 
> > The variable `total_pages' in page-writeback.c (from
> > nr_free_pagecache_pages()) is too high.  I trace it back to here:
> > 
> > On node 0 totalpages: 1572864
> >   DMA zone: 4096 pages, LIFO batch:1
> >   Normal zone: 1568768 pages, LIFO batch:31
> >   HighMem zone: 0 pages, LIFO batch:1
> > 
> > This machine only has 4G of memory, so the platform code is overestimating
> > the number of pages by 50%.  Can you please check your dmesg, see if your
> > system is also getting this wrong?
> 
> I think we're repeatedly iterating over the same zones by walking the 
> zonelists:
> 
> static unsigned int nr_free_zone_pages(int offset)
> {
>         pg_data_t *pgdat;
>         unsigned int sum = 0;
>         int i;
> 
>         for_each_pgdat(pgdat) {
>                 struct zone *zone;
> 
>                 for (i = 0; i < MAX_NR_ZONES; i++) {
>                         unsigned long size, high;
> 
>                         zone = pgdat->node_zones[i];
>                         size = zone->present_pages;
>                         high = zone->pages_high;
> 
>                         if (size > high)
>                                 sum += size - high;
>                 }
>         }
> }

I don't think so.  We're getting the wrong answer out of
calculate_zone_totalpages() which is an init-time thing.

Maybe nr_free_zone_pages() is supposed to fix that up post-facto somehow,
but calculate_zone_totalpages() sure as heck shouldn't be putting 1568768
into my ZONE_NORMAL's ->node_present_pages.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-27  0:31                       ` Andrew Morton
  2005-07-27  1:20                         ` Martin J. Bligh
@ 2005-07-27  1:31                         ` Badari Pulavarty
  2005-07-27  1:40                           ` Andrew Morton
  1 sibling, 1 reply; 28+ messages in thread
From: Badari Pulavarty @ 2005-07-27  1:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, linux-mm

On Tue, 2005-07-26 at 17:31 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > On Tue, 2005-07-26 at 16:07 -0700, Andrew Morton wrote:
> >  > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >  > >
> >  > > Here is the data with 5 ext2 filesystems. I also collected /proc/meminfo
> >  > > every 5 seconds. As you can see, we seem to dirty 6GB of data in 20
> >  > > seconds of starting the test. I am not sure if its bad, since we have
> >  > > lots of free memory..
> >  > 
> >  > It's bad.  The logic in balance_dirty_pages() should block those write()
> >  > callers as soon as we hit 40% dirty memory or whatever is in
> >  > /proc/sys/vm/dirty_ratio.  So something is horridly busted.
> >  > 
> >  > Can you try reducing the number of filesystems even further?
> > 
> >  Single ext2 filesystem. We still dirty pretty quickly (data collected
> >  every 5 seconds).
> 
> It happens here, a bit.  My machine goes up to 60% dirty when it should be
> clamping at 40%.
> 
> The variable `total_pages' in page-writeback.c (from
> nr_free_pagecache_pages()) is too high.  I trace it back to here:
> 
> On node 0 totalpages: 1572864
>   DMA zone: 4096 pages, LIFO batch:1
>   Normal zone: 1568768 pages, LIFO batch:31
>   HighMem zone: 0 pages, LIFO batch:1
> 
> This machine only has 4G of memory, so the platform code is overestimating
> the number of pages by 50%.  Can you please check your dmesg, see if your
> system is also getting this wrong?



On node 0 totalpages: 1572863
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 1568767 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:1
On node 1 totalpages: 131071
  DMA zone: 0 pages, LIFO batch:1
  Normal zone: 131071 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:1
On node 2 totalpages: 131071
  DMA zone: 0 pages, LIFO batch:1
  Normal zone: 131071 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:1
On node 3 totalpages: 131071
  DMA zone: 0 pages, LIFO batch:1
  Normal zone: 131071 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:1




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-27  1:31                         ` Badari Pulavarty
@ 2005-07-27  1:40                           ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2005-07-27  1:40 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-kernel, linux-mm

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> > This machine only has 4G of memory, so the platform code is overestimating
> > the number of pages by 50%.  Can you please check your dmesg, see if your
> > system is also getting this wrong?
> 
> 
> 
> On node 0 totalpages: 1572863
>   DMA zone: 4096 pages, LIFO batch:1
>   Normal zone: 1568767 pages, LIFO batch:31
>   HighMem zone: 0 pages, LIFO batch:1
> On node 1 totalpages: 131071
>   DMA zone: 0 pages, LIFO batch:1
>   Normal zone: 131071 pages, LIFO batch:31
>   HighMem zone: 0 pages, LIFO batch:1
> On node 2 totalpages: 131071
>   DMA zone: 0 pages, LIFO batch:1
>   Normal zone: 131071 pages, LIFO batch:31
>   HighMem zone: 0 pages, LIFO batch:1
> On node 3 totalpages: 131071
>   DMA zone: 0 pages, LIFO batch:1
>   Normal zone: 131071 pages, LIFO batch:31
>   HighMem zone: 0 pages, LIFO batch:1

That's about 7.5GB, yes?   On a 6GB machine?

If so, that's a bit off, but not grossly.

Here's the dopey debug patch which I used:

- boot
- dmesg -s 1000000 | grep total_pages > foo
- kill off syslogd  (sudo service syslog stop)
- run the dd command
- wait for it to hit steady state (max dirty memory)
- dmesg -s 1000000 >> foo

diff -puN mm/page-writeback.c~a mm/page-writeback.c
--- 25/mm/page-writeback.c~a	2005-07-26 15:53:46.000000000 -0700
+++ 25-akpm/mm/page-writeback.c	2005-07-26 16:21:55.000000000 -0700
@@ -161,7 +161,8 @@ get_dirty_limits(struct writeback_state 
 	dirty_ratio = vm_dirty_ratio;
 	if (dirty_ratio > unmapped_ratio / 2)
 		dirty_ratio = unmapped_ratio / 2;
-
+	printk("vm_dirty_ratio=%d unmapped_ratio=%d dirty_ratio=%d\n",
+		vm_dirty_ratio, unmapped_ratio, dirty_ratio);
 	if (dirty_ratio < 5)
 		dirty_ratio = 5;
 
@@ -171,6 +172,8 @@ get_dirty_limits(struct writeback_state 
 
 	background = (background_ratio * available_memory) / 100;
 	dirty = (dirty_ratio * available_memory) / 100;
+	printk("dirty_ratio=%d available_memory=%lu dirty=%lu\n",
+		dirty_ratio, available_memory, dirty);
 	tsk = current;
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
 		background += background / 4;
@@ -209,6 +212,12 @@ static void balance_dirty_pages(struct a
 		get_dirty_limits(&wbs, &background_thresh,
 					&dirty_thresh, mapping);
 		nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
+		printk("background_thresh=%ld dirty_thresh=%ld "
+				"nr_dirty=%ld nr_unstable=%ld "
+				"nr_reclaimable=%ld wbs.nr_writeback=%ld\n",
+			background_thresh, dirty_thresh,
+			wbs.nr_dirty, wbs.nr_unstable,
+			nr_reclaimable, wbs.nr_writeback);
 		if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
 			break;
 
@@ -532,6 +541,8 @@ void __init page_writeback_init(void)
 
 	total_pages = nr_free_pagecache_pages();
 
+	printk("total_pages=%ld\n", total_pages);
+
 	correction = (100 * 4 * buffer_pages) / total_pages;
 
 	if (correction < 100) {
_


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Memory pressure handling with iSCSI
  2005-07-27  1:26                           ` Andrew Morton
@ 2005-07-27  1:47                             ` Martin J. Bligh
  0 siblings, 0 replies; 28+ messages in thread
From: Martin J. Bligh @ 2005-07-27  1:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: pbadari, linux-kernel, linux-mm


> I don't think so.  We're getting the wrong answer out of
> calculate_zone_totalpages() which is an init-time thing.
> 
> Maybe nr_free_zone_pages() is supposed to fix that up post-facto somehow,
> but calculate_zone_totalpages() sure as heck shouldn't be putting 1568768
> into my ZONE_NORMAL's ->node_present_pages.

Humpf. I'll look at it again later.

nr_free_pagecache_pages -> nr_free_zone_pages

is it not?

M.


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2005-07-27  1:47 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-26 17:35 Memory pressure handling with iSCSI Badari Pulavarty
2005-07-26 18:04 ` Roland Dreier
2005-07-26 18:11 ` Andrew Morton
2005-07-26 18:39   ` Badari Pulavarty
2005-07-26 18:48     ` Andrew Morton
2005-07-26 19:12       ` Andrew Morton
2005-07-26 20:36         ` Badari Pulavarty
2005-07-26 21:11         ` Badari Pulavarty
2005-07-26 21:24           ` Andrew Morton
2005-07-26 21:45             ` Badari Pulavarty
2005-07-26 22:10               ` Andrew Morton
2005-07-26 22:48                 ` Badari Pulavarty
2005-07-26 23:07                   ` Andrew Morton
2005-07-26 23:26                     ` Badari Pulavarty
2005-07-27  0:31                       ` Andrew Morton
2005-07-27  1:20                         ` Martin J. Bligh
2005-07-27  1:26                           ` Andrew Morton
2005-07-27  1:47                             ` Martin J. Bligh
2005-07-27  1:31                         ` Badari Pulavarty
2005-07-27  1:40                           ` Andrew Morton
2005-07-26 19:31     ` Sonny Rao
2005-07-26 20:37       ` Badari Pulavarty
2005-07-26 21:21         ` Andrew Morton
2005-07-26 20:59 ` Rik van Riel
2005-07-26 21:05   ` Badari Pulavarty
2005-07-26 21:33     ` Martin J. Bligh
2005-07-26 22:05       ` Adam Litke
2005-07-26 21:12   ` Andrew Morton
