* xmm2 - monitor Linux MM active/inactive lists graphically
From: Zlatko Calusic @ 2001-10-24 10:42 UTC
To: linux-mm, linux-kernel
A new version is out and can be found at the same URL:
<URL:http://linux.inet.hr/>
As Linus' MM dropped the inactive dirty/clean lists in favour of just one
inactive list, the application needed to be modified to support that.
You can still use the older version for kernels <= 2.4.9 and/or Alan's
(-ac) kernels, which continue to use Rik's older VM system.
Enjoy and, as usual, all comments welcome!
--
Zlatko
P.S. BTW, 2.4.13 still has very suboptimal writeout performance and
andrea@suse.de is redirected to /dev/null. <g>
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Marcelo Tosatti @ 2001-10-24 14:26 UTC
To: Zlatko Calusic, Linus Torvalds; +Cc: linux-mm, lkml
On 24 Oct 2001, Zlatko Calusic wrote:
> P.S. BTW, 2.4.13 still has very suboptimal writeout performance and
> andrea@suse.de is redirected to /dev/null. <g>
Zlatko,
Could you please show us your case of bad writeout performance?
Thanks
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Zlatko Calusic @ 2001-10-25 0:25 UTC
To: Marcelo Tosatti; +Cc: Linus Torvalds, linux-mm, lkml
Marcelo Tosatti <marcelo@conectiva.com.br> writes:
> On 24 Oct 2001, Zlatko Calusic wrote:
>
> > P.S. BTW, 2.4.13 still has very suboptimal writeout performance and
> > andrea@suse.de is redirected to /dev/null. <g>
>
> Zlatko,
>
> Could you please show us your case of bad writeout performance ?
>
> Thanks
>
Sure. Output of 'vmstat 1' follows:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 254552 5120 183476 0 0 12 24 178 438 2 37 60
0 1 0 0 137296 5232 297760 0 0 4 5284 195 440 3 43 54
1 0 0 0 126520 5244 308260 0 0 0 10588 215 230 0 3 96
0 2 0 0 117488 5252 317064 0 0 0 8796 176 139 1 3 96
0 2 0 0 107556 5264 326744 0 0 0 9704 174 78 0 3 97
0 2 0 0 99552 5268 334548 0 0 0 7880 174 67 0 3 97
0 2 0 0 89448 5280 344392 0 0 0 9804 175 76 0 4 96
0 1 0 0 79352 5288 354236 0 0 0 9852 176 87 0 5 95
0 1 0 0 71220 5300 362156 0 0 4 7884 170 120 0 4 96
0 1 0 0 63088 5308 370084 0 0 0 7936 174 76 0 3 97
0 2 0 0 52988 5320 379924 0 0 0 9920 175 77 0 4 96
0 2 0 0 43148 5328 389516 0 0 0 9548 174 97 0 4 95
0 2 0 0 35144 5336 397316 0 0 0 7820 176 73 0 3 97
0 2 0 0 25172 5344 407036 0 0 0 9724 188 183 0 4 96
0 2 1 0 17300 5352 414708 0 0 0 7744 174 78 0 4 96
0 1 0 0 7068 5360 424684 0 0 0 9920 175 93 0 3 97
0 1 0 0 3128 4132 430132 0 0 0 9920 174 81 0 4 96
Notice how there's plenty of RAM. I'm writing sequentially to a file
on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!
--
Zlatko
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-25 4:19 UTC
To: Zlatko Calusic; +Cc: Marcelo Tosatti, linux-mm, lkml
On 25 Oct 2001, Zlatko Calusic wrote:
>
> Sure. Output of 'vmstat 1' follows:
>
> 1 0 0 0 254552 5120 183476 0 0 12 24 178 438 2 37 60
> 0 1 0 0 137296 5232 297760 0 0 4 5284 195 440 3 43 54
> 1 0 0 0 126520 5244 308260 0 0 0 10588 215 230 0 3 96
> 0 2 0 0 117488 5252 317064 0 0 0 8796 176 139 1 3 96
> 0 2 0 0 107556 5264 326744 0 0 0 9704 174 78 0 3 97
This does not look like a VM issue at all - at this point you're already
getting only 10MB/s, yet the VM isn't even involved (there's definitely no
VM pressure here).
> Notice how there's plenty of RAM. I'm writing sequentially to a file
> on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
> capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!
Are you sure you haven't lost some DMA setting or something?
Linus
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-25 4:57 UTC
To: Zlatko Calusic; +Cc: Marcelo Tosatti, linux-mm, lkml
On Wed, 24 Oct 2001, Linus Torvalds wrote:
>
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Sure. Output of 'vmstat 1' follows:
> >
> > 1 0 0 0 254552 5120 183476 0 0 12 24 178 438 2 37 60
> > 0 1 0 0 137296 5232 297760 0 0 4 5284 195 440 3 43 54
> > 1 0 0 0 126520 5244 308260 0 0 0 10588 215 230 0 3 96
> > 0 2 0 0 117488 5252 317064 0 0 0 8796 176 139 1 3 96
> > 0 2 0 0 107556 5264 326744 0 0 0 9704 174 78 0 3 97
>
> This does not look like a VM issue at all - at this point you're already
> getting only 10MB/s, yet the VM isn't even involved (there's definitely no
> VM pressure here).
I wonder if you're getting screwed by bdflush().. You do have a lot of
context switching going on, and you do have a clear pattern: once the
write-out gets going, you're filling new cached pages at about the same
pace that you're writing them out, which definitely means that the dirty
buffer balancing is nice and active.
So the problem is that you're obviously not actually getting the
throughput you should - it's not the VM, as the page cache grows nicely at
the same rate you're writing.
Try something for me: in fs/buffer.c make "balance_dirty_state()" never
return > 0, ie make the "return 1" be a "return 0" instead.
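A minimal sketch of that experiment, paraphrasing the shape of 2.4-era
fs/buffer.c rather than quoting it; the two helpers below are hypothetical
stand-ins for the real dirty-buffer accounting:

extern unsigned long dirty_buffer_pages(void);	/* hypothetical helper */
extern unsigned long soft_dirty_limit(void);	/* hypothetical, the ~40% mark */

/* > 0 means the caller also wakes bdflush and waits synchronously;
 * 0 means it only starts async write-out; -1 means do nothing. */
static int balance_dirty_state(void)
{
	unsigned long dirty = dirty_buffer_pages();
	unsigned long soft = soft_dirty_limit();

	if (dirty > 2 * soft)
		return 0;	/* the experiment: this was "return 1" */
	if (dirty > soft)
		return 0;
	return -1;
}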
That will cause us to not wake up bdflush at all, and if you're just on
the "border" of 40% dirty buffer usage you'll have bdflush work in
lock-step with you, alternately writing out buffers and waiting for them.
Quite frankly, just the act of doing the "write_some_buffers()" in
balance_dirty() should cause us to block much better than the synchronous
waiting anyway, because then we will block when the request queue fills
up, not at random points.
Even so, considering that you have such a steady 9-10MB/s, please
double-check that it's not something even simpler and embarrassing, like
just having forgotten to enable auto-DMA in the kernel config ;)
Linus
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Zlatko Calusic @ 2001-10-25 9:07 UTC
To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm, lkml
Linus Torvalds <torvalds@transmeta.com> writes:
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Sure. Output of 'vmstat 1' follows:
> >
> > 1 0 0 0 254552 5120 183476 0 0 12 24 178 438 2 37 60
> > 0 1 0 0 137296 5232 297760 0 0 4 5284 195 440 3 43 54
> > 1 0 0 0 126520 5244 308260 0 0 0 10588 215 230 0 3 96
> > 0 2 0 0 117488 5252 317064 0 0 0 8796 176 139 1 3 96
> > 0 2 0 0 107556 5264 326744 0 0 0 9704 174 78 0 3 97
>
> This does not look like a VM issue at all - at this point you're already
> getting only 10MB/s, yet the VM isn't even involved (there's definitely no
> VM pressure here).
That's true, I'll admit. Anyway, -ac kernels don't have the problem,
and I was misled by the fact that only the VM implementation differs
between those two branches (at least I think so).
>
> > Notice how there's plenty of RAM. I'm writing sequentially to a file
> > on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
> > capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!
>
> Are you sure you haven't lost some DMA setting or something?
>
No. Setup is fine. I wouldn't make such a mistake. :)
If the disk were in some PIO mode, CPU usage would be much higher, but
it isn't.
This all definitely looks like a problem either in the bdflush daemon,
or the request queue/elevator, but unfortunately I don't have enough
knowledge of those areas to pinpoint it more precisely.
--
Zlatko
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Zlatko Calusic @ 2001-10-25 12:48 UTC
To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm, lkml
Linus Torvalds <torvalds@transmeta.com> writes:
> I wonder if you're getting screwed by bdflush().. You do have a lot of
> context switching going on, and you do have a clear pattern: once the
> write-out gets going, you're filling new cached pages at about the same
> pace that you're writing them out, which definitely means that the dirty
> buffer balancing is nice and active.
>
Yes, but things are similar when I finally allocate the whole memory and
kswapd kicks in. Everything behaves in the same way, so it is
definitely not the VM, as you pointed out.
> So the problem is that you're obviously not actually getting the
> throughput you should - it's not the VM, as the page cache grows nicely at
> the same rate you're writing.
>
Yes.
> Try something for me: in fs/buffer.c make "balance_dirty_state()" never
> return > 0, ie make the "return 1" be a "return 0" instead.
>
Sure. I recompiled a fresh 2.4.13 at work and reran the tests. This time
on a different setup, so the numbers are even smaller (the tests were
performed on the last partition of the disk, where the disk is capable of
~13 MB/s):
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 6308 600 441592 0 0 0 7788 159 132 0 7 93
0 1 0 0 3692 580 444272 0 0 0 5748 169 197 1 4 95
0 1 0 0 3180 556 444804 0 0 0 5632 228 408 1 5 94
0 1 0 0 3720 556 444284 0 0 0 7672 226 418 3 4 93
0 1 0 0 3836 556 444148 0 0 0 5928 249 509 0 8 92
0 1 0 0 3204 388 444952 0 0 0 7828 156 139 0 6 94
1 1 0 0 3456 392 444692 0 0 0 5952 157 139 0 5 95
0 1 0 0 3728 400 444428 0 0 0 7840 312 750 0 7 93
0 1 0 0 3968 404 444168 0 0 0 5952 216 364 0 5 95
> That will cause us to not wake up bdflush at all, and if you're just on
> the "border" of 40% dirty buffer usage you'll have bdflush work in
> lock-step with you, alternately writing out buffers and waiting for them.
>
> Quite frankly, just the act of doing the "write_some_buffers()" in
> balance_dirty() should cause us to block much better than the synchronous
> waiting anyway, because then we will block when the request queue fills
> up, not at random points.
>
> Even so, considering that you have such a steady 9-10MB/s, please
> double-check that it's not something even simpler and embarrassing, like
> just having forgotten to enable auto-DMA in the kernel config ;)
>
Yes, I definitely have DMA turned ON. All parameters are OK. :)
# hdparm /dev/hda
/dev/hda:
multcount = 16 (on)
I/O support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 1650/255/63, sectors = 26520480, start = 0
--
Zlatko
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-25 16:31 UTC
To: Zlatko Calusic; +Cc: Marcelo Tosatti, linux-mm, lkml
On 25 Oct 2001, Zlatko Calusic wrote:
>
> Yes, I definitely have DMA turned ON. All parameters are OK. :)
I suspect it may just be that "queue_nr_requests"/"batch_count" is
different in -ac: what happens if you tweak them to the same values?
(See drivers/block/ll_rw_blk.c)
I think -ac made the queues a bit deeper: the regular kernel does 128
requests and a batch-count of 16; I _think_ -ac does something like "2
requests per megabyte" and batch_count=32, so if you have 512MB you should
try with
queue_nr_requests = 1024
batch_count = 32
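A sketch of where those knobs live, assuming only what this thread shows
(both are set at boot in drivers/block/ll_rw_blk.c, and mainline spells the
batch variable batch_requests; this is not verbatim 2.4.13 source):

/* drivers/block/ll_rw_blk.c -- pin the values for the experiment
 * instead of the boot-time computed defaults (sketch only): */
static int queue_nr_requests = 1024;	/* slots per queue, half READ / half WRITE */
static int batch_requests = 32;		/* free slots released per wakeup batch */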
Does that help?
Linus
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Jens Axboe @ 2001-10-25 17:33 UTC
To: Linus Torvalds; +Cc: Zlatko Calusic, Marcelo Tosatti, linux-mm, lkml
On Thu, Oct 25 2001, Linus Torvalds wrote:
>
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
>
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
>
> (See drivers/block/ll_rw_blk.c)
>
> I think -ac made the queues a bit deeper: the regular kernel does 128
> requests and a batch-count of 16; I _think_ -ac does something like "2
> requests per megabyte" and batch_count=32, so if you have 512MB you should
> try with
>
> queue_nr_requests = 1024
> batch_count = 32
Right, -ac keeps the elevator flow control and proper queue sizes.
--
Jens Axboe
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Zlatko Calusic @ 2001-10-26 9:45 UTC
To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm, lkml
Linus Torvalds <torvalds@transmeta.com> writes:
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
>
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
>
> (See drivers/block/ll_rw_blk.c)
>
> I think -ac made the queues a bit deeper: the regular kernel does 128
> requests and a batch-count of 16; I _think_ -ac does something like "2
> requests per megabyte" and batch_count=32, so if you have 512MB you should
> try with
>
> queue_nr_requests = 1024
> batch_count = 32
>
> Does that help?
>
Unfortunately not. It makes the machine quite unresponsive while it's
writing to disk, and vmstat 1 reveals strange "spiky" behaviour. Average
throughput is ~8 MB/s (the disk is capable of ~13 MB/s):
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 0 0 0 3840 528 441900 0 0 0 34816 188 594 2 34 64
0 1 0 0 3332 536 442384 0 0 4 10624 187 519 2 8 90
0 1 0 0 3324 536 442384 0 0 0 0 182 499 0 0 100
2 1 0 0 3300 536 442384 0 0 0 0 198 486 0 1 99
1 1 0 0 3304 536 442384 0 0 0 0 186 513 0 0 100
0 1 1 0 3304 536 442384 0 0 0 0 193 473 0 1 99
0 1 1 0 3304 536 442384 0 0 0 0 191 508 1 1 98
0 1 0 0 3884 536 441840 0 0 4 44672 189 590 4 40 56
0 1 0 0 3860 536 441840 0 0 0 0 186 526 0 1 99
0 1 0 0 3852 536 441840 0 0 0 0 191 500 0 0 100
0 1 0 0 3844 536 441840 0 0 0 0 193 482 1 0 99
0 1 0 0 3844 536 441840 0 0 0 0 187 511 0 1 99
0 2 1 0 3832 540 441844 0 0 4 0 305 1004 3 2 95
0 3 1 0 3824 544 441844 0 0 4 0 410 1340 2 2 96
0 3 0 0 3764 552 441916 0 0 12 47360 346 915 6 41 53
0 3 0 0 3764 552 441916 0 0 0 0 373 887 0 0 100
0 3 0 0 3764 552 441916 0 0 0 0 278 692 1 2 97
1 3 0 0 3764 552 441916 0 0 0 0 221 579 0 3 97
0 3 0 0 3764 552 441916 0 0 0 0 286 704 0 2 98
I'll now test "batch_count = queue_nr_requests / 3", which I found in
2.4.14-pre2, but with queue_nr_requests still left at 1024, and report
the results after that.
--
Zlatko
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Zlatko Calusic @ 2001-10-26 10:08 UTC
To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm, lkml
Linus Torvalds <torvalds@transmeta.com> writes:
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
>
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
>
Next test:
block: 1024 slots per queue, batch=341
Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)
Still very spiky, and during the write the disk is incapable of doing any
reads. IOW, no serious application can be started before the writing has
finished. Shouldn't we favour reads over writes? Or is it just that
the elevator is not doing its job right, so reads suffer?
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 1 1 0 3600 424 453416 0 0 0 0 190 510 2 1 97
0 1 1 0 3596 424 453416 0 0 0 40468 189 508 2 2 96
0 1 1 0 3592 424 453416 0 0 0 0 189 541 1 0 99
0 1 1 0 3592 424 453416 0 0 0 0 190 513 1 0 99
1 1 1 0 3592 424 453416 0 0 0 0 192 511 0 1 99
0 1 1 0 3596 424 453416 0 0 0 0 188 528 0 0 100
0 1 1 0 3592 424 453416 0 0 0 0 188 510 1 0 99
0 1 1 0 3592 424 453416 0 0 0 41444 195 507 0 2 98
0 1 1 0 3592 424 453416 0 0 0 0 190 514 1 1 98
1 1 1 0 3588 424 453416 0 0 0 0 192 554 0 2 98
0 1 1 0 3584 424 453416 0 0 0 0 191 506 0 1 99
0 1 1 0 3584 424 453416 0 0 0 0 186 514 0 0 100
0 1 1 0 3584 424 453416 0 0 0 0 186 515 0 0 100
1 1 1 0 3576 424 453416 0 0 0 0 434 1493 3 2 95
1 1 1 0 3564 424 453416 0 0 0 40560 301 936 3 1 96
0 1 1 0 3564 424 453416 0 0 0 0 338 1050 1 2 97
0 1 1 0 3560 424 453416 0 0 0 0 286 893 1 2 97
--
Zlatko
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Jens Axboe @ 2001-10-26 14:39 UTC
To: Zlatko Calusic; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm, lkml
On Fri, Oct 26 2001, Zlatko Calusic wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
>
> > On 25 Oct 2001, Zlatko Calusic wrote:
> > >
> > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> >
> > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > different in -ac: what happens if you tweak them to the same values?
> >
>
> Next test:
>
> block: 1024 slots per queue, batch=341
That's way too much, batch should just stay around 32, that is fine.
> Still very spiky, and during the write the disk is incapable of doing any
> reads. IOW, no serious application can be started before the writing has
> finished. Shouldn't we favour reads over writes? Or is it just that
> the elevator is not doing its job right, so reads suffer?
You are probably just seeing starvation due to the very long queues.
--
Jens Axboe
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Zlatko Calusic @ 2001-10-26 14:57 UTC
To: Jens Axboe; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm, lkml
Jens Axboe <axboe@suse.de> writes:
> On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > Linus Torvalds <torvalds@transmeta.com> writes:
> >
> > > On 25 Oct 2001, Zlatko Calusic wrote:
> > > >
> > > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> > >
> > > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > > different in -ac: what happens if you tweak them to the same values?
> > >
> >
> > Next test:
> >
> > block: 1024 slots per queue, batch=341
>
> That's way too much, batch should just stay around 32, that is fine.
OK. Anyway, neither configuration works well, so the problem might be
somewhere else.
While we're at it, could you give a short explanation of those two parameters?
>
> > Still very spiky, and during the write the disk is incapable of doing any
> > reads. IOW, no serious application can be started before the writing has
> > finished. Shouldn't we favour reads over writes? Or is it just that
> > the elevator is not doing its job right, so reads suffer?
>
> You are probably just seeing starvation due to the very long queues.
>
Is there anything we could do about that? I remember Linux once
favoured reads, but I'm not sure if we still do that these days.
When I find some time, I'll dig around that code. It is a very
interesting part of the kernel, I'm sure; I just haven't had enough
time so far to spend hacking on it.
--
Zlatko
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Jens Axboe @ 2001-10-26 15:01 UTC
To: Zlatko Calusic; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm, lkml
On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > > Linus Torvalds <torvalds@transmeta.com> writes:
> > >
> > > > On 25 Oct 2001, Zlatko Calusic wrote:
> > > > >
> > > > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> > > >
> > > > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > > > different in -ac: what happens if you tweak them to the same values?
> > > >
> > >
> > > Next test:
> > >
> > > block: 1024 slots per queue, batch=341
> >
> > That's way too much, batch should just stay around 32, that is fine.
>
> OK. Anyway, neither configuration works well, so the problem might be
> somewhere else.
Most likely, yes.
> While we're at it, could you give a short explanation of those two parameters?
Sure. queue_nr_requests is the total number of free request slots per
queue; there are queue_nr_requests / 2 free slots each for READ and WRITE.
Each request can hold anywhere from the fs block size up to 127kB of data
by default. batch only matters once the request free list has been
depleted: in order to give the elevator some input to work with, we free
request slots in batches of 'batch' to get decent merging etc. That's
why numbers bigger than ~32 would not be such a good idea and would only
add latency.
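To put numbers on this, a small stand-alone sketch using figures from this
thread (the 127kB per-request cap Jens just mentioned; 128 slots is the
stock queue, 341 is the batch value Zlatko tested):

#include <stdio.h>

int main(void)
{
	long per_request = 127 * 1024L;	/* max merged data per slot */

	/* stock kernel: 128 slots per queue, half of them for WRITE */
	printf("stock : %ld kB of writes in flight\n",
	       (128 / 2) * per_request / 1024);

	/* tested setup: batch=341 means a blocked writer waits until
	 * 341 slots have drained -- roughly the ~40MB write bursts
	 * visible in the vmstat output above */
	printf("tested: %ld kB per batch release\n",
	       341 * per_request / 1024);
	return 0;
}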
> > > Still very spiky, and during the write the disk is incapable of doing any
> > > reads. IOW, no serious application can be started before the writing has
> > > finished. Shouldn't we favour reads over writes? Or is it just that
> > > the elevator is not doing its job right, so reads suffer?
> >
> > You are probably just seeing starvation due to the very long queues.
> >
>
> Is there anything we could do about that? I remember Linux once
> favoured reads, but I'm not sure if we still do that these days.
It still favors reads; take a look at the initial sequence numbers given
to reads and writes. We used to favor reads in the request slots too --
you could try changing the blk_init_freelist split so that you get a
1/3 - 2/3 ratio between WRITEs and READs and see if that makes the
system smoother.
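A hedged sketch of that suggestion, modeled on the freelist init loop
visible in Linus's patch later in this thread (alloc_request() is a
hypothetical stand-in; i, q and queue_nr_requests are as in that init
function):

/* Instead of alternating slots 50/50 via (i & 1), give READ two of
 * every three slots -- a 1/3 WRITE, 2/3 READ split: */
for (i = 0; i < queue_nr_requests; i++) {
	struct request *rq = alloc_request();	/* hypothetical helper */
	int rw = (i % 3) ? READ : WRITE;	/* i%3 != 0 -> READ */
	rq->rq_status = RQ_INACTIVE;
	list_add(&rq->queue, &q->rq[rw].free);
	q->rq[rw].count++;
}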
> When I find some time, I'll dig around that code. It is a very
> interesting part of the kernel, I'm sure; I just haven't had enough
> time so far to spend hacking on it.
Indeed it is.
--
Jens Axboe
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-26 16:04 UTC
To: Zlatko Calusic; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml
On 26 Oct 2001, Zlatko Calusic wrote:
>
> OK. Anyway, neither configuration works well, so the problem might be
> somewhere else.
>
> While we're at it, could you give a short explanation of those two parameters?
Did you try the ones 2.4.14-2 does?
Basically, the "queue_nr_requests" means how many requests there can be
for this queue. Half of them are allocated to reads, half of them are
allocated to writes.
The "batch_requests" thing is something that kicks in when the queue has
emptied - we don't want to "trickle" requests to users, because if we do,
a new large write will not be able to merge its new
requests sanely because it basically has to do them one at a time. So when
we run out of requests (ie "queue_nr_requests" isn't enough), we start
putting the freed-up requests on a "pending" list, and we release them
only when the pending list is bigger than "batch_requests".
Now, one thing to remember is that "queue_nr_requests" is for the whole
queue (half of them for reads, half for writes), and "batch_requests" is a
per-type thing (ie we batch reads and writes separately). So
"batch_requests" must be less than half of "queue_nr_requests", or we will
never release anything at all.
Now, in Alan's tree, there is a separate tuning thing, which is the "max
nr of _sectors_ in flight", which in my opinion is pretty bogus. It's
really a memory-management thing, but it also does something else: it has
low-and-high water-marks, and those might well be a good idea. It is
possible that we should just ditch the "batch_requests" thing, and use the
watermarks instead.
Side note: all of this is relevant really only for writes - reads pretty
much only care about the maximum queue-size, and it's very hard to get a
_huge_ queue-size with reads unless you do tons of read-ahead.
Now, the "batching" is technically equivalent with water-marking if there
is _one_ writer. But if there are multiple writers, water-marking may
actually has some advantages: it might allow the other writer to make some
progress when the first one has stopped, while the batching will stop
everybody until the batch is released. Who knows.
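A compact sketch of the two release policies being contrasted here -
illustrative only; the field names follow the patch that appears later in
this thread, and the wakeup helpers are hypothetical:

struct rq_list { int count; };		/* free request slots */
extern void wake_all_sleepers(void);	/* hypothetical */
extern void wake_one_sleeper(void);	/* hypothetical */
static const int batch_requests = 32;

/* Batching: completed requests pile up on a pending list and nobody
 * is woken until a whole batch has accumulated -- so every sleeper,
 * including an unrelated second writer, waits for the full batch. */
static void free_request_batched(struct rq_list *pending, struct rq_list *freelist)
{
	if (++pending->count >= batch_requests) {
		freelist->count += pending->count;	/* splice pending -> free */
		pending->count = 0;
		wake_all_sleepers();
	}
}

/* Watermarking: the request goes straight back on the free list, and
 * sleepers are woken once the free count passes the mark -- so another
 * writer can make progress while the first is still throttled. */
static void free_request_watermark(struct rq_list *freelist)
{
	if (++freelist->count >= batch_requests)
		wake_one_sleeper();
}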
Anyway, the reason I think Alan's "max nr of sectors" is bogus is because:
- it's a global count, and if you have 10 controllers and want to write
to all 10, you _should_ be able to - you can write 10 times as many
requests in the same latency, so there is nothing "global" about it.
(It turns out that one advantage of the globalism is that it ends up
limiting MM write-outs, but I personally think that is a _MM_ thing, ie
we might want to have a "we have half of all our pages in flight, we
have to throttle now" thing in "writepage()", not in the queue)
- "nr of sectors" has very little to do with request latency on most
hardware. You can do 255 sectors (ie one request) almost as fast as you
can do just one, if you do them in one request. While just _two_
sectors might be much slower than the 255, if they are in separate
requests and cause seeking.
So from a latency standpoint, the "request" is a much better number.
So Alan almost never throttles on requests (on big machines, the -ac tree
allows thousands of requests in flight per queue), while he _does_ have
this water-marking for sectors.
So I have two suspicions:
- 128 requests (ie 64 for writes) like the default kernel should be
_plenty_ enough to keep the disks busy, especially for streaming
writes. It's small enough that you don't get the absolutely _huge_
spikes you get with thousands of requests, while being large enough for
fast writers that even if they _do_ block for 32 of the 64 requests,
they'll have time to refill the next 32 long before the 32 pending ones
have finished.
Also: limiting the write queue to 128 requests means that you can
pretty much guarantee that you can get at least a few read requests
per second, even if the write queue is constantly full, and even if
your reader is serialized.
BUT:
- the hard "batch" count is too harsh. It works as a watermark in the
degenerate case, but doesn't allow a second writer to use up _some_ of
the requests while the first writer is blocked due to watermarking.
So with batching, when the queue is full and another process wants
memory, that _OTHER_ process will also always block until the queue has
emptied.
With watermarks, when the writer has filled up the queue and starts
waiting, other processes can still do some writing as long as they
don't fill up the queue again. So if you have MM pressure but the
writer is blocked (and some requests _have_ completed, but the writer
waits for the low-water-mark), you can still push out requests.
That's also likely to be a lot more fair - batching tends to give the
whole batch to the big writer, while watermarking automatically allows
others to get a look at the queue.
I'll whip up a patch for testing (2.4.14-2 made the batching slightly
saner, but the same "hard" behaviour is pretty much unavoidable with
batching).
Linus
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-26 16:57 UTC
To: Zlatko Calusic; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml
On 26 Oct 2001, Zlatko Calusic wrote:
>
> When I find some time, I'll dig around that code. It is a very
> interesting part of the kernel, I'm sure; I just haven't had enough
> time so far to spend hacking on it.
Attached is a very untested patch (but hey, it compiles, so it must work,
right?) against 2.4.14-pre2, which makes the batching be a high/low
watermark thing instead. It actually simplified the code, but that is, of
course, assuming that it works at all ;)
(If I got the comparisons wrong, or if I update the counts wrong, your IO
queue will probably stop cold. So be careful. The code is obvious
enough, but typos and thinkos happen).
Linus
diff -u --recursive pre2/linux/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- pre2/linux/drivers/block/ll_rw_blk.c Fri Oct 26 09:48:25 2001
+++ linux/drivers/block/ll_rw_blk.c Fri Oct 26 09:53:54 2001
@@ -140,21 +140,23 @@
return &blk_dev[MAJOR(dev)].request_queue;
}
-static int __blk_cleanup_queue(struct list_head *head)
+static int __blk_cleanup_queue(struct request_list *list)
{
+ struct list_head *head = &list->free;
struct request *rq;
int i = 0;
- if (list_empty(head))
- return 0;
-
- do {
+ while (!list_empty(head)) {
rq = list_entry(head->next, struct request, queue);
list_del(&rq->queue);
kmem_cache_free(request_cachep, rq);
i++;
- } while (!list_empty(head));
+ };
+ if (i != list->count)
+ printk("request list leak!\n");
+
+ list->count = 0;
return i;
}
@@ -176,10 +178,8 @@
{
int count = queue_nr_requests;
- count -= __blk_cleanup_queue(&q->request_freelist[READ]);
- count -= __blk_cleanup_queue(&q->request_freelist[WRITE]);
- count -= __blk_cleanup_queue(&q->pending_freelist[READ]);
- count -= __blk_cleanup_queue(&q->pending_freelist[WRITE]);
+ count -= __blk_cleanup_queue(&q->rq[READ]);
+ count -= __blk_cleanup_queue(&q->rq[WRITE]);
if (count)
printk("blk_cleanup_queue: leaked requests (%d)\n", count);
@@ -331,11 +331,10 @@
struct request *rq;
int i;
- INIT_LIST_HEAD(&q->request_freelist[READ]);
- INIT_LIST_HEAD(&q->request_freelist[WRITE]);
- INIT_LIST_HEAD(&q->pending_freelist[READ]);
- INIT_LIST_HEAD(&q->pending_freelist[WRITE]);
- q->pending_free[READ] = q->pending_free[WRITE] = 0;
+ INIT_LIST_HEAD(&q->rq[READ].free);
+ INIT_LIST_HEAD(&q->rq[WRITE].free);
+ q->rq[READ].count = 0;
+ q->rq[WRITE].count = 0;
/*
* Divide requests in half between read and write
@@ -349,7 +348,8 @@
}
memset(rq, 0, sizeof(struct request));
rq->rq_status = RQ_INACTIVE;
- list_add(&rq->queue, &q->request_freelist[i & 1]);
+ list_add(&rq->queue, &q->rq[i&1].free);
+ q->rq[i&1].count++;
}
init_waitqueue_head(&q->wait_for_request);
@@ -423,10 +423,12 @@
static inline struct request *get_request(request_queue_t *q, int rw)
{
struct request *rq = NULL;
+ struct request_list *rl = q->rq + rw;
- if (!list_empty(&q->request_freelist[rw])) {
- rq = blkdev_free_rq(&q->request_freelist[rw]);
+ if (!list_empty(&rl->free)) {
+ rq = blkdev_free_rq(&rl->free);
list_del(&rq->queue);
+ rl->count--;
rq->rq_status = RQ_ACTIVE;
rq->special = NULL;
rq->q = q;
@@ -443,17 +445,13 @@
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);
+ generic_unplug_device(q);
add_wait_queue_exclusive(&q->wait_for_request, &wait);
- for (;;) {
- __set_current_state(TASK_UNINTERRUPTIBLE);
- spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
- if (rq)
- break;
- generic_unplug_device(q);
- schedule();
- }
+ do {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (q->rq[rw].count < batch_requests)
+ schedule();
+ } while ((rq = get_request(q,rw)) == NULL);
remove_wait_queue(&q->wait_for_request, &wait);
current->state = TASK_RUNNING;
return rq;
@@ -542,15 +540,6 @@
list_add(&req->queue, insert_here);
}
-inline void blk_refill_freelist(request_queue_t *q, int rw)
-{
- if (q->pending_free[rw]) {
- list_splice(&q->pending_freelist[rw], &q->request_freelist[rw]);
- INIT_LIST_HEAD(&q->pending_freelist[rw]);
- q->pending_free[rw] = 0;
- }
-}
-
/*
* Must be called with io_request_lock held and interrupts disabled
*/
@@ -564,28 +553,12 @@
/*
* Request may not have originated from ll_rw_blk. if not,
- * asumme it has free buffers and check waiters
+ * assume it has free buffers and check waiters
*/
if (q) {
- /*
- * If nobody is waiting for requests, don't bother
- * batching up.
- */
- if (!list_empty(&q->request_freelist[rw])) {
- list_add(&req->queue, &q->request_freelist[rw]);
- return;
- }
-
- /*
- * Add to pending free list and batch wakeups
- */
- list_add(&req->queue, &q->pending_freelist[rw]);
-
- if (++q->pending_free[rw] >= batch_requests) {
- int wake_up = q->pending_free[rw];
- blk_refill_freelist(q, rw);
- wake_up_nr(&q->wait_for_request, wake_up);
- }
+ list_add(&req->queue, &q->rq[rw].free);
+ if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
+ wake_up(&q->wait_for_request);
}
}
@@ -1144,7 +1117,7 @@
/*
* Batch frees according to queue length
*/
- batch_requests = queue_nr_requests/3;
+ batch_requests = queue_nr_requests/4;
printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
#ifdef CONFIG_AMIGA_Z2RAM
diff -u --recursive pre2/linux/include/linux/blkdev.h linux/include/linux/blkdev.h
--- pre2/linux/include/linux/blkdev.h Tue Oct 23 22:01:01 2001
+++ linux/include/linux/blkdev.h Fri Oct 26 09:36:41 2001
@@ -66,14 +66,17 @@
*/
#define QUEUE_NR_REQUESTS 8192
+struct request_list {
+ unsigned int count;
+ struct list_head free;
+};
+
struct request_queue
{
/*
* the queue request freelist, one for reads and one for writes
*/
- struct list_head request_freelist[2];
- struct list_head pending_freelist[2];
- int pending_free[2];
+ struct request_list rq[2];
/*
* Together with queue_head for cacheline sharing
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-26 17:19 UTC
To: Zlatko Calusic; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml
On Fri, 26 Oct 2001, Linus Torvalds wrote:
>
> Attached is a very untested patch (but hey, it compiles, so it must work,
> right?)
And it actually does seem to.
Zlatko, does this make a difference for your disk?
Linus
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Giuliano Pochini @ 2001-10-27 13:14 UTC
To: zlatko.calusic; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm, lkml
> block: 1024 slots per queue, batch=341
>
> Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)
>
> Still very spiky, and during the write the disk is incapable of doing any
> reads. IOW, no serious application can be started before the writing has
> finished. Shouldn't we favour reads over writes? Or is it just that
> the elevator is not doing its job right, so reads suffer?
>
> procs memory swap io system cpu
> r b w swpd free buff cache si so bi bo in cs us sy id
> 0 1 1 0 3596 424 453416 0 0 0 40468 189 508 2 2 96
341*127K = ~40M.
Batch is too high. It doesn't explain why reads get delayed so much, anyway.
Bye.
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Mike Fedyk @ 2001-10-28 5:05 UTC
To: Giuliano Pochini
Cc: zlatko.calusic, Linus Torvalds, Marcelo Tosatti, linux-mm, lkml
On Sat, Oct 27, 2001 at 03:14:44PM +0200, Giuliano Pochini wrote:
>
> > block: 1024 slots per queue, batch=341
> >
> > Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)
> >
> > Still very spiky, and during the write the disk is incapable of doing any
> > reads. IOW, no serious application can be started before the writing has
> > finished. Shouldn't we favour reads over writes? Or is it just that
> > the elevator is not doing its job right, so reads suffer?
> >
> > procs memory swap io system cpu
> > r b w swpd free buff cache si so bi bo in cs us sy id
> > 0 1 1 0 3596 424 453416 0 0 0 40468 189 508 2 2 96
>
> 341*127K = ~40M.
>
> Batch is too high. It doesn't explain why reads get delayed so much, anyway.
>
Try modifying the elevator queue length with elvtune.
BTW, 2.2.19 has the queue lengths in the hundreds, and 2.4.xx has them in
the thousands. I've set 2.4 kernels back to the 2.2 defaults, and
interactive performance has gone up considerably. These are subjective
tests, though.
Mike
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Zlatko Calusic @ 2001-10-28 17:30 UTC
To: Linus Torvalds; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml
Linus Torvalds <torvalds@transmeta.com> writes:
> On Fri, 26 Oct 2001, Linus Torvalds wrote:
> >
> > Attached is a very untested patch (but hey, it compiles, so it must work,
> > right?)
>
> And it actually does seem to.
>
> Zlatko, does this make a difference for your disk?
>
First, sorry for such a delay in answering, I was busy.
I compiled 2.4.14-pre3 as it seems to be identical to your p2p3 patch,
with regard to queue processing.
Unfortunately, things didn't change on my first disk (IBM 7200rpm
@home). I'm still getting low numbers; check the vmstat output at the
end of the email.
But now I found something interesting: the other two disks, which are
on the standard IDE controller, work correctly (writing is at 17-22
MB/sec). The disk which doesn't work well is on the HPT366 interface,
so that may be our culprit. Now I got the idea to check patches going
backwards to see where it started behaving poorly.
Also, one more thing: I'm pretty sure that under strange circumstances
(a specific alignment of the stars) it behaves well (with the
appropriate writing speed). I just haven't yet pinpointed what needs to
be done to get to that point.
I know I haven't supplied you with a lot of information, but I'll keep
investigating until I have some more solid data on the problem.
BTW, thank you and Jens for the nice explanations of the numbers - very
good reading.
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 2 0 13208 2924 516 450716 0 0 0 11808 179 113 0 6 93
0 1 0 13208 2656 524 450964 0 0 0 8432 174 86 1 6 93
0 1 0 13208 3676 532 449924 0 0 0 8432 174 91 1 4 95
0 1 0 13208 3400 540 450172 0 0 0 8432 231 343 1 4 94
0 2 0 13208 3520 548 450036 0 0 0 8440 180 179 2 5 93
0 1 0 20216 3544 728 456976 32 0 32 8432 175 94 0 4 95
0 2 0 20212 3280 728 457232 0 0 0 8440 174 88 0 5 95
0 2 0 20208 3032 728 457480 0 0 0 8364 174 84 1 4 95
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 2 0 20208 3412 732 457092 0 0 0 6964 175 111 0 4 96
0 2 0 20208 3272 728 457224 0 0 0 1216 207 89 0 1 99
0 2 0 20208 3164 728 457352 0 0 0 1300 256 77 1 2 97
0 2 1 20208 2928 732 457604 0 0 0 1444 283 77 1 0 99
0 2 1 20208 2764 732 457732 0 0 0 1316 278 73 1 1 98
0 2 1 20208 3420 728 457096 0 0 0 1652 273 117 0 1 99
0 2 1 20208 3180 732 457348 0 0 0 1404 240 90 0 0 99
0 2 1 20208 3696 728 456840 0 0 0 1784 247 80 0 1 98
0 2 1 20204 3432 728 457096 0 0 0 1404 237 77 1 0 99
0 2 1 20204 2896 732 457604 0 0 0 1672 255 77 1 1 98
0 1 0 20204 3284 728 457224 0 0 0 1976 257 112 0 2 98
0 1 0 20204 2772 728 457736 0 0 0 7628 260 100 0 4 96
0 1 0 20204 3540 728 456968 0 0 0 8492 178 83 1 4 95
0 2 0 20204 3584 736 456916 0 0 4 4848 175 88 0 2 97
Regards,
--
Zlatko
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-28 17:34 UTC
To: Zlatko Calusic; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml
On 28 Oct 2001, Zlatko Calusic wrote:
>
> But now I found something interesting: the other two disks, which are
> on the standard IDE controller, work correctly (writing is at 17-22
> MB/sec). The disk which doesn't work well is on the HPT366 interface,
> so that may be our culprit. Now I got the idea to check patches going
> backwards to see where it started behaving poorly.
Ok. That _is_ indeed a big clue.
Do the -ac patches have any hpt366-specific stuff? Although I suspect
you're right, and that it's just the driver (or controller itself) being
very, very sensitive to some random alignment of the stars, rather than
anything in the code itself.
> 0 2 0 13208 2924 516 450716 0 0 0 11808 179 113 0 6 93
> 0 1 0 13208 2656 524 450964 0 0 0 8432 174 86 1 6 93
> 0 1 0 13208 3676 532 449924 0 0 0 8432 174 91 1 4 95
> 0 1 0 13208 3400 540 450172 0 0 0 8432 231 343 1 4 94
> 0 2 0 13208 3520 548 450036 0 0 0 8440 180 179 2 5 93
> 0 1 0 20216 3544 728 456976 32 0 32 8432 175 94 0 4 95
> 0 2 0 20212 3280 728 457232 0 0 0 8440 174 88 0 5 95
> 0 2 0 20208 3032 728 457480 0 0 0 8364 174 84 1 4 95
> procs memory swap io system cpu
> r b w swpd free buff cache si so bi bo in cs us sy id
> 0 2 0 20208 3412 732 457092 0 0 0 6964 175 111 0 4 96
> 0 2 0 20208 3272 728 457224 0 0 0 1216 207 89 0 1 99
> 0 2 0 20208 3164 728 457352 0 0 0 1300 256 77 1 2 97
> 0 2 1 20208 2928 732 457604 0 0 0 1444 283 77 1 0 99
> 0 2 1 20208 2764 732 457732 0 0 0 1316 278 73 1 1 98
So it actually slows down to just 1.5MB/s at times? That's just
disgusting. I wonder what the driver is doing..
Linus
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Alan Cox @ 2001-10-28 17:48 UTC
To: Linus Torvalds
Cc: Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
> Do the -ac patches have any hpt366-specific stuff? Although I suspect
> you're right, and that it's just the driver (or controller itself) being
The IDE code matches between the two. It isn't a driver change.
Alan
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-28 17:59 UTC
To: Alan Cox; +Cc: Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
On Sun, 28 Oct 2001, Alan Cox wrote:
>
> > Do the -ac patches have any hpt366-specific stuff? Although I suspect
> > you're right, and that it's just the driver (or controller itself) being
>
> The IDE code matches between the two. It isn't a driver change.
It might, of course, just be timing, but that sounds like a bit _too_ easy
an explanation. Even if it could easily be true.
The fact that -ac gets higher speeds, and -ac has a very different
request watermark strategy makes me suspect that that might be the cause.
In particular, the standard kernel _requires_ that in order to get good
performance you can merge many bh's onto one request. That's a very
reasonable assumption: it basically says that any high-performance driver
has to accept merging, because that in turn is required for the elevator
overhead to not grow without bounds. And if the driver doesn't accept big
requests, that driver cannot perform well because it won't have many
requests pending.
In contrast, the -ac logic says roughly "Who the hell cares if the driver
can merge requests or not, we can just give it thousands of small requests
instead, and cap the total number of _sectors_ instead of capping the
total number of requests earlier".
In my opinion, the -ac logic is really bad, but one thing it does allow is
for stupid drivers that look like high-performance drivers. Which may be
why it got implemented.
And it may be that the hpt366 IDE driver has always had this braindamage,
which the -ac code hides. Or something like this.
Does anybody know the hpt driver? Does it, for example, limit the maximum
number of sectors per merge somehow for some reason?
Jens?
Linus
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Alan Cox @ 2001-10-28 18:22 UTC
To: Linus Torvalds
Cc: Alan Cox, Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
> In contrast, the -ac logic says roughly "Who the hell cares if the driver
> can merge requests or not, we can just give it thousands of small requests
> instead, and cap the total number of _sectors_ instead of capping the
> total number of requests earlier"
If you think about it, the major resource constraint is sectors - or,
another way to think of it, "the number of pinned pages the VM cannot
rescue until the I/O is done". We also have many devices where the latency
is horribly important - IDE is one because it lacks sensible overlapping
I/O. I'm less sure what the latency trade-offs are. Fewer commands means
fewer turnarounds, so there is a counterbalance.
In the case of IDE the -ac tree will do basically the same merging - the
limitations on IDE DMA are pretty reasonable. DMA IDE has scatter-gather
tables and is actually smarter than many older SCSI controllers. The IDE
layer supports up to 128 chunks of up to just under 64Kb (it should be
64K, but some chipsets get 64K = 0 wrong and it's not pretty).
> In my opinion, the -ac logic is really bad, but one thing it does allow is
> for stupid drivers that look like high-performance drivers. Which may be
> why it got implemented.
Well I'm all for making dumb hardware go as fast as smart stuff but that
wasn't the original goal - the original goal was to fix the bad behaviour
with the base kernel and large I/O queues to slow devices like M/O disks.
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Linus Torvalds @ 2001-10-28 18:46 UTC
To: Alan Cox; +Cc: Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
On Sun, 28 Oct 2001, Alan Cox wrote:
>
> > In contrast, the -ac logic says roughly "Who the hell cares if the driver
> > can merge requests or not, we can just give it thousands of small requests
> > instead, and cap the total number of _sectors_ instead of capping the
> > total number of requests earlier"
>
> If you think about it, the major resource constraint is sectors - or,
> another way to think of it, "the number of pinned pages the VM cannot
> rescue until the I/O is done".
Yes. But that's a VM decision, and that's a decision the VM _can_ and does
make. At least in newer VM's.
So counting sectors is only hiding problems at a higher level, and it's
hiding problems that the higher level can know about.
In contrast, one thing that the higher level _cannot_ know about is the
latency of the request queue, because that latency depends on the layout
of the requests. Contiguous requests are fast, seeks are slow. So the
number of requests (as long as they aren't infinitely sized) fairly well
approximates the latency.
Note that you are certainly right that the Linux VM system did not use to
be very good at throttling, and you could make it try to write out all of
memory on small machines. But that's really a VM issue.
(And have we had VM's that tried to push all of memory onto the disk, and
then returned Out-of-Memory when all pages were locked? Sure we have. But
I know mine doesn't, don't know about yours).
> We also have many devices where the latency is horribly important - IDE
> is one because it lacks sensible overlapping I/O. I'm less sure what the
> latency trade-offs are. Fewer commands means fewer turnarounds, so there
> is a counterbalance.
Note that from a latency standpoint, you only need to have enough requests
to fill the queue - and right now we have a total of 128 requests, of
which half a for reads, and half are for the watermarking, so you end up
having 32 requests "in flight" while you refill the queue.
Which is _plenty_. Because each request can be 255 sectors (or 128,
depending on where the limit is today ;), which means that if you actually
have something throughput-limited, you can certainly keep the disk busy.
(And if the requests aren't localized enough to coalesce well, you cannot
keep the disk at platter-speed _anyway_, plus the requests will take
longer to process, so you'll have even more time to fill the queue).
The important part for real throughput is not to have thousands of
requests in flight, but to have _big_enough_ requests in flight. You can
keep even a fast disk busy with just a few requests, if you just keep
refilling them quickly enough and if they are _big_ enough.
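A back-of-envelope check of that claim, using numbers from this thread
(255 sectors of 512 bytes per fully-merged request; ~22 MB/s is the
platter speed of Zlatko's drive at home):

#include <stdio.h>

int main(void)
{
	double in_flight = 32 * 255 * 512.0;	/* 32 requests, fully merged */
	double platter = 22e6;			/* bytes/sec, streaming writes */

	printf("data in flight: %.1f MB\n", in_flight / 1e6);	/* ~4.2 MB */
	printf("time to drain : %.0f ms\n",			/* ~190 ms */
	       1000 * in_flight / platter);
	return 0;
}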
> In the case of IDE the -ac tree will do basically the same merging - the
> limitations on IDE DMA are pretty reasonable. DMA IDE has scatter-gather
> tables and is actually smarter than many older SCSI controllers. The IDE
> layer supports up to 128 chunks of up to just under 64Kb (it should be
> 64K, but some chipsets get 64K = 0 wrong and it's not pretty).
Yes. My question is more: does the hpt366 thing limit the queueing in
some way?
> Well I'm all for making dumb hardware go as fast as smart stuff but that
> wasn't the original goal - the original goal was to fix the bad behaviour
> with the base kernel and large I/O queues to slow devices like M/O disks.
Now, that's a _latency_ issue, and should be fixed by having the max
number of requests (and the max _size_ of a request too) be a per-queue
thing.
But notice how that actually doesn't have anything to do with memory size,
and makes your "scale by max memory" thing illogical.
Linus
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
From: Andrew Morton @ 2001-10-28 18:56 UTC
To: Linus Torvalds
Cc: Alan Cox, Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
Linus Torvalds wrote:
>
> And it may be that the hpt366 IDE driver has always had this braindamage,
> which the -ac code hides. Or something like this.
>
My hpt366, running stock 2.4.14-pre3, performs OK.
time ( dd if=/dev/zero of=foo bs=10240k count=100 ; sync )
takes 35 seconds (30 megs/sec). It's the same on current -ac kernels.
Maybe Zlatko's drive stopped doing DMA?
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
2001-10-28 17:30 ` Zlatko Calusic
2001-10-28 17:34 ` Linus Torvalds
@ 2001-10-28 19:13 ` Barry K. Nathan
2001-10-28 21:42 ` Jonathan Morton
2001-11-02 5:52 ` Zlatko's I/O slowdown status Andrea Arcangeli
2 siblings, 1 reply; 37+ messages in thread
From: Barry K. Nathan @ 2001-10-28 19:13 UTC (permalink / raw)
To: zlatko.calusic
Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
> Unfortunately, things didn't change on my first disk (IBM 7200rpm
> @home). I'm still getting low numbers, check the vmstat output at the
> end of the email.
>
> But now I found something interesting: the other two disks, which are on
> the standard IDE controller, work correctly (writing at 17-22
> MB/sec). The disk which doesn't work well is on the HPT366 interface,
> so that may be our culprit. Now I got the idea to walk back through the
> patches to see where it started behaving poorly.
>
> Also, one more thing, I'm pretty sure that under strange circumstances
> (specific alignment of stars) it behaves well (with appropriate
> writing speed). I just haven't yet pinpointed what needs to be done to
> get to that point.
I didn't read the entire thread, so this is a bit of a stab in the dark,
but:
This really reminds me of a problem I once had with a hard drive of
mine. It would usually go at 15-20MB/sec, but sometimes (under both
Linux and Windows) would slow down to maybe 350KB/sec. The slowdown, or
lack thereof, did seem to depend on the alignment of the stars. I lived
with it for a number of months, then started getting intermittent I/O
errors as well, as if the drive had bad sectors on disk.
The problem turned out to be insufficient ventilation for the controller
board on the bottom of the drive -- it was in the lowest 3.5" drive bay
in my case, so the bottom of the drive was snuggled next to a piece of
metal with ventilation holes. The holes were rather large (maybe 0.5"
diameter) -- and so were the areas without holes. Guess where one of the
drive's controller chips happened to be positioned, relative to the
holes? :( Moving the drive up a bit in the case, so as to allow 0.5"-1"
of space for air beneath the drive, fixed the problem (both the slowdown
and the I/O errors).
I don't know if this is your problem, but I'm mentioning it just in
case it is...
-Barry K. Nathan <barryn@pobox.com>
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
2001-10-28 18:46 ` Linus Torvalds
@ 2001-10-28 19:29 ` Alan Cox
0 siblings, 0 replies; 37+ messages in thread
From: Alan Cox @ 2001-10-28 19:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
> Yes. My question is more: does the hpt366 thing limit the queueing in some
> way?
Nope. The HPT366 is a bog-standard DMA IDE controller. At least, unless Andre
can point out something I've forgotten, any behaviour seen on it should be
the same as seen on any other IDE controller with DMA support.
In practical terms, that means you should be able to observe the same HPT366
problem he does on whatever random IDE controller is on your desktop box.
> But notice how that actually doesn't have anything to do with memory size,
> and makes your "scale by max memory" thing illogical.
When you are dealing with the VM limit the limiter was originally
added for, it makes a lot of sense. When you want to use it solely for
other purposes, it doesn't.
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
2001-10-28 19:13 ` Barry K. Nathan
@ 2001-10-28 21:42 ` Jonathan Morton
0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Morton @ 2001-10-28 21:42 UTC (permalink / raw)
To: barryn, zlatko.calusic
Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
>> Unfortunately, things didn't change on my first disk (IBM 7200rpm
>> @home). I'm still getting low numbers, check the vmstat output at the
>> end of the email.
>>
>> But now I found something interesting: the other two disks, which are on
>> the standard IDE controller, work correctly (writing at 17-22
>> MB/sec). The disk which doesn't work well is on the HPT366 interface,
>> so that may be our culprit. Now I got the idea to walk back through the
>> patches to see where it started behaving poorly.
>This really reminds me of a problem I once had with a hard drive of
>mine. It would usually go at 15-20MB/sec, but sometimes (under both
>Linux and Windows) would slow down to maybe 350KB/sec. The slowdown, or
>lack thereof, did seem to depend on the alignment of the stars. I lived
>with it for a number of months, then started getting intermittent I/O
>errors as well, as if the drive had bad sectors on disk.
>
>The problem turned out to be insufficient ventilation for the controller
>board on the bottom of the drive
As an extra data point, my IBM Deskstar 60GXP (40GB version) writes
slightly slower than it reads. This is on a VIA 686a controller,
UDMA/66 active. The drive also has plenty of air around it, sitting
in a 5.25" bracket with fans in front.
Writing 1GB from /dev/zero takes 34.27s = 29.88MB/sec, 19% CPU
Reading 1GB from test file takes 29.64s = 34.58MB/sec, 18% CPU
Hmm, that's almost as fast as the 10000rpm Ultrastar sitting just above
it, but with higher CPU usage. The Ultrastar gets 36MB/sec on reading
with hdparm; I haven't tested write performance due to probable
fragmentation.
Both tests were conducted using 'dd bs=1k' on my 1GHz Athlon with 256MB
RAM. The test file is on a freshly-created ext2 filesystem starting
10GB into the 40GB drive (knowing IBM's recent trend, this'll still
be fairly close to the outer rim). The write test includes a sync at the
end. Kernel is Linus 2.4.9, no relevant patches.
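Roughly reconstructing the tests as commands (file name assumed; 1GB here
means 2^20 1k blocks):
# write test: 1GB from /dev/zero, timed including the final sync
time sh -c 'dd if=/dev/zero of=testfile bs=1k count=1048576 && sync'
# read test: the same file back to /dev/null
time dd if=testfile of=/dev/null bs=1k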
--
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: chromi@cyberspace.org (not for attachments)
website: http://www.chromatix.uklinux.net/vnc/
geekcode: GCS$/E dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$
V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*)
tagline: The key to knowledge is not to rely on people to teach you it.
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
2001-10-28 17:59 ` Linus Torvalds
2001-10-28 18:22 ` Alan Cox
2001-10-28 18:56 ` Andrew Morton
@ 2001-10-30 8:56 ` Jens Axboe
2001-10-30 9:26 ` Zlatko Calusic
2 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2001-10-30 8:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, Zlatko Calusic, Marcelo Tosatti, linux-mm, lkml
On Sun, Oct 28 2001, Linus Torvalds wrote:
>
> On Sun, 28 Oct 2001, Alan Cox wrote:
> >
> > > Do the -ac patches have any hpt366-specific stuff? Although I suspect
> > > you're right, and that it's just the driver (or controller itself) being
> >
> > The IDE code matches between the two. It isn't a driver change
>
> It might, of course, just be timing, but that sounds like a bit _too_ easy
> an explanation. Even if it could easily be true.
>
> The fact that -ac gets higher speeds, and -ac has a very different
> request watermark strategy makes me suspect that that might be the cause.
>
> In particular, the standard kernel _requires_ that in order to get good
> performance you can merge many bh's onto one request. That's a very
> reasonable assumption: it basically says that any high-performance driver
> has to accept merging, because that in turn is required for the elevator
> overhead to not grow without bounds. And if the driver doesn't accept big
> requests, that driver cannot perform well because it won't have many
> requests pending.
Nod
> In contrast, the -ac logic says roughly "Who the hell cares if the driver
> can merge requests or not, we can just give it thousands of small requests
> instead, and cap the total number of _sectors_ instead of capping the
> total number of requests earlier".
Not true, that was not the intended goal. We always want the driver to
get merged requests, even if we can have ridiculously large queue
lengths. The large queues were a benchmark win (blush), since they allowed
the elevator to reorder seeks across a big benchmark run efficiently. I've
since done more real-life testing, and I don't think it matters too much
here; in fact it only seems to incur greater latency and starvation.
> In my opinion, the -ac logic is really bad, but one thing it does allow is
> for stupid drivers that look like high-performance drivers. Which may be
> why it got implemented.
Don't confuse the larger queues with an unwillingness to merge; that is
not the case.
> And it may be that the hpt366 IDE driver has always had this braindamage,
> which the -ac code hides. Or something like this.
>
> Does anybody know the hpt driver? Does it, for example, limit the maximum
> number of sectors per merge somehow for some reason?
The hpt366 driver has no special workarounds and doesn't disable anything,
so it can't be anything like that.
--
Jens Axboe
* Re: xmm2 - monitor Linux MM active/inactive lists graphically
2001-10-30 8:56 ` Jens Axboe
@ 2001-10-30 9:26 ` Zlatko Calusic
0 siblings, 0 replies; 37+ messages in thread
From: Zlatko Calusic @ 2001-10-30 9:26 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, Alan Cox, Marcelo Tosatti, linux-mm, lkml
Jens Axboe <axboe@suse.de> writes:
> The hpt366 driver has no special workarounds and doesn't disable anything,
> so it can't be anything like that.
>
A followup on the problem. Yesterday I was upgrading my Debian Linux. To
do that I had to remount /usr read-write. After the update finished,
I tested the disk writing speed once again. And there it was: a full
22MB/sec (on the same partition). And once I get to that point, the
disk stays fast. So I thought (poor man's logic) that the poor
performance might have something to do with my /usr being mounted
read-only (BTW, it's on the same disk I'm having problems with).
Quick test: reboot (/usr is ro), check speed -> only 8MB/sec; remount
/usr rw - but unfortunately that didn't help, the writing speed stays low.
So it was just an idea. I still don't know what can be done to return
the speed to normal. I don't know if I have mentioned it, but reading
from the same disk always goes at full speed.
Also, I'm pretty sure that I have the same problem on a completely
different machine (at work) which doesn't use the HPT366, but a standard
controller (BX chipset).
So, something might be wrong with my setup, but I'm still unable to
find out what.
I'm compiling with gcc 2.95.4 20011006 (Debian prerelease) from the Debian
unstable distribution. The kernel is completely monolithic (no modules).
Attached is the _relevant_ part of the IDE configuration.
#
# ATA/IDE/MFM/RLL support
#
CONFIG_IDE=y
#
# IDE, ATA and ATAPI Block devices
#
CONFIG_BLK_DEV_IDE=y
#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_HD_IDE is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
# CONFIG_BLK_DEV_IDEDISK_VENDOR is not set
# CONFIG_BLK_DEV_IDEDISK_FUJITSU is not set
# CONFIG_BLK_DEV_IDEDISK_IBM is not set
# CONFIG_BLK_DEV_IDEDISK_MAXTOR is not set
# CONFIG_BLK_DEV_IDEDISK_QUANTUM is not set
# CONFIG_BLK_DEV_IDEDISK_SEAGATE is not set
# CONFIG_BLK_DEV_IDEDISK_WD is not set
# CONFIG_BLK_DEV_COMMERIAL is not set
# CONFIG_BLK_DEV_TIVO is not set
# CONFIG_BLK_DEV_IDECS is not set
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
# CONFIG_BLK_DEV_IDESCSI is not set
#
# IDE chipset support/bugfixes
#
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_CMD640_ENHANCED is not set
# CONFIG_BLK_DEV_ISAPNP is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_BLK_DEV_ADMA=y
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_IDEDMA_PCI_AUTO=y
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_IDEDMA_PCI_WIP=y
# CONFIG_IDEDMA_NEW_DRIVE_LISTINGS is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_AEC62XX_TUNING is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_WDC_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_AMD74XX_OVERRIDE is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_HPT34X_AUTODMA is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_PIIX is not set
# CONFIG_PIIX_TUNING is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_PDC202XX is not set
# CONFIG_PDC202XX_BURST is not set
# CONFIG_PDC202XX_FORCE is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_CHIPSETS is not set
CONFIG_IDEDMA_AUTO=y
# CONFIG_IDEDMA_IVB is not set
# CONFIG_DMA_NONPCI is not set
# CONFIG_BLK_DEV_IDE_MODES is not set
# CONFIG_BLK_DEV_ATARAID is not set
# CONFIG_BLK_DEV_ATARAID_PDC is not set
# CONFIG_BLK_DEV_ATARAID_HPT is not set
--
Zlatko
* Zlatko's I/O slowdown status
2001-10-28 17:30 ` Zlatko Calusic
2001-10-28 17:34 ` Linus Torvalds
2001-10-28 19:13 ` Barry K. Nathan
@ 2001-11-02 5:52 ` Andrea Arcangeli
2001-11-02 20:14 ` Zlatko Calusic
2 siblings, 1 reply; 37+ messages in thread
From: Andrea Arcangeli @ 2001-11-02 5:52 UTC (permalink / raw)
To: Zlatko Calusic
Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
Hello Zlatko,
I'm not sure how the email thread ended, but I noticed different
unplugging of the I/O queues in mainline (mainline was a little more
overkill than -ac) and also wrong bdflush hysteresis (the pre-wakeup of
bdflush - to avoid blocking when the write flood could be sustained by
the bandwidth of the HD - was missing, for example).
So you may want to give pre6aa1 a spin and see if it makes any
difference; if it does, I'll know what your problem is
(see the buffer.c part of the vm-10 patch in pre6aa1 for more details).
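(If you want to watch the bdflush thresholds while testing, they are all
visible in one place on 2.4 kernels - the field meanings are documented
in Documentation/sysctl/vm.txt:)
# the bdflush tunables, nfract/ndirty/etc., as one line of numbers
cat /proc/sys/vm/bdflush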
thanks,
Andrea
* Re: Zlatko's I/O slowdown status
2001-11-02 5:52 ` Zlatko's I/O slowdown status Andrea Arcangeli
@ 2001-11-02 20:14 ` Zlatko Calusic
2001-11-02 20:26 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Zlatko Calusic @ 2001-11-02 20:14 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
Andrea Arcangeli <andrea@suse.de> writes:
> Hello Zlatko,
>
> I'm not sure how the email thread ended, but I noticed different
> unplugging of the I/O queues in mainline (mainline was a little more
> overkill than -ac) and also wrong bdflush hysteresis (the pre-wakeup of
> bdflush - to avoid blocking when the write flood could be sustained by
> the bandwidth of the HD - was missing, for example).
Thank God, today it is finally solved. Just two days ago, I was pretty
sure the disk had started dying on me, and I didn't know of any
solution for that. Today, while I was about to try your patch, I got
another idea and finally pinpointed the problem.
It was write caching. Somehow the disk was running with the write cache
turned off, and I was getting abysmal write performance. Then I found
'hdparm -W0 /proc/ide/hd*' in /etc/init.d/umountfs, which is run during
shutdown - but I don't understand how it survived reboots and restarts!
Nor why only two of the four disks I'm dealing with got confused by the
command. And finally, I don't understand how I could still get full
speed occasionally. Weird!
I would advise users of Debian unstable to comment out that part; I'm
sure it's useless on most if not all setups. You might be pleasantly
surprised by the performance gains (write speed doubles).
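For anyone who wants to check their own drives first, something along
these lines should work (device name assumed, and -I needs a reasonably
recent hdparm):
hdparm -I /dev/hda | grep -i 'write cache'   # is the write cache enabled?
hdparm -W1 /dev/hda                          # re-enable it if not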
>
> So you may want to give pre6aa1 a spin and see if it makes any
> difference; if it does, I'll know what your problem is
> (see the buffer.c part of the vm-10 patch in pre6aa1 for more details).
>
Thanks for your concern. Eventually I compiled aa1 and it is running
correctly (a whole day at work, and the last hour at home - SMP), although
I don't see any performance improvements now.
I would like to thank all the others who spent time helping me,
especially Linus, Jens and Marcelo - sorry for taking up your time, guys.
--
Zlatko
* Re: Zlatko's I/O slowdown status
2001-11-02 20:14 ` Zlatko Calusic
@ 2001-11-02 20:26 ` Rik van Riel
2001-11-02 21:22 ` Zlatko Calusic
2001-11-02 20:57 ` Andrea Arcangeli
2001-11-02 23:23 ` Simon Kirby
2 siblings, 1 reply; 37+ messages in thread
From: Rik van Riel @ 2001-11-02 20:26 UTC (permalink / raw)
To: Zlatko Calusic
Cc: Andrea Arcangeli, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
On 2 Nov 2001, Zlatko Calusic wrote:
> It was write caching. Somehow the disk was running with the write cache
> turned off, and I was getting abysmal write performance. Then I found
> 'hdparm -W0 /proc/ide/hd*' in /etc/init.d/umountfs, which is run during
> shutdown
>
> I would advise users of Debian unstable to comment out that part,
Why do you want Debian users to lose their data? ;)
The 'hdparm -W0' is useful for getting the drive to flush
the data out to disk instead of letting it linger around
in the drive cache.
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
* Re: Zlatko's I/O slowdown status
2001-11-02 20:14 ` Zlatko Calusic
2001-11-02 20:26 ` Rik van Riel
@ 2001-11-02 20:57 ` Andrea Arcangeli
2001-11-02 23:23 ` Simon Kirby
2 siblings, 0 replies; 37+ messages in thread
From: Andrea Arcangeli @ 2001-11-02 20:57 UTC (permalink / raw)
To: Zlatko Calusic
Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
On Fri, Nov 02, 2001 at 09:14:14PM +0100, Zlatko Calusic wrote:
> It was write caching. Somehow disk was running with write cache turned
Ah, I was going to ask you to try:
/sbin/hdparm -d1 -u1 -W1 -c1 /dev/hda
(my settings - of course not safe for a journaling fs, only safe to use
with ext2, and I set -W0 back during /etc/init.d/halt) but I assumed you
were using the same hdparm settings in -ac and mainline. Never mind, good
that it's solved now :).
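For reference, those flags break down as follows (device name assumed;
run them separately or combined, the effect is the same):
hdparm -d1 /dev/hda   # enable DMA for the drive
hdparm -u1 /dev/hda   # unmask other interrupts during disk I/O
hdparm -W1 /dev/hda   # turn the drive's write cache on
hdparm -c1 /dev/hda   # enable 32-bit I/O support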
Andrea
* Re: Zlatko's I/O slowdown status
2001-11-02 20:26 ` Rik van Riel
@ 2001-11-02 21:22 ` Zlatko Calusic
0 siblings, 0 replies; 37+ messages in thread
From: Zlatko Calusic @ 2001-11-02 21:22 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, Jens Axboe, Marcelo Tosatti, linux-mm, lkml
Rik van Riel <riel@conectiva.com.br> writes:
> On 2 Nov 2001, Zlatko Calusic wrote:
>
> > It was write caching. Somehow disk was running with write cache turned
> > off and I was getting abysmal write performance. Then I found hdparm
> > -W0 /proc/ide/hd* in /etc/init.d/umountfs which is ran during shutdown
> >
> > I would advise users of Debian unstable to comment that part,
>
> Why do you want Debian users to loose their data ? ;)
Those few lines of code are a recent addition to Debian. They never
existed before, so do you mean Debian was buggy for years
and people lost massive amounts of data because of it? :)
No, really, I use poweroff on my computer, and not once have I had a
problem with losing data after a poweroff (and I'm talking about
thousands of poweroffs). But I do have a problem with bad performance. :)
>
> The 'hdparm -W0' is useful in getting the drive to flush
> out the data to disk instead of having it linger around
> in the drive cache.
>
Yes, I know, but it's not THAT important - otherwise it wouldn't have
been missing from the init script for so many years.
Anyway, this whole debate probably points to the real problem: a missing
hdparm -W1 in the startup init script. IDE drives really behave
poorly without write caching, and there's nothing we can do about
that besides turning it on and praying to God we don't have too many
power outages. :)
--
Zlatko
* Re: Zlatko's I/O slowdown status
2001-11-02 20:14 ` Zlatko Calusic
2001-11-02 20:26 ` Rik van Riel
2001-11-02 20:57 ` Andrea Arcangeli
@ 2001-11-02 23:23 ` Simon Kirby
2 siblings, 0 replies; 37+ messages in thread
From: Simon Kirby @ 2001-11-02 23:23 UTC (permalink / raw)
To: Zlatko Calusic
Cc: Andrea Arcangeli, Linus Torvalds, Jens Axboe, Marcelo Tosatti,
linux-mm, lkml
On Fri, Nov 02, 2001 at 09:14:14PM +0100, Zlatko Calusic wrote:
> Thank God, today it is finally solved. Just two days ago, I was pretty
> sure the disk had started dying on me, and I didn't know of any
> solution for that. Today, while I was about to try your patch, I got
> another idea and finally pinpointed the problem.
>
> It was write caching. Somehow the disk was running with the write cache
> turned off, and I was getting abysmal write performance. Then I found
> 'hdparm -W0 /proc/ide/hd*' in /etc/init.d/umountfs, which is run during
> shutdown - but I don't understand how it survived reboots and restarts!
> Nor why only two of the four disks I'm dealing with got confused by the
> command. And finally, I don't understand how I could still get full
> speed occasionally. Weird!
>
> I would advise users of Debian unstable to comment out that part; I'm
> sure it's useless on most if not all setups. You might be pleasantly
> surprised by the performance gains (write speed doubles).
Aha! That would explain why I was seeing it as well... and why I was
seeing errors from hdparm for /dev/hdc and /dev/hdd, which are CDROMs.
Argh. :)
If they run hdparm -W 0 at shutdown, there should be a matching -W 1
at startup.
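A minimal boot-time fragment along those lines might look like this (a
sketch only - the device list is assumed, and this is not Debian's actual
script):
#!/bin/sh
# re-enable the on-drive write cache for each disk at startup;
# errors from non-disks (e.g. CDROMs on hdc/hdd) are discarded
for dev in /dev/hda /dev/hdb /dev/hdc /dev/hdd; do
    [ -b "$dev" ] && hdparm -W1 "$dev" 2>/dev/null
done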
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ sim@stormix.com ][ sim@netnation.com ]
[ Opinions expressed are not necessarily those of my employers. ]
Thread overview: 37+ messages
2001-10-24 10:42 xmm2 - monitor Linux MM active/inactive lists graphically Zlatko Calusic
2001-10-24 14:26 ` Marcelo Tosatti
2001-10-25 0:25 ` Zlatko Calusic
2001-10-25 4:19 ` Linus Torvalds
2001-10-25 4:57 ` Linus Torvalds
2001-10-25 12:48 ` Zlatko Calusic
2001-10-25 16:31 ` Linus Torvalds
2001-10-25 17:33 ` Jens Axboe
2001-10-26 9:45 ` Zlatko Calusic
2001-10-26 10:08 ` Zlatko Calusic
2001-10-26 14:39 ` Jens Axboe
2001-10-26 14:57 ` Zlatko Calusic
2001-10-26 15:01 ` Jens Axboe
2001-10-26 16:04 ` Linus Torvalds
2001-10-26 16:57 ` Linus Torvalds
2001-10-26 17:19 ` Linus Torvalds
2001-10-28 17:30 ` Zlatko Calusic
2001-10-28 17:34 ` Linus Torvalds
2001-10-28 17:48 ` Alan Cox
2001-10-28 17:59 ` Linus Torvalds
2001-10-28 18:22 ` Alan Cox
2001-10-28 18:46 ` Linus Torvalds
2001-10-28 19:29 ` Alan Cox
2001-10-28 18:56 ` Andrew Morton
2001-10-30 8:56 ` Jens Axboe
2001-10-30 9:26 ` Zlatko Calusic
2001-10-28 19:13 ` Barry K. Nathan
2001-10-28 21:42 ` Jonathan Morton
2001-11-02 5:52 ` Zlatko's I/O slowdown status Andrea Arcangeli
2001-11-02 20:14 ` Zlatko Calusic
2001-11-02 20:26 ` Rik van Riel
2001-11-02 21:22 ` Zlatko Calusic
2001-11-02 20:57 ` Andrea Arcangeli
2001-11-02 23:23 ` Simon Kirby
2001-10-27 13:14 ` xmm2 - monitor Linux MM active/inactive lists graphically Giuliano Pochini
2001-10-28 5:05 ` Mike Fedyk
2001-10-25 9:07 ` Zlatko Calusic