linux-mm.kvack.org archive mirror
* Re: Increase dirty_ratio and dirty_background_ratio?
       [not found] <20090107154517.GA5565@duck.suse.cz>
@ 2009-01-07 16:25 ` Peter Zijlstra
  2009-01-07 16:39   ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2009-01-07 16:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-mm, Linus Torvalds, Nick Piggin

On Wed, 2009-01-07 at 16:45 +0100, Jan Kara wrote:
> Hi,
> 
>   I'm writing mainly to gather opinions of clever people here ;). In commit
> 07db59bd6b0f279c31044cba6787344f63be87ea (in April 2007) Linus
> decreased the default /proc/sys/vm/dirty_ratio from 40 to 10 and
> /proc/sys/vm/dirty_background_ratio from 10 to 5.

> While tracking
> performance regressions in SLES11 wrt SLES10 we noted that this has severely
> affected performance of some workloads using Berkeley DB (basically because
> the database creates a file almost as big as available memory, mmaps it and
> randomly scribbles all over it; with the lower limits it gets throttled much
> earlier and pdflush is more aggressive about writing data back, which is
> counterproductive in this particular case).

>   So the question is: What kind of workloads are lower limits supposed to
> help? Desktop? Has anybody reported that they actually help? I'm asking
> because we are probably going to increase limits to the old values for
> SLES11 if we don't see serious negative impact on other workloads...

Adding some CCs.

The idea was that 40% of the memory is a _lot_ these days, and writeback
times will be huge for those hitting sync or similar. By lowering these
you'd smooth that out a bit.
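
To put rough numbers on that point, here is a small illustrative sketch; the RAM
sizes and the 30 MiB/s "seeky writeback" throughput are assumptions for the sake
of the example, not figures from the thread:

# Rough estimate of how long a full dirty_ratio's worth of pages takes to
# write back.  All figures below are illustrative assumptions.

def flush_time(ram_gib, dirty_ratio_pct, throughput_mib_s):
    dirty_mib = ram_gib * 1024 * dirty_ratio_pct / 100
    return dirty_mib, dirty_mib / throughput_mib_s

for ram_gib in (1, 4, 16):
    for ratio in (40, 10):
        mib, secs = flush_time(ram_gib, ratio, throughput_mib_s=30)
        print(f"{ram_gib:2d} GiB RAM, dirty_ratio={ratio:2d}%: "
              f"{mib:6.0f} MiB dirty -> ~{secs:5.1f} s to write back")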



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-07 16:25 ` Increase dirty_ratio and dirty_background_ratio? Peter Zijlstra
@ 2009-01-07 16:39   ` Linus Torvalds
  2009-01-07 20:51     ` David Miller
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-01-07 16:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jan Kara, linux-kernel, linux-mm, Nick Piggin



On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> 
> >   So the question is: What kind of workloads are lower limits supposed to
> > help? Desktop? Has anybody reported that they actually help? I'm asking
> > because we are probably going to increase limits to the old values for
> > SLES11 if we don't see serious negative impact on other workloads...
> 
> Adding some CCs.
> 
> The idea was that 40% of the memory is a _lot_ these days, and writeback
> times will be huge for those hitting sync or similar. By lowering these
> you'd smooth that out a bit.

Not just a bit. If you have 4GB of RAM (not at all unusual for even just a 
regular desktop, never mind a "real" workstation), it's simply crazy to 
allow 1.5GB of dirty memory. Not unless you have a really wicked RAID 
system with great write performance that can push it out to disk (with 
seeking) in just a few seconds.

And few people have that.

For a server, where throughput matters but latency generally does not, go 
ahead and raise it. But please don't raise it for anything sane. The only 
time it makes sense upping that percentage is for some odd special-case 
benchmark that otherwise can fit the dirty data set in memory, and never 
syncs it (ie it deletes all the files after generating them).

In other words, yes, 40% dirty can make a big difference to benchmarks, 
but is almost never actually a good idea any more.

That said, the _right_ thing to do is to 

 (a) limit dirty by number of bytes (in addition to having a percentage 
     limit). Current -git adds support for that.

 (b) scale it dynamically by your IO performance. No, current -git does 
     _not_ support this.

but just upping the percentage is not a good idea.

		Linus
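
Point (a) refers to the byte-based knobs that were being added to -git at the
time, /proc/sys/vm/dirty_bytes and /proc/sys/vm/dirty_background_bytes. A minimal
sketch of using them (needs root and a kernel that has them; the 64 MiB / 256 MiB
values are arbitrary examples):

import os

# Arbitrary example values: background writeback starts at 64 MiB of dirty
# data, writers are hard-throttled at 256 MiB.
KNOBS = {
    "/proc/sys/vm/dirty_background_bytes": 64 * 1024 * 1024,
    "/proc/sys/vm/dirty_bytes": 256 * 1024 * 1024,
}

for path, value in KNOBS.items():
    if not os.path.exists(path):
        raise SystemExit(f"{path} not available on this kernel")
    with open(path, "w") as f:        # needs root
        f.write(str(value))

On kernels that have these knobs, only one of each bytes/ratio pair is in effect
at a time.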


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-07 16:39   ` Linus Torvalds
@ 2009-01-07 20:51     ` David Miller
  2009-01-08 11:02       ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2009-01-07 20:51 UTC (permalink / raw)
  To: torvalds; +Cc: peterz, jack, linux-kernel, linux-mm, npiggin

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 7 Jan 2009 08:39:01 -0800 (PST)

> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> > 
> > >   So the question is: What kind of workloads are lower limits supposed to
> > > help? Desktop? Has anybody reported that they actually help? I'm asking
> > > because we are probably going to increase limits to the old values for
> > > SLES11 if we don't see serious negative impact on other workloads...
> > 
> > Adding some CCs.
> > 
> > The idea was that 40% of the memory is a _lot_ these days, and writeback
> > times will be huge for those hitting sync or similar. By lowering these
> > you'd smooth that out a bit.
> 
> Not just a bit. If you have 4GB of RAM (not at all unusual for even just a 
> regular desktop, never mind a "real" workstation), it's simply crazy to 
> allow 1.5GB of dirty memory. Not unless you have a really wicked RAID 
> system with great write performance that can push it out to disk (with 
> seeking) in just a few seconds.
> 
> And few people have that.
> 
> For a server, where throughput matters but latency generally does not, go 
> ahead and raise it. But please don't raise it for anything sane. The only 
> time it makes sense upping that percentage is for some odd special-case 
> benchmark that otherwise can fit the dirty data set in memory, and never 
> syncs it (ie it deletes all the files after generating them).
> 
> In other words, yes, 40% dirty can make a big difference to benchmarks, 
> but is almost never actually a good idea any more.

I have to say that my workstation is still helped by reverting this
change and all I do is play around in GIT trees and read email.

It's a slow UltraSPARC-IIIi 1.5GHz machine with a very slow IDE disk
and 2GB of ram.

With the dirty ratio changeset there, I'm waiting for disk I/O
seemingly all the time.  Without it, I only feel the machine seize up
in disk I/O when I really punish it.

Maybe all the dirty I/O is from my not using 'noatime', and if that's
how I should "fix" this then we can ask why isn't it the default? :)

I did mention this when the original changeset went into the tree.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-07 20:51     ` David Miller
@ 2009-01-08 11:02       ` Andrew Morton
  2009-01-08 16:24         ` David Miller
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2009-01-08 11:02 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, peterz, jack, linux-kernel, linux-mm, npiggin

On Wed, 07 Jan 2009 12:51:33 -0800 (PST) David Miller <davem@davemloft.net> wrote:

> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed, 7 Jan 2009 08:39:01 -0800 (PST)
> 
> > On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> > > 
> > > >   So the question is: What kind of workloads are lower limits supposed to
> > > > help? Desktop? Has anybody reported that they actually help? I'm asking
> > > > because we are probably going to increase limits to the old values for
> > > > SLES11 if we don't see serious negative impact on other workloads...
> > > 
> > > Adding some CCs.
> > > 
> > > The idea was that 40% of the memory is a _lot_ these days, and writeback
> > > times will be huge for those hitting sync or similar. By lowering these
> > > you'd smooth that out a bit.
> > 
> > Not just a bit. If you have 4GB of RAM (not at all unusual for even just a 
> > regular desktop, never mind a "real" workstation), it's simply crazy to 
> > allow 1.5GB of dirty memory. Not unless you have a really wicked RAID 
> > system with great write performance that can push it out to disk (with 
> > seeking) in just a few seconds.
> > 
> > And few people have that.
> > 
> > For a server, where throughput matters but latency generally does not, go 
> > ahead and raise it. But please don't raise it for anything sane. The only 
> > time it makes sense upping that percentage is for some odd special-case 
> > benchmark that otherwise can fit the dirty data set in memory, and never 
> > syncs it (ie it deletes all the files after generating them).
> > 
> > In other words, yes, 40% dirty can make a big difference to benchmarks, 
> > but is almost never actually a good idea any more.
> 
> I have to say that my workstation is still helped by reverting this
> change and all I do is play around in GIT trees and read email.
> 

The kernel can't get this right - it doesn't know the usage
patterns/workloads, etc.  It's rather disappointing that distros appear
to have put so little work into finding ways of setting suitable values
for this, and for other tunables.

Maybe we should set them to 1%, or 99% or something similarly stupid to
force the issue.

yes, perhaps the kernel's default percentage should be larger on
smaller-memory systems.  And smaller on slow-disk systems.  etc.  But
initscripts already have all the information to do this, and have the
advantage that any such scripts are backportable to five-year-old kernels.

So I say leave it as-is.  If suse can come up with a scriptlet which scales
this according to memory size, disk speed, workload, etc then good for
them - it'll produce a better end result.
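
Such a scriptlet can be tiny. A sketch of the idea (the breakpoints below are
invented for illustration, not a recommendation, and it has to run as root):

# Pick dirty limits from the amount of RAM, in the spirit of the initscript
# suggestion above.  The breakpoints are made up for illustration.

def mem_total_kib():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise RuntimeError("MemTotal not found in /proc/meminfo")

def choose_ratios(total_kib):
    total_gib = total_kib / (1024 * 1024)
    if total_gib <= 1:
        return 10, 40    # small box: the old, larger percentages
    if total_gib <= 4:
        return 5, 20
    return 2, 10         # lots of RAM: a small percentage is already many bytes

background, foreground = choose_ratios(mem_total_kib())
with open("/proc/sys/vm/dirty_background_ratio", "w") as f:
    f.write(str(background))
with open("/proc/sys/vm/dirty_ratio", "w") as f:
    f.write(str(foreground))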


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-08 11:02       ` Andrew Morton
@ 2009-01-08 16:24         ` David Miller
  2009-01-08 16:48           ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2009-01-08 16:24 UTC (permalink / raw)
  To: akpm; +Cc: torvalds, peterz, jack, linux-kernel, linux-mm, npiggin

From: Andrew Morton <akpm@linux-foundation.org>
Date: Thu, 8 Jan 2009 03:02:45 -0800

> The kernel can't get this right - it doesn't know the usage
> patterns/workloads, etc.

I don't agree with that.

The kernel is watching and gets to see every operation that happens
both to memory and to the disk, so of course it can see what
the "patterns" and the "workload" are.

It also can see how fast or slow the disk technology is.  And I think
that is one of the largest determinants of what these values should
be set to.

So, in fact, the kernel is the place that has all of the information
necessary to try and adjust these settings dynamically.

Userland can only approximate a good setting, at best, because it has
so many fewer pieces of information to work with.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-08 16:24         ` David Miller
@ 2009-01-08 16:48           ` Linus Torvalds
  2009-01-08 16:55             ` Chris Mason
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-01-08 16:48 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, peterz, jack, linux-kernel, linux-mm, npiggin



On Thu, 8 Jan 2009, David Miller wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Thu, 8 Jan 2009 03:02:45 -0800
> 
> > The kernel can't get this right - it doesn't know the usage
> > patterns/workloads, etc.
> 
> I don't agree with that.

We can certainly try to tune it better. 

And I do agree that we did a very drastic reduction in the dirty limits, 
and we can probably look at raising it up a bit. I definitely do not want 
to go back to the old 40% dirty model, but I could imagine 10/20% for 
async/sync (it's 5/10 now, isn't it?)

But I do not want to be guided by benchmarks per se, unless they are 
latency-sensitive. And one of the reasons for the drastic reduction was 
that there was actually a real deadlock situation with the old limits, 
although we solved that one twice - first by reducing the limits 
drastically, and then by making them be relative to the non-highmem memory 
(rather than all of it).

So in effect, we actually reduced the limits more than originally 
intended, although that particular effect should be noticeable mainly just 
on 32-bit x86.

I'm certainly open to tuning. As long as "tuning" doesn't involve 
something insane like dbench numbers.

			Linus


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-08 16:48           ` Linus Torvalds
@ 2009-01-08 16:55             ` Chris Mason
  2009-01-08 17:05               ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Mason @ 2009-01-08 16:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, akpm, peterz, jack, linux-kernel, linux-mm, npiggin

On Thu, 2009-01-08 at 08:48 -0800, Linus Torvalds wrote:
> 
> On Thu, 8 Jan 2009, David Miller wrote:
> 
> > From: Andrew Morton <akpm@linux-foundation.org>
> > Date: Thu, 8 Jan 2009 03:02:45 -0800
> > 
> > > The kernel can't get this right - it doesn't know the usage
> > > patterns/workloads, etc.
> > 
> > I don't agree with that.
> 
> We can certainly try to tune it better. 
> 

Does it make sense to hook into kupdate?  If kupdate finds it can't meet
the no-data-older-than 30 seconds target, it lowers the sync/async combo
down to some reasonable bottom.  

If it finds it is going to sleep without missing the target, raise the
combo up to some reasonable top.

-chris
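
The feedback rule Chris describes, as a toy model (the step size, floor and
ceiling are invented, and the kernel-side measurement is reduced to a single
boolean argument):

# One adjustment per kupdate pass: lower the limit when the 30-second
# no-data-older-than target was missed, raise it when kupdate is about to
# sleep with the target met.

def adjust_dirty_ratio(ratio, met_age_target, floor=5, ceiling=40, step=5):
    if not met_age_target:
        return max(floor, ratio - step)
    return min(ceiling, ratio + step)

# Example run: the target is missed three passes in a row, then met again.
ratio = 40
for met in (True, False, False, False, True, True):
    ratio = adjust_dirty_ratio(ratio, met)
    print("dirty_ratio ->", ratio)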



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-08 16:55             ` Chris Mason
@ 2009-01-08 17:05               ` Linus Torvalds
  2009-01-08 19:57                 ` Jan Kara
  2009-01-14  3:29                 ` Nick Piggin
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2009-01-08 17:05 UTC (permalink / raw)
  To: Chris Mason
  Cc: David Miller, akpm, peterz, jack, linux-kernel, linux-mm, npiggin



On Thu, 8 Jan 2009, Chris Mason wrote:
> 
> Does it make sense to hook into kupdate?  If kupdate finds it can't meet
> the no-data-older-than 30 seconds target, it lowers the sync/async combo
> down to some reasonable bottom.  
> 
> If it finds it is going to sleep without missing the target, raise the
> combo up to some reasonable top.

I like autotuning, so that sounds like an intriguing approach. It's worked 
for us before (ie VM).

That said, 30 seconds sounds like a _loong_ time for something like this. 
I'd use the normal 5-second dirty_writeback_interval for this: if we can't 
clean the whole queue in that normal background writeback interval, then 
we try to lower the targets. We already have that "congestion_wait()" thing
there, that would be a logical place, methinks.

I'm not sure how to raise them, though. We don't want to raise any limits 
just because the user suddenly went idle. I think the raising should 
happen if we hit the sync/async ratio, and we haven't lowered in the last 
30 seconds or something like that.

		Linus
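
Linus's refinement could look roughly like this (a toy model only; the constants,
class structure and hook names are illustrative, not a proposed patch):

import time

class DirtyTargetTuner:
    def __init__(self, ratio=10, floor=5, ceiling=40, step=2, holdoff=30.0):
        self.ratio = ratio
        self.floor, self.ceiling, self.step = floor, ceiling, step
        self.holdoff = holdoff              # seconds to wait after a lowering
        self.last_lowered = float("-inf")

    def background_pass_incomplete(self):
        # Called where congestion_wait() is hit today: the 5-second
        # background pass could not clean the whole queue, so back off.
        self.ratio = max(self.floor, self.ratio - self.step)
        self.last_lowered = time.monotonic()

    def writer_hit_sync_ratio(self):
        # Called when a dirtier is throttled at the sync limit; only raise
        # if nothing has been lowered in the last `holdoff` seconds.
        if time.monotonic() - self.last_lowered >= self.holdoff:
            self.ratio = min(self.ceiling, self.ratio + self.step)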


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-08 17:05               ` Linus Torvalds
@ 2009-01-08 19:57                 ` Jan Kara
  2009-01-08 20:01                   ` David Miller
  2009-01-09 18:02                   ` Jan Kara
  2009-01-14  3:29                 ` Nick Piggin
  1 sibling, 2 replies; 16+ messages in thread
From: Jan Kara @ 2009-01-08 19:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, David Miller, akpm, peterz, jack, linux-kernel,
	linux-mm, npiggin

On Thu 08-01-09 09:05:01, Linus Torvalds wrote:
> On Thu, 8 Jan 2009, Chris Mason wrote:
> > 
> > Does it make sense to hook into kupdate?  If kupdate finds it can't meet
> > the no-data-older-than 30 seconds target, it lowers the sync/async combo
> > down to some reasonable bottom.  
> > 
> > If it finds it is going to sleep without missing the target, raise the
> > combo up to some reasonable top.
> 
> I like autotuning, so that sounds like an intriguing approach. It's worked 
> for us before (ie VM).
> 
> That said, 30 seconds sounds like a _loong_ time for something like this. 
> I'd use the normal 5-second dirty_writeback_interval for this: if we can't 
> clean the whole queue in that normal background writeback interval, then 
> we try to lower the targets. We already have that "congestion_wait()" thing
> there, that would be a logical place, methinks.
  But I think there are workloads for which this is suboptimal to say the
least. Imagine you do some crazy LDAP database crunching or other similar load
which randomly writes to a big file (big means its size is roughly
comparable to your available memory). The kernel finds pdflush isn't able to
flush the data fast enough so we decrease dirty limits. This results in
even more aggressive flushing but that makes things even worse (in the sense
that your application runs slower and the disk is busy all the time anyway).
This is the kind of load where we observe problems currently.
  Ideally we could observe that we write out the same pages again and again
(or even pages close to them) and in that case be less aggressive about
writeback on the file. But it feels a bit overcomplicated...

> I'm not sure how to raise them, though. We don't want to raise any limits 
> just because the user suddenly went idle. I think the raising should 
> happen if we hit the sync/async ratio, and we haven't lowered in the last 
> 30 seconds or something like that.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
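
A toy simulation of the effect Jan describes: for a workload that keeps
rewriting random pages of a file about as large as memory, flushing more often
mostly means writing the same pages to disk more times (all parameters are
invented):

import random

def page_writes(pages=2_000, dirty_events=100_000, flush_every=1_000, seed=1):
    random.seed(seed)
    dirty = set()
    total = 0
    for tick in range(1, dirty_events + 1):
        dirty.add(random.randrange(pages))   # random scribble, like the DB
        if tick % flush_every == 0:          # periodic writeback pass
            total += len(dirty)
            dirty.clear()
    return total + len(dirty)                # final flush

for period in (500, 2_000, 10_000):
    print(f"flush every {period:6d} events -> "
          f"{page_writes(flush_every=period):6d} page writes")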


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-08 19:57                 ` Jan Kara
@ 2009-01-08 20:01                   ` David Miller
  2009-01-09 18:02                   ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: David Miller @ 2009-01-08 20:01 UTC (permalink / raw)
  To: jack; +Cc: torvalds, chris.mason, akpm, peterz, linux-kernel, linux-mm, npiggin

From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Jan 2009 20:57:28 +0100

>   But I think there are workloads for which this is suboptimal to say the
> least. Imagine you do some crazy LDAP database crunching or other similar load
> which randomly writes to a big file (big means its size is roughly
> comparable to your available memory). The kernel finds pdflush isn't able to
> flush the data fast enough so we decrease dirty limits. This results in
> even more aggressive flushing but that makes things even worse (in the sense
> that your application runs slower and the disk is busy all the time anyway).
> This is the kind of load where we observe problems currently.

I'm pretty sure this is what I see as well.

If you just barely fit your working GIT state into memory, and you are
not using "noatime" on that partition, doing a bunch of git operations
is just going to trigger all of this forced and blocking writeback on
the atime dirtying of the inodes, and this will subsequently grind
your machine to a halt if your disk is slow.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-08 19:57                 ` Jan Kara
  2009-01-08 20:01                   ` David Miller
@ 2009-01-09 18:02                   ` Jan Kara
  2009-01-09 19:00                     ` Andrew Morton
                                       ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Jan Kara @ 2009-01-09 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, David Miller, akpm, peterz, jack, linux-kernel,
	linux-mm, npiggin

[-- Attachment #1: Type: text/plain, Size: 3916 bytes --]

On Thu 08-01-09 20:57:28, Jan Kara wrote:
> On Thu 08-01-09 09:05:01, Linus Torvalds wrote:
> > On Thu, 8 Jan 2009, Chris Mason wrote:
> > > 
> > > Does it make sense to hook into kupdate?  If kupdate finds it can't meet
> > > the no-data-older-than 30 seconds target, it lowers the sync/async combo
> > > down to some reasonable bottom.  
> > > 
> > > If it finds it is going to sleep without missing the target, raise the
> > > combo up to some reasonable top.
> > 
> > I like autotuning, so that sounds like an intriguing approach. It's worked 
> > for us before (ie VM).
> > 
> > That said, 30 seconds sounds like a _loong_ time for something like this. 
> > I'd use the normal 5-second dirty_writeback_interval for this: if we can't 
> > clean the whole queue in that normal background writeback interval, then 
> > we try to lower the targets. We already have that "congestion_wait()" thing
> > there, that would be a logical place, methinks.
>   But I think there are workloads for which this is suboptimal to say the
> least. Imagine you do some crazy LDAP database crunching or other similar load
> which randomly writes to a big file (big means its size is roughly
> comparable to your available memory). The kernel finds pdflush isn't able to
> flush the data fast enough so we decrease dirty limits. This results in
> even more aggressive flushing but that makes things even worse (in the sense
> that your application runs slower and the disk is busy all the time anyway).
> This is the kind of load where we observe problems currently.
>   Ideally we could observe that we write out the same pages again and again
> (or even pages close to them) and in that case be less aggressive about
> writeback on the file. But it feels a bit overcomplicated...
  And there's actually one more thing that probably needs some improvement
in the writeback algorithms:
  What we observe in the seekwatcher graphs is that there are three
processes writing back the single database file in parallel (2 pdflush
threads because the machine has 2 CPUs, and the database process itself
because of dirty throttling). Each of the processes is writing back the
file at a different offset and so they together create even more random IO
(I'm attaching the graph and can provide blocktrace data if someone is
interested). If there was just one process doing the writeback, we'd be
writing back those data considerably faster...
  This problem could have a reasonably easy solution. IMHO if there is one
process doing writeback on a block device, there's no point for another
process to do any writeback on that device. Block device congestion
detection is supposed to avoid this I think but it does not work quite well
in this case. The result is (I guess) that all three threads are calling
write_cache_pages() on that single DB file; eventually the congested flag
is cleared from the block device, all three threads hungrily jump on
the file and start writing, which quickly congests the device again...
  My proposed solution would be that we'll have two flags per BDI -
PDFLUSH_IS_WRITING_BACK and THROTTLING_IS_WRITING_BACK. They are set /
cleared as their names suggest. When pdflush sees THROTTLING took place,
it relaxes and lets the throttled process do the work. Also pdflush would
not try writeback on devices that have PDFLUSH_IS_WRITING_BACK flag set
(OK, we should know that *this* pdflush thread set this flag for the device
and do writeback then, but I think you get the idea). This could improve
the situation at least for smaller machines, what do you think? I
understand that there might be a problem on machines with a lot of CPUs where
one thread might not be fast enough to send out all the dirty data created
by other CPUs. But as long as there is just one backing device, does it
really help to have more threads doing writeback even on a big machine?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: trace.png --]
[-- Type: image/png, Size: 95894 bytes --]
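
A userspace sketch of the two-flag idea above, just to make the intended
coordination explicit (the names mirror the proposal, the locking is simplified,
and nothing here is actual kernel code):

import threading

class BDI:
    """Stand-in for a backing device with the two proposed flags."""
    def __init__(self, name):
        self.name = name
        self.pdflush_is_writing_back = False
        self.throttling_is_writing_back = False
        self.lock = threading.Lock()

def pdflush_try_start(bdi):
    # pdflush backs off if another pdflush thread or a throttled dirtier
    # is already writing this device back.
    with bdi.lock:
        if bdi.pdflush_is_writing_back or bdi.throttling_is_writing_back:
            return False
        bdi.pdflush_is_writing_back = True
        return True

def pdflush_done(bdi):
    with bdi.lock:
        bdi.pdflush_is_writing_back = False

def throttled_writer_start(bdi):
    with bdi.lock:
        bdi.throttling_is_writing_back = True

def throttled_writer_done(bdi):
    with bdi.lock:
        bdi.throttling_is_writing_back = False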

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-09 18:02                   ` Jan Kara
@ 2009-01-09 19:00                     ` Andrew Morton
  2009-01-09 19:07                     ` Chris Mason
  2009-01-09 22:31                     ` david
  2 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2009-01-09 19:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Chris Mason, David Miller, peterz, linux-kernel,
	linux-mm, npiggin

On Fri, 9 Jan 2009 19:02:41 +0100 Jan Kara <jack@suse.cz> wrote:

> On Thu 08-01-09 20:57:28, Jan Kara wrote:
> > On Thu 08-01-09 09:05:01, Linus Torvalds wrote:
> > > On Thu, 8 Jan 2009, Chris Mason wrote:
> > > > 
> > > > Does it make sense to hook into kupdate?  If kupdate finds it can't meet
> > > > the no-data-older-than 30 seconds target, it lowers the sync/async combo
> > > > down to some reasonable bottom.  
> > > > 
> > > > If it finds it is going to sleep without missing the target, raise the
> > > > combo up to some reasonable top.
> > > 
> > > I like autotuning, so that sounds like an intriguing approach. It's worked 
> > > for us before (ie VM).
> > > 
> > > That said, 30 seconds sounds like a _loong_ time for something like this. 
> > > I'd use the normal 5-second dirty_writeback_interval for this: if we can't 
> > > clean the whole queue in that normal background writeback interval, then 
> > > we try to lower the targets. We already have that "congestion_wait()" thing
> > > there, that would be a logical place, methinks.
> >   But I think there are workloads for which this is suboptimal to say the
> > least. Imagine you do some crazy LDAP database crunching or other similar load
> > which randomly writes to a big file (big means its size is roughly
> > comparable to your available memory). The kernel finds pdflush isn't able to
> > flush the data fast enough so we decrease dirty limits. This results in
> > even more aggressive flushing but that makes things even worse (in the sense
> > that your application runs slower and the disk is busy all the time anyway).
> > This is the kind of load where we observe problems currently.
> >   Ideally we could observe that we write out the same pages again and again
> > (or even pages close to them) and in that case be less aggressive about
> > writeback on the file. But it feels a bit overcomplicated...
>   And there's actually one more thing that probably needs some improvement
> in the writeback algorithms:
>   What we observe in the seekwatcher graphs is that there are three
> processes writing back the single database file in parallel (2 pdflush
> threads because the machine has 2 CPUs, and the database process itself
> because of dirty throttling). Each of the processes is writing back the
> file at a different offset and so they together create even more random IO
> (I'm attaching the graph and can provide blocktrace data if someone is
> interested). If there was just one process doing the writeback, we'd be
> writing back those data considerably faster...

A database application really should be taking care of the writeback
scheduling itself, rather than hoping that the kernel behaves optimally.

yeah, it'd be nice if the kernel was perfect.  But in the real world,
there will always be gains available by smart use of (say)
sync_file_range().  Because the application knows more about its writeout
behaviour (especially _future_ behaviour) than the kernel ever will.

>   This problem could have a reasonably easy solution. IMHO if there is one
> process doing writeback on a block device, there's no point for another
> process to do any writeback on that device. Block device congestion
> detection is supposed to avoid this I think but it does not work quite well
> in this case. The result is (I guess) that all three threads are calling
> write_cache_pages() on that single DB file; eventually the congested flag
> is cleared from the block device, all three threads hungrily jump on
> the file and start writing, which quickly congests the device again...
>   My proposed solution would be that we'll have two flags per BDI -
> PDFLUSH_IS_WRITING_BACK and THROTTLING_IS_WRITING_BACK. They are set /
> cleared as their names suggest. When pdflush sees THROTTLING took place,
> it relaxes and lets the throttled process do the work. Also pdflush would
> not try writeback on devices that have PDFLUSH_IS_WRITING_BACK flag set
> (OK, we should know that *this* pdflush thread set this flag for the device
> and do writeback then, but I think you get the idea). This could improve
> the situation at least for smaller machines, what do you think? I
> understand that there might be a problem on machines with a lot of CPUs where
> one thread might not be fast enough to send out all the dirty data created
> by other CPUs. But as long as there is just one backing device, does it
> really help to have more threads doing writeback even on a big machine?

Yes, the XFS guys have said that some machines run out of puff when
only one CPU is doing writeback.
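
To illustrate the sync_file_range() point for the mmapped-database case from
this thread: the application can push its own dirty ranges out in bounded chunks
instead of leaving gigabytes of dirty pages for pdflush. The sketch below uses
mmap.flush() (msync) as a stand-in for the fd-based sync_file_range() call; the
file path and chunk size are made up.

import mmap
import os

CHUNK = 8 * 1024 * 1024     # flush 8 MiB at a time (page-aligned offsets)

def checkpoint(path="/tmp/example.db"):
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        with mmap.mmap(fd, size) as mm:
            # ... the application dirties mm[...] here ...
            for off in range(0, size, CHUNK):
                mm.flush(off, min(CHUNK, size - off))   # push this range out
    finally:
        os.close(fd)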


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-09 18:02                   ` Jan Kara
  2009-01-09 19:00                     ` Andrew Morton
@ 2009-01-09 19:07                     ` Chris Mason
  2009-01-09 22:31                     ` david
  2 siblings, 0 replies; 16+ messages in thread
From: Chris Mason @ 2009-01-09 19:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, David Miller, akpm, peterz, linux-kernel,
	linux-mm, npiggin

On Fri, 2009-01-09 at 19:02 +0100, Jan Kara wrote:

>   What we observe in the seekwatcher graphs is, that there are three
> processes writing back the single database file in parallel (2 pdflush
> threads because the machine has 2 CPUs, and the database process itself
> because of dirty throttling). Each of the processes is writing back the
> file at a different offset and so they together create even more random IO
> (I'm attaching the graph and can provide blocktrace data if someone is
> interested). If there was just one process doing the writeback, we'd be
> writing back those data considerably faster...

I spent some time trying similar things for btrfs, and went as far as
making my own writeback thread and changing pdflush and throttled writes
to wait on it.  It was a great hack, but in the end I found the real
problem was the way write_cache_pages is advancing the page_index.

You probably remember the related ext4 discussion, and you could try my
simple patch in this workload to see if it helps ext3.

http://lkml.org/lkml/2008/10/1/278

Ext3 may need similar tricks.

-chris



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-09 22:31                     ` david
@ 2009-01-09 21:34                       ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-01-09 21:34 UTC (permalink / raw)
  To: david
  Cc: Jan Kara, Linus Torvalds, Chris Mason, David Miller, akpm,
	linux-kernel, linux-mm, npiggin

On Fri, 2009-01-09 at 14:31 -0800, david@lang.hm wrote:

> for that matter, it's now getting to where it makes sense to have wildly 
> different storage on a machine
> 
> 10's of GB of SSD for super-fast read-mostly
> 100's of GB of high-speed SCSI for fast writes
> TB's of SATA for high capacity
> 
> does it make sense to consider tracking the dirty pages per-destination so 
> that in addition to only having one process writing to the drive at a time 
> you can also allow for different amounts of data to be queued per device?
> 
> on a machine with 10's of GB of ram it becomes possible to hit the point 
> where at one point you could have the entire SSD worth of data queued up 
> to write, and at another point have the same total amount of data queued 
> for the SATA storage and it's a fraction of a percent of the size of the 
> storage.

That's exactly what we do today. Dirty pages are tracked per backing
device and the writeback cache size is proportionally divided based on
recent write speed ratios of the devices.
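
A toy version of that proportional split, with invented device names and write
rates:

def per_bdi_thresholds(global_thresh_mib, recent_write_mib_s):
    # Each backing device gets a share of the global dirty threshold in
    # proportion to its recently observed writeback speed.
    total = sum(recent_write_mib_s.values())
    return {dev: global_thresh_mib * rate / total
            for dev, rate in recent_write_mib_s.items()}

shares = per_bdi_thresholds(
    global_thresh_mib=400,
    recent_write_mib_s={"ssd": 200.0, "scsi": 80.0, "sata": 40.0},
)
for dev, mib in shares.items():
    print(f"{dev:4s}: ~{mib:5.1f} MiB of the dirty budget")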


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-09 18:02                   ` Jan Kara
  2009-01-09 19:00                     ` Andrew Morton
  2009-01-09 19:07                     ` Chris Mason
@ 2009-01-09 22:31                     ` david
  2009-01-09 21:34                       ` Peter Zijlstra
  2 siblings, 1 reply; 16+ messages in thread
From: david @ 2009-01-09 22:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Chris Mason, David Miller, akpm, peterz,
	linux-kernel, linux-mm, npiggin

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3831 bytes --]

On Fri, 9 Jan 2009, Jan Kara wrote:

> On Thu 08-01-09 20:57:28, Jan Kara wrote:
>>   But I think there are workloads for which this is suboptimal to say the
>> least. Imagine you do some crazy LDAP database crunching or other similar load
>> which randomly writes to a big file (big means its size is roughly
>> comparable to your available memory). The kernel finds pdflush isn't able to
>> flush the data fast enough so we decrease dirty limits. This results in
>> even more aggressive flushing but that makes things even worse (in the sense
>> that your application runs slower and the disk is busy all the time anyway).
>> This is the kind of load where we observe problems currently.
>>   Ideally we could observe that we write out the same pages again and again
>> (or even pages close to them) and in that case be less aggressive about
>> writeback on the file. But it feels a bit overcomplicated...
>  And there's actually one more thing that probably needs some improvement
> in the writeback algorithms:
>  What we observe in the seekwatcher graphs is that there are three
> processes writing back the single database file in parallel (2 pdflush
> threads because the machine has 2 CPUs, and the database process itself
> because of dirty throttling). Each of the processes is writing back the
> file at a different offset and so they together create even more random IO
> (I'm attaching the graph and can provide blocktrace data if someone is
> interested). If there was just one process doing the writeback, we'd be
> writing back those data considerably faster...
>  This problem could have a reasonably easy solution. IMHO if there is one
> process doing writeback on a block device, there's no point for another
> process to do any writeback on that device. Block device congestion
> detection is supposed to avoid this I think but it does not work quite well
> in this case. The result is (I guess) that all three threads are calling
> write_cache_pages() on that single DB file; eventually the congested flag
> is cleared from the block device, all three threads hungrily jump on
> the file and start writing, which quickly congests the device again...
>  My proposed solution would be that we'll have two flags per BDI -
> PDFLUSH_IS_WRITING_BACK and THROTTLING_IS_WRITING_BACK. They are set /
> cleared as their names suggest. When pdflush sees THROTTLING took place,
> it relaxes and lets the throttled process do the work. Also pdflush would
> not try writeback on devices that have PDFLUSH_IS_WRITING_BACK flag set
> (OK, we should know that *this* pdflush thread set this flag for the device
> and do writeback then, but I think you get the idea). This could improve
> the situation at least for smaller machines, what do you think? I
> understand that there might be a problem on machines with a lot of CPUs where
> one thread might not be fast enough to send out all the dirty data created
> by other CPUs. But as long as there is just one backing device, does it
> really help to have more threads doing writeback even on a big machine?

for that matter, it's now getting to where it makes sense to have wildly 
different storage on a machine

10's of GB of SSD for super-fast read-mostly
100's of GB of high-speed SCSI for fast writes
TB's of SATA for high capacity

does it make sense to consider tracking the dirty pages per-destination so 
that in addition to only having one process writing to the drive at a time 
you can also allow for different amounts of data to be queued per device?

on a machine with 10's of GB of ram it becomes possible to hit the point 
where at one point you could have the entire SSD worth of data queued up 
to write, and at another point have the same total amount of data queued 
for the SATA storage and it's a fraction of a percent of the size of the 
storage.

David Lang

[-- Attachment #2: Type: IMAGE/PNG, Size: 95894 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Increase dirty_ratio and dirty_background_ratio?
  2009-01-08 17:05               ` Linus Torvalds
  2009-01-08 19:57                 ` Jan Kara
@ 2009-01-14  3:29                 ` Nick Piggin
  1 sibling, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2009-01-14  3:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, David Miller, akpm, peterz, jack, linux-kernel, linux-mm

On Thu, Jan 08, 2009 at 09:05:01AM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 8 Jan 2009, Chris Mason wrote:
> > 
> > Does it make sense to hook into kupdate?  If kupdate finds it can't meet
> > the no-data-older-than 30 seconds target, it lowers the sync/async combo
> > down to some reasonable bottom.  
> > 
> > If it finds it is going to sleep without missing the target, raise the
> > combo up to some reasonable top.
> 
> I like autotuning, so that sounds like an intriguing approach. It's worked 
> for us before (ie VM).
> 
> That said, 30 seconds sounds like a _loong_ time for something like this. 
> I'd use the normal 5-second dirty_writeback_interval for this: if we can't 
> clean the whole queue in that normal background writeback interval, then 
> we try to lower the targets. We already have that "congestion_wait()" thing
> there, that would be a logical place, methinks.
> 
> I'm not sure how to raise them, though. We don't want to raise any limits 
> just because the user suddenly went idle. I think the raising should 
> happen if we hit the sync/async ratio, and we haven't lowered in the last 
> 30 seconds or something like that.

The other problem is that the pagecache is quite far removed from the
block device. Writeback can go to different devices, and those devices
might have different speeds at different times or different patterns.

We might autosize our dirty data to 500MB when doing linear writes because
our block device is happily cleaning them at 100MB/s and latency is
great. But then if some process inserts even 20MB worth of very seeky dirty
pages, the time to flush can go up by an order of magnitude. Let alone
500MB.
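
Rough arithmetic for that example (the 100 MB/s sequential rate and 10 ms per
random 4 KiB page are assumptions):

LINEAR_MB_S = 100      # assumed sequential writeback speed
SEEK_MS = 10           # assumed seek + write time per random 4 KiB page

linear_secs = 500 / LINEAR_MB_S
seeky_pages = 20 * 1024 // 4
seeky_secs = seeky_pages * SEEK_MS / 1000

print(f"500 MB linear: ~{linear_secs:.0f} s")
print(f" 20 MB seeky : ~{seeky_secs:.0f} s ({seeky_pages} random pages)")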


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2009-01-14  3:29 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20090107154517.GA5565@duck.suse.cz>
2009-01-07 16:25 ` Increase dirty_ratio and dirty_background_ratio? Peter Zijlstra
2009-01-07 16:39   ` Linus Torvalds
2009-01-07 20:51     ` David Miller
2009-01-08 11:02       ` Andrew Morton
2009-01-08 16:24         ` David Miller
2009-01-08 16:48           ` Linus Torvalds
2009-01-08 16:55             ` Chris Mason
2009-01-08 17:05               ` Linus Torvalds
2009-01-08 19:57                 ` Jan Kara
2009-01-08 20:01                   ` David Miller
2009-01-09 18:02                   ` Jan Kara
2009-01-09 19:00                     ` Andrew Morton
2009-01-09 19:07                     ` Chris Mason
2009-01-09 22:31                     ` david
2009-01-09 21:34                       ` Peter Zijlstra
2009-01-14  3:29                 ` Nick Piggin
