From: david@lang.hm
To: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Chris Mason <chris.mason@oracle.com>,
David Miller <davem@davemloft.net>,
akpm@linux-foundation.org, peterz@infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
npiggin@suse.de
Subject: Re: Increase dirty_ratio and dirty_background_ratio?
Date: Fri, 9 Jan 2009 14:31:39 -0800 (PST)
Message-ID: <alpine.DEB.1.10.0901091420190.3525@asgard.lang.hm>
In-Reply-To: <20090109180241.GA15023@duck.suse.cz>
[-- Attachment #1: Type: TEXT/PLAIN, Size: 3831 bytes --]
On Fri, 9 Jan 2009, Jan Kara wrote:
> On Thu 08-01-09 20:57:28, Jan Kara wrote:
>> But I think there are workloads for which this is suboptimal to say the
>> least. Imagine you do some crazy LDAP database crunching or other similar load
>> which randomly writes to a big file (big means its size is roughly
>> comparable to your available memory). Kernel finds pdflush isn't able to
>> flush the data fast enough so we decrease dirty limits. This results in
>> even more aggressive flushing but that makes things even worse (in the sense
>> that your application runs slower and the disk is busy all the time anyway).
>> This is the kind of load where we observe problems currently.
>> Ideally we could observe that we write out the same pages again and again
>> (or even pages close to them) and in that case be less aggressive about
>> writeback on the file. But it feels a bit overcomplicated...
> And there's actually one more thing that probably needs some improvement
> in the writeback algorithms:
> What we observe in the seekwatcher graphs is that there are three
> processes writing back the single database file in parallel (2 pdflush
> threads because the machine has 2 CPUs, and the database process itself
> because of dirty throttling). Each of the processes is writing back the
> file at a different offset and so they together create even more random IO
> (I'm attaching the graph and can provide blocktrace data if someone is
> interested). If there was just one process doing the writeback, we'd be
> writing back that data considerably faster...
> This problem could have a reasonably easy solution. IMHO if there is one
> process doing writeback on a block device, there's no point for another
> process to do any writeback on that device. Block device congestion
> detection is supposed to avoid this, I think, but it does not work quite well
> in this case. The result is (I guess) that all three threads are calling
> write_cache_pages() on that single DB file; eventually the congested flag
> is cleared from the block device, and then all three threads hungrily jump
> on the file and start writing, which quickly congests the device again...
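if I'm reading that right, each of the three threads is effectively running
something like the loop below (a simplified sketch, not real kernel code,
with generic_writepages() standing in for whatever writeback path each
thread actually takes):

static void push_one_file(struct address_space *mapping,
                          struct writeback_control *wbc)
{
        /* each caller has its own writeback_control, so its own offset */
        struct backing_dev_info *bdi = mapping->backing_dev_info;

        while (wbc->nr_to_write > 0) {
                if (bdi_write_congested(bdi)) {
                        /* back off while the device is congested... */
                        congestion_wait(WRITE, HZ/10);
                        /* ...but all the waiters wake up together */
                        continue;
                }
                /* everyone resumes on the same file, each at a different
                 * offset, and the device congests again */
                generic_writepages(mapping, wbc);
        }
}

nothing in there makes the three threads back off from each other, only from
the congested device, which would explain the interleaved streams in the graph.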
> My proposed solution would be that we'll have two flags per BDI -
> PDFLUSH_IS_WRITING_BACK and THROTTLING_IS_WRITING_BACK. They are set /
> cleared as their names suggest. When pdflush sees THROTTLING took place,
> it relaxes and lets the throttled process do the work. Also pdflush would
> not try writeback on devices that have PDFLUSH_IS_WRITING_BACK flag set
> (OK, we should know that *this* pdflush thread set this flag for the device
> and do writeback then, but I think you get the idea). This could improve
> the situation at least for smaller machines; what do you think? I
> understand that there might be a problem on machines with a lot of CPUs where
> one thread might not be fast enough to send out all the dirty data created
> by other CPUs. But as long as there is just one backing device, does it
> really help to have more threads doing writeback even on a big machine?
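if I understand the proposal, it would look something like this (the two
flag names are from your description, the wb_flags field and the helper
are made up for illustration):

#define BDI_PDFLUSH_IS_WRITING_BACK     0
#define BDI_THROTTLING_IS_WRITING_BACK  1

/* assumes an unsigned long wb_flags is added to struct backing_dev_info */
static int pdflush_may_write_back(struct backing_dev_info *bdi)
{
        /* a throttled writer is already pushing this device, let it work */
        if (test_bit(BDI_THROTTLING_IS_WRITING_BACK, &bdi->wb_flags))
                return 0;
        /* only the pdflush thread that sets the bit gets the device */
        return !test_and_set_bit(BDI_PDFLUSH_IS_WRITING_BACK, &bdi->wb_flags);
}

/* balance_dirty_pages() would set/clear BDI_THROTTLING_IS_WRITING_BACK
 * around its own writeback, and pdflush would clear
 * BDI_PDFLUSH_IS_WRITING_BACK when it is done with the device */

that would at least keep the two pdflush threads from fighting each other
over a single device.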
for that matter, it's now getting to the point where it makes sense to have
wildly different storage on a machine:
10's of GB of SSD for super-fast read-mostly
100's of GB of high-speed SCSI for fast writes
TB's of SATA for high capacity
does it make sense to consider tracking the dirty pages per destination, so
that in addition to only having one process writing to the drive at a time,
you can also allow different amounts of data to be queued per device?
on a machine with 10's of GB of RAM it becomes possible to hit the point
where at one moment you could have an entire SSD's worth of data queued up
to write, and at another moment have the same total amount of data queued
for the SATA storage, where it's only a fraction of a percent of the size
of the storage.
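something like this back-of-the-envelope sketch is what I have in mind
(none of it is real kernel code, all the names are invented):

/* per-device dirty accounting: how much is queued for this device, and
 * how much we're willing to let queue up, scaled to the device instead
 * of to total ram */
struct bdi_dirty_info {
        unsigned long long device_bytes;  /* capacity behind this bdi */
        unsigned long long dirty_bytes;   /* dirty data headed for it */
        unsigned int dirty_pct;           /* per-device limit, % of capacity */
};

static unsigned long long bdi_dirty_limit(const struct bdi_dirty_info *d,
                                          unsigned long long global_limit)
{
        unsigned long long limit = d->device_bytes / 100 * d->dirty_pct;

        /* a single device never gets more than the global dirty limit */
        return limit < global_limit ? limit : global_limit;
}

the throttling path would then compare dirty_bytes against bdi_dirty_limit()
for the device a page is headed to, so the SSD, the SCSI array and the SATA
array each get an amount of queued dirty data that makes sense for them.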
David Lang
[-- Attachment #2: Type: IMAGE/PNG, Size: 95894 bytes --]