From: Neil Brown <neilb@suse.de>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
miklos@szeredi.hu, akpm@linux-foundation.org, dgc@sgi.com,
tomoki.sekiyama.qu@hitachi.com, nikita@clusterfs.com,
trond.myklebust@fys.uio.no, yingchao.zhou@gmail.com
Subject: Re: [PATCH 10/10] mm: per device dirty threshold
Date: Tue, 24 Apr 2007 12:58:12 +1000 [thread overview]
Message-ID: <17965.29252.950216.971096@notabene.brown> (raw)
In-Reply-To: message from Peter Zijlstra on Friday April 20
On Friday April 20, a.p.zijlstra@chello.nl wrote:
> Scale writeback cache per backing device, proportional to its writeout speed.
So it works like this:
We account for writeout in full pages.
When a page has the Writeback flag cleared, we account that as a
successfully retired write for the relevant bdi.
By using floating averages we keep track of how many writes each bdi
has retired 'recently' where the unit of time in which we understand
'recently' is a single page written.
We keep a floating average for each bdi, and a floating average for
the total writeouts (that 'average' is, of course, 1.)
Using these numbers we can calculate what faction of 'recently'
retired writes were retired by each bdi (get_writeout_scale).
Multiplying this fraction by the system-wide number of pages that are
allowed to be dirty before write-throttling, we get the number of
pages that the bdi can have dirty before write-throttling the bdi.
I note that the same fraction is *not* applied to background_thresh.
Should it be? I guess not - there would be interesting starting
transients, as a bdi which had done no writeout would not be allowed
any dirty pages, so background writeout would start immediately,
which isn't what you want... or is it?
For each bdi we also track the number of (dirty, writeback, unstable)
pages and do not allow this to exceed the limit set for this bdi.
The calculations involving 'reserve' in get_dirty_limits are a little
confusing. It looks like you calculating how much total head-room
there is for the bdi (pages that the system can still dirty - pages
this bdi has dirty) and making sure the number returned in pbdi_dirty
doesn't allow more than that to be used. This is probably a
reasonable thing to do but it doesn't feel like the right place. I
think get_dirty_limits should return the raw threshold, and
balance_dirty_pages should do both tests - the bdi-local test and the
system-wide test.
Currently you have a rather odd situation where
+ if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+ break;
might included numbers obtained with bdi_stat_sum being compared with
numbers obtained with bdi_stat.
With these patches, the VM still (I think) assumes that each BDI has
a reasonable queue limit, so that writeback_inodes will block on a
full queue. If a BDI has a very large queue, balance_dirty_pages
will simply turn lots of DIRTY pages into WRITEBACK pages and then
think "We've done our duty" without actually blocking at all.
With the extra accounting that we now have, I would like to see
balance_dirty_pages dirty pages wait until RECLAIMABLE+WRITEBACK is
actually less than 'threshold'. This would probably mean that we
would need to support per-bdi background_writeout to smooth things
out. Maybe that it fodder for another patch-set.
You set:
+ vm_cycle_shift = 1 + ilog2(vm_total_pages);
Can you explain that? My experience is that scaling dirty limits
with main memory isn't what we really want. When you get machines
with very large memory, the amount that you want to be dirty is more
a function of the speed of your IO devices, rather than the amount
of memory, otherwise you can sometimes see large filesystem lags
('sync' taking minutes?)
I wonder if it makes sense to try to limit the dirty data for a bdi
to the amount that it can write out in some period of time - maybe 3
seconds. Probably configurable. You seem to have almost all the
infrastructure in place to do that, and I think it could be a
valuable feature.
At least, I think vm_cycle_shift should be tied (loosely) to
dirty_ratio * vm_total_pages
??
On the whole, looks good!
Thanks,
NeilBrown
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-04-24 2:58 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-04-20 15:51 [PATCH 00/10] per device dirty throttling -v5 Peter Zijlstra
2007-04-20 15:51 ` [PATCH 01/10] revert per-backing_dev-dirty-and-writeback-page-accounting Peter Zijlstra
2007-04-20 15:51 ` [PATCH 02/10] nfs: remove congestion_end() Peter Zijlstra
2007-04-20 15:51 ` [PATCH 03/10] lib: dampen the percpu_counter FBC_BATCH Peter Zijlstra
2007-04-21 9:55 ` Andrew Morton
2007-04-21 10:58 ` Peter Zijlstra
2007-04-20 15:51 ` [PATCH 04/10] lib: percpu_counter_mod64 Peter Zijlstra
2007-04-21 9:55 ` Andrew Morton
2007-04-21 11:02 ` Peter Zijlstra
2007-04-21 19:21 ` Andrew Morton
2007-04-21 19:30 ` Peter Zijlstra
2007-04-20 15:51 ` [PATCH 05/10] mm: bdi init hooks Peter Zijlstra
2007-04-20 15:52 ` [PATCH 06/10] mm: scalable bdi statistics counters Peter Zijlstra
2007-04-20 15:52 ` [PATCH 07/10] mm: count reclaimable pages per BDI Peter Zijlstra
2007-04-21 9:55 ` Andrew Morton
2007-04-21 11:04 ` Peter Zijlstra
2007-04-20 15:52 ` [PATCH 08/10] mm: count writeback " Peter Zijlstra
2007-04-21 9:55 ` Andrew Morton
2007-04-21 11:07 ` Peter Zijlstra
2007-04-22 7:19 ` Andrew Morton
2007-04-22 9:08 ` Peter Zijlstra
2007-04-20 15:52 ` [PATCH 09/10] mm: expose BDI statistics in sysfs Peter Zijlstra
2007-04-21 9:55 ` Andrew Morton
2007-04-21 11:08 ` Peter Zijlstra
2007-04-20 15:52 ` [PATCH 10/10] mm: per device dirty threshold Peter Zijlstra
2007-04-21 9:55 ` Andrew Morton
2007-04-21 10:38 ` Miklos Szeredi
2007-04-21 10:54 ` Andrew Morton
2007-04-21 20:25 ` Miklos Szeredi
2007-04-23 6:14 ` Peter Zijlstra
2007-04-23 6:29 ` Miklos Szeredi
2007-04-23 6:39 ` Andrew Morton
2007-04-21 12:01 ` Peter Zijlstra
2007-04-21 12:15 ` Peter Zijlstra
2007-04-21 19:50 ` Peter Zijlstra
2007-04-23 15:48 ` Christoph Lameter
2007-04-23 15:58 ` Peter Zijlstra
2007-04-23 16:08 ` Christoph Lameter
2007-04-22 7:26 ` Andrew Morton
2007-04-24 2:58 ` Neil Brown [this message]
2007-04-24 7:09 ` Peter Zijlstra
2007-04-24 8:19 ` Miklos Szeredi
2007-04-24 8:31 ` Peter Zijlstra
2007-04-24 9:14 ` Miklos Szeredi
2007-04-24 9:26 ` Peter Zijlstra
2007-04-24 9:47 ` Miklos Szeredi
2007-04-24 10:00 ` Andrew Morton
2007-04-24 10:12 ` Peter Zijlstra
2007-04-24 10:19 ` Miklos Szeredi
2007-04-24 10:24 ` Peter Zijlstra
2007-04-24 10:40 ` Andrew Morton
2007-04-24 11:22 ` Miklos Szeredi
2007-04-24 11:50 ` Andrew Morton
2007-04-24 12:07 ` Miklos Szeredi
2007-04-22 9:57 ` [PATCH 00/10] per device dirty throttling -v5 Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=17965.29252.950216.971096@notabene.brown \
--to=neilb@suse.de \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=dgc@sgi.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=miklos@szeredi.hu \
--cc=nikita@clusterfs.com \
--cc=tomoki.sekiyama.qu@hitachi.com \
--cc=trond.myklebust@fys.uio.no \
--cc=yingchao.zhou@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox