From: Minchan Kim <minchan.kim@gmail.com>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>,
Christoph Hellwig <hch@infradead.org>,
Mel Gorman <mgorman@suse.de>,
Johannes Weiner <jweiner@redhat.com>,
"xfs@oss.sgi.com" <xfs@oss.sgi.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Date: Wed, 6 Jul 2011 15:47:02 +0900 [thread overview]
Message-ID: <CAEwNFnCjqxBGmffeDV4_U=gxz6nz6BHyocb5T=QdCP4fT2knPA@mail.gmail.com> (raw)
In-Reply-To: <20110706045301.GA11604@localhost>
On Wed, Jul 6, 2011 at 1:53 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
>> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
>> > Christoph,
>> >
>> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
>> > > Johannes, Mel, Wu,
>> > >
>> > > Dave has been stressing some XFS patches of mine that remove the XFS
>> > > internal writeback clustering in favour of using write_cache_pages.
>> > >
>> > > As part of investigating the behaviour he found out that we're still
>> > > doing lots of I/O from the end of the LRU in kswapd. Not only is that
>> > > pretty bad behaviour in general, but it also means we really can't
>> > > just remove the writeback clustering in writepage given how much
>> > > I/O is still done through that.
>> > >
>> > > Any chance we could the writeback vs kswap behaviour sorted out a bit
>> > > better finally?
>> >
>> > I once tried this approach:
>> >
>> > http://www.spinics.net/lists/linux-mm/msg09202.html
>> >
>> > It used a list structure that is not linearly scalable, however that
>> > part should be independently improvable when necessary.
>>
>> I don't think that handing random writeback to the flusher thread is
>> much better than doing random writeback directly. Yes, you added
>> some clustering, but I'm still don't think writing specific pages is
>> the best solution.
>
> I agree that the VM should avoid writing specific pages as much as
> possible. Mostly often, it's indeed OK to just skip sporadically
> encountered dirty page and reclaim the clean pages presumably not
> far away in the LRU list. So your 2-liner patch is all good if
> constraining it to low scan pressure, which will look like
>
> if (priority == DEF_PRIORITY)
> tag PG_reclaim on encountered dirty pages and
> skip writing it
>
> However the VM in general does need the ability to write specific
> pages, such as when reclaiming from specific zone/memcg. So I'll still
> propose to do bdi_start_inode_writeback().
>
> Below is the patch rebased to linux-next. It's good enough for testing
> purpose, and I guess even with the ->nr_pages work issue, it's
> complete enough to get roughly the same performance as your 2-liner
> patch.
>
>> > The real problem was, it seem to not very effective in my test runs.
>> > I found many ->nr_pages works queued before the ->inode works, which
>> > effectively makes the flusher working on more dispersed pages rather
>> > than focusing on the dirty pages encountered in LRU reclaim.
>>
>> But that's really just an implementation issue related to how you
>> tried to solve the problem. That could be addressed.
>>
>> However, what I'm questioning is whether we should even care what
>> page memory reclaim wants to write - it seems to make fundamentally
>> bad decisions from an IO persepctive.
>>
>> We have to remember that memory reclaim is doing LRU reclaim and the
>> flusher threads are doing "oldest first" writeback. IOWs, both are trying
>> to operate in the same direction (oldest to youngest) for the same
>> purpose. The fundamental problem that occurs when memory reclaim
>> starts writing pages back from the LRU is this:
>>
>> - memory reclaim has run ahead of IO writeback -
>>
>> The LRU usually looks like this:
>>
>> oldest youngest
>> +---------------+---------------+--------------+
>> clean writeback dirty
>> ^ ^
>> | |
>> | Where flusher will next work from
>> | Where kswapd is working from
>> |
>> IO submitted by flusher, waiting on completion
>>
>>
>> If memory reclaim is hitting dirty pages on the LRU, it means it has
>> got ahead of writeback without being throttled - it's passed over
>> all the pages currently under writeback and is trying to write back
>> pages that are *newer* than what writeback is working on. IOWs, it
>> starts trying to do the job of the flusher threads, and it does that
>> very badly.
>>
>> The $100 question is ∗why is it getting ahead of writeback*?
>
> The most important case is: faster reader + relatively slow writer.
>
> Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
> is fast enough to trigger the 20% dirty ratio and hence dirty balancing.
>
> That pattern is able to evenly distribute dirty pages all over the LRU
> list and hence trigger lots of pageout()s. The "skip reclaim writes on
> low pressure" approach can fix this case.
>
> Thanks,
> Fengguang
> ---
> Subject: writeback: introduce bdi_start_inode_writeback()
> Date: Thu Jul 29 14:41:19 CST 2010
>
> This relays ASYNC file writeback IOs to the flusher threads.
>
> pageout() will continue to serve the SYNC file page writes for necessary
> throttling for preventing OOM, which may happen if the LRU list is small
> and/or the storage is slow, so that the flusher cannot clean enough
> pages before the LRU is full scanned.
>
> Only ASYNC pageout() is relayed to the flusher threads, the less
> frequent SYNC pageout()s will work as before as a last resort.
> This helps to avoid OOM when the LRU list is small and/or the storage is
> slow, and the flusher cannot clean enough pages before the LRU is
> full scanned.
>
> The flusher will piggy back more dirty pages for IO
> - it's more IO efficient
> - it helps clean more pages, a good number of them may sit in the same
> LRU list that is being scanned.
>
> To avoid memory allocations at page reclaim, a mempool is created.
>
> Background/periodic works will quit automatically (as done in another
> patch), so as to clean the pages under reclaim ASAP. However for now the
> sync work can still block us for long time.
>
> Jan Kara: limit the search scope.
>
> CC: Jan Kara <jack@suse.cz>
> CC: Rik van Riel <riel@redhat.com>
> CC: Mel Gorman <mel@linux.vnet.ibm.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
It seems to be enhanced version of old Mel's done.
I support this approach :) but I have some questions.
> ---
> fs/fs-writeback.c | 156 ++++++++++++++++++++++++++++-
> include/linux/backing-dev.h | 1
> include/trace/events/writeback.h | 15 ++
> mm/vmscan.c | 8 +
> 4 files changed, 174 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/vmscan.c 2011-06-29 20:43:10.000000000 -0700
> +++ linux-next/mm/vmscan.c 2011-07-05 18:30:19.000000000 -0700
> @@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st
> if (PageDirty(page)) {
> nr_dirty++;
>
> + if (page_is_file_cache(page) && mapping &&
> + sc->reclaim_mode != RECLAIM_MODE_SYNC) {
> + if (flush_inode_page(page, mapping) >= 0) {
> + SetPageReclaim(page);
> + goto keep_locked;
keep_locked changes old behavior.
Normally, in case of async mode, we does keep_lumpy(ie, we didn't
reset reclaim_mode) but now you are always resetting reclaim_mode. so
sync call of shrink_page_list never happen if flush_inode_page is
successful.
Is it your intention?
> + }
> + }
> +
If flush_inode_page fails(ie, the page isn't nearby of current work's
writeback range), we still do pageout although it's async mode. Is it
your intention?
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2011-07-06 6:47 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20110629140109.003209430@bombadil.infradead.org>
[not found] ` <20110629140336.950805096@bombadil.infradead.org>
[not found] ` <20110701022248.GM561@dastard>
[not found] ` <20110701041851.GN561@dastard>
2011-07-01 9:33 ` Christoph Hellwig
2011-07-01 14:59 ` Mel Gorman
2011-07-01 15:15 ` Christoph Hellwig
2011-07-02 2:42 ` Dave Chinner
2011-07-05 14:10 ` Mel Gorman
2011-07-05 15:55 ` Dave Chinner
2011-07-11 10:26 ` Christoph Hellwig
2011-07-01 15:41 ` Wu Fengguang
2011-07-04 3:25 ` Dave Chinner
2011-07-05 14:34 ` Mel Gorman
2011-07-06 1:23 ` Dave Chinner
2011-07-11 11:10 ` Christoph Hellwig
2011-07-06 4:53 ` Wu Fengguang
2011-07-06 6:47 ` Minchan Kim [this message]
2011-07-06 7:17 ` Dave Chinner
2011-07-06 15:12 ` Johannes Weiner
2011-07-08 9:54 ` Dave Chinner
2011-07-11 17:20 ` Johannes Weiner
2011-07-11 17:24 ` Christoph Hellwig
2011-07-11 19:09 ` Rik van Riel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAEwNFnCjqxBGmffeDV4_U=gxz6nz6BHyocb5T=QdCP4fT2knPA@mail.gmail.com' \
--to=minchan.kim@gmail.com \
--cc=david@fromorbit.com \
--cc=fengguang.wu@intel.com \
--cc=hch@infradead.org \
--cc=jweiner@redhat.com \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox