Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering

From: Minchan Kim <minchan.kim@gmail.com>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <jweiner@redhat.com>,
	"xfs@oss.sgi.com" <xfs@oss.sgi.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Date: Wed, 6 Jul 2011 15:47:02 +0900
Message-ID: <CAEwNFnCjqxBGmffeDV4_U=gxz6nz6BHyocb5T=QdCP4fT2knPA@mail.gmail.com>
In-Reply-To: <20110706045301.GA11604@localhost>

On Wed, Jul 6, 2011 at 1:53 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
>> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
>> > Christoph,
>> >
>> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
>> > > Johannes, Mel, Wu,
>> > >
>> > > Dave has been stressing some XFS patches of mine that remove the XFS
>> > > internal writeback clustering in favour of using write_cache_pages.
>> > >
>> > > As part of investigating the behaviour he found out that we're still
>> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
>> > > pretty bad behaviour in general, but it also means we really can't
>> > > just remove the writeback clustering in writepage given how much
>> > > I/O is still done through that.
>> > >
>> > > Any chance we could get the writeback vs kswapd behaviour sorted out a bit
>> > > better finally?
>> >
>> > I once tried this approach:
>> >
>> > http://www.spinics.net/lists/linux-mm/msg09202.html
>> >
>> > It used a list structure that is not linearly scalable, however that
>> > part should be independently improvable when necessary.
>>
>> I don't think that handing random writeback to the flusher thread is
>> much better than doing random writeback directly.  Yes, you added
>> some clustering, but I still don't think writing specific pages is
>> the best solution.
>
> I agree that the VM should avoid writing specific pages as much as
> possible. Most often, it's indeed OK to just skip sporadically
> encountered dirty pages and reclaim the clean pages presumably not
> far away in the LRU list. So your 2-liner patch is all good if
> constrained to low scan pressure, which will look like
>
>        if (priority == DEF_PRIORITY)
>                tag PG_reclaim on encountered dirty pages and
>                skip writing it
>
> However the VM in general does need the ability to write specific
> pages, such as when reclaiming from specific zone/memcg. So I'll still
> propose to do bdi_start_inode_writeback().
>
> Below is the patch rebased to linux-next. It's good enough for testing
> purpose, and I guess even with the ->nr_pages work issue, it's
> complete enough to get roughly the same performance as your 2-liner
> patch.
>
>> > The real problem was, it seemed to be not very effective in my test runs.
>> > I found many ->nr_pages works queued before the ->inode works, which
>> > effectively makes the flusher working on more dispersed pages rather
>> > than focusing on the dirty pages encountered in LRU reclaim.
>>
>> But that's really just an implementation issue related to how you
>> tried to solve the problem. That could be addressed.
>>
>> However, what I'm questioning is whether we should even care what
>> page memory reclaim wants to write - it seems to make fundamentally
>> bad decisions from an IO perspective.
>>
>> We have to remember that memory reclaim is doing LRU reclaim and the
>> flusher threads are doing "oldest first" writeback. IOWs, both are trying
>> to operate in the same direction (oldest to youngest) for the same
>> purpose.  The fundamental problem that occurs when memory reclaim
>> starts writing pages back from the LRU is this:
>>
>>       - memory reclaim has run ahead of IO writeback -
>>
>> The LRU usually looks like this:
>>
>>       oldest                                  youngest
>>       +---------------+---------------+--------------+
>>       clean           writeback       dirty
>>                       ^               ^
>>                       |               |
>>                       |               Where flusher will next work from
>>                       |               Where kswapd is working from
>>                       |
>>                       IO submitted by flusher, waiting on completion
>>
>>
>> If memory reclaim is hitting dirty pages on the LRU, it means it has
>> got ahead of writeback without being throttled - it's passed over
>> all the pages currently under writeback and is trying to write back
>> pages that are *newer* than what writeback is working on. IOWs, it
>> starts trying to do the job of the flusher threads, and it does that
>> very badly.
>>
>> The $100 question is *why is it getting ahead of writeback*?
>
> The most important case is: faster reader + relatively slow writer.
>
> Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
> is fast enough to trigger the 20% dirty ratio and hence dirty balancing.
>
> That pattern is able to evenly distribute dirty pages all over the LRU
> list and hence trigger lots of pageout()s. The "skip reclaim writes on
> low pressure" approach can fix this case.
>
> Thanks,
> Fengguang
> ---
> Subject: writeback: introduce bdi_start_inode_writeback()
> Date: Thu Jul 29 14:41:19 CST 2010
>
> This relays ASYNC file writeback IOs to the flusher threads.
>
> pageout() will continue to serve the SYNC file page writes for necessary
> throttling for preventing OOM, which may happen if the LRU list is small
> and/or the storage is slow, so that the flusher cannot clean enough
> pages before the LRU is fully scanned.
>
> Only ASYNC pageout() is relayed to the flusher threads, the less
> frequent SYNC pageout()s will work as before as a last resort.
> This helps to avoid OOM when the LRU list is small and/or the storage is
> slow, and the flusher cannot clean enough pages before the LRU is
> fully scanned.
>
> The flusher will piggy back more dirty pages for IO
> - it's more IO efficient
> - it helps clean more pages, a good number of them may sit in the same
>  LRU list that is being scanned.
>
> To avoid memory allocations at page reclaim, a mempool is created.
>
> Background/periodic works will quit automatically (as done in another
> patch), so as to clean the pages under reclaim ASAP. However for now the
> sync work can still block us for long time.
>
> Jan Kara: limit the search scope.
>
> CC: Jan Kara <jack@suse.cz>
> CC: Rik van Riel <riel@redhat.com>
> CC: Mel Gorman <mel@linux.vnet.ibm.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

This looks like an enhanced version of what Mel did earlier.
I support this approach :) but I have some questions.

> ---
>  fs/fs-writeback.c                |  156 ++++++++++++++++++++++++++++-
>  include/linux/backing-dev.h      |    1
>  include/trace/events/writeback.h |   15 ++
>  mm/vmscan.c                      |    8 +
>  4 files changed, 174 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/vmscan.c 2011-06-29 20:43:10.000000000 -0700
> +++ linux-next/mm/vmscan.c      2011-07-05 18:30:19.000000000 -0700
> @@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st
>                if (PageDirty(page)) {
>                        nr_dirty++;
>
> +                       if (page_is_file_cache(page) && mapping &&
> +                           sc->reclaim_mode != RECLAIM_MODE_SYNC) {
> +                               if (flush_inode_page(page, mapping) >= 0) {
> +                                       SetPageReclaim(page);
> +                                       goto keep_locked;

Jumping to keep_locked changes the old behavior.
Normally, in async mode, we go to keep_lumpy (i.e., we don't reset
reclaim_mode), but now you always reset reclaim_mode. So the sync
call of shrink_page_list never happens if flush_inode_page
succeeds.
Is that your intention?
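
To make the concern concrete, here is a tiny userspace model of how I
read the exit labels at the end of the per-page loop in
shrink_page_list(): keep_locked falls through to keep, which resets
reclaim_mode, while keep_lumpy skips that reset. Every toy_* name is a
hypothetical stand-in and not a kernel definition; this is only a
sketch of the fall-through order, not the real code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for illustration only; NOT the kernel's definitions. */
enum toy_reclaim_mode { TOY_MODE_NONE = 0, TOY_MODE_LUMPY_ASYNC = 1 };
enum toy_exit { TOY_KEEP_LOCKED, TOY_KEEP, TOY_KEEP_LUMPY };

struct toy_scan_control {
	enum toy_reclaim_mode reclaim_mode;
};

/*
 * Models the tail of the per-page loop. The labels fall through
 * downward, so entering at keep_locked also runs the keep step.
 */
static void toy_page_exit(struct toy_scan_control *sc, enum toy_exit where,
			  bool *unlocked)
{
	switch (where) {
	case TOY_KEEP_LOCKED:
		*unlocked = true;			/* models unlock_page() */
		/* fall through */
	case TOY_KEEP:
		sc->reclaim_mode = TOY_MODE_NONE;	/* models reset_reclaim_mode() */
		/* fall through */
	case TOY_KEEP_LUMPY:
		break;		/* page is put back on the return list */
	}
}

/* Returns true if lumpy reclaim mode survives the chosen exit path. */
static bool toy_mode_survives(enum toy_exit where)
{
	struct toy_scan_control sc = { TOY_MODE_LUMPY_ASYNC };
	bool unlocked = false;

	toy_page_exit(&sc, where, &unlocked);
	return sc.reclaim_mode != TOY_MODE_NONE;
}
```

In this model toy_mode_survives(TOY_KEEP_LUMPY) is true and
toy_mode_survives(TOY_KEEP_LOCKED) is false, which is the difference
I am asking about.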


> +                               }
> +                       }
> +

If flush_inode_page fails (i.e., the page isn't near the current work's
writeback range), we still do pageout even though it's async mode. Is
that your intention?
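
The same question as a small userspace sketch of the new dirty-page
branch, based only on the hunk quoted above. The struct, the enum and
the toy_* names are hypothetical, flush_ret stands for whatever
flush_inode_page() returns, and the other checks the real code makes
before pageout() (references, may_writepage and so on) are left out on
purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for a dirty page in this simplified model. */
enum toy_outcome {
	TOY_DEFERRED_TO_FLUSHER,	/* SetPageReclaim() + goto keep_locked */
	TOY_PAGEOUT			/* falls through to the old pageout() path */
};

/* Hypothetical snapshot of the conditions the quoted hunk tests. */
struct toy_page_state {
	bool file_cache;	/* models page_is_file_cache(page) */
	bool has_mapping;	/* models mapping != NULL */
	bool sync_mode;		/* models reclaim_mode == RECLAIM_MODE_SYNC */
	int flush_ret;		/* assumed return value of flush_inode_page() */
};

/* Mirrors the branch structure of the quoted hunk and nothing more. */
static enum toy_outcome toy_dirty_page_path(const struct toy_page_state *p)
{
	if (p->file_cache && p->has_mapping && !p->sync_mode) {
		if (p->flush_ret >= 0)
			return TOY_DEFERRED_TO_FLUSHER;
	}
	/* A negative flush_ret lands here even in async mode. */
	return TOY_PAGEOUT;
}
```

With { true, true, false, -1 } the model returns TOY_PAGEOUT, i.e. an
async reclaimer still ends up writing the page itself.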

-- 
Kind regards,
Minchan Kim

