From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>, Rik van Riel <riel@redhat.com>,
Greg Thelen <gthelen@google.com>,
"bsingharora@gmail.com" <bsingharora@gmail.com>,
Hugh Dickins <hughd@google.com>, Michal Hocko <mhocko@suse.cz>,
linux-mm@kvack.org, Mel Gorman <mgorman@suse.de>,
Ying Han <yinghan@google.com>,
"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Minchan Kim <minchan.kim@gmail.com>
Subject: Re: reclaim the LRU lists full of dirty/writeback pages
Date: Thu, 16 Feb 2012 13:44:45 +0100
Message-ID: <20120216124445.GB18613@quack.suse.cz>
In-Reply-To: <20120216040019.GB17597@localhost>

On Thu 16-02-12 12:00:19, Wu Fengguang wrote:
> On Tue, Feb 14, 2012 at 02:29:50PM +0100, Jan Kara wrote:
> > > > I wonder what happens if you run:
> > > > mkdir /cgroup/x
> > > > echo 100M > /cgroup/x/memory.limit_in_bytes
> > > > echo $$ > /cgroup/x/tasks
> > > >
> > > > for (( i = 0; i < 2; i++ )); do
> > > >   mkdir /fs/d$i
> > > >   for (( j = 0; j < 5000; j++ )); do
> > > >     dd if=/dev/zero of=/fs/d$i/f$j bs=1k count=50
> > > >   done &
> > > > done
> > >
> > > That's a very good case, thanks!
> > >
> > > > Because for small files the writearound logic won't help much...
> > >
> > > Right, it also means the native background work cannot be more I/O
> > > efficient than the pageout works, except for the overhead of the
> > > extra work items...
> > Yes, that's true.
> >
> > > > Also the number of work items queued might become interesting.
> > >
> > > It turns out that the 1024 mempool reservations are not exhausted at
> > > all (the below patch adds a trace_printk on alloc failure and it didn't
> > > trigger at all).
> > >
> > > Here are the representative iostat lines on XFS (full "iostat -kx 1 20" log attached):
> > >
> > > avg-cpu: %user %nice %system %iowait %steal %idle
> > > 0.80 0.00 6.03 0.03 0.00 93.14
> > >
> > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> > > sda 0.00 205.00 0.00 163.00 0.00 16900.00 207.36 4.09 21.63 1.88 30.70
> > >
> > > The attached dirtied/written progress graph looks interesting.
> > > Although the iostat disk utilization is low, the "dirtied" progress
> > > line is pretty straight and there is not a single congestion_wait
> > > event in the trace log, which makes me wonder if there are some
> > > unknown blocking issues in the way.
> > Interesting. I'd also expect us to block in the reclaim path. How fast
> > can the dd threads progress when there is no cgroup involved?
>
> I tried running the dd tasks in global context with
>
> echo $((100<<20)) > /proc/sys/vm/dirty_bytes
>
> and got mostly the same results on XFS:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.85 0.00 8.88 0.00 0.00 90.26
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 50.00 0.00 23036.00 921.44 9.59 738.02 7.38 36.90
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.95 0.00 8.95 0.00 0.00 90.11
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 854.00 0.00 99.00 0.00 19552.00 394.99 34.14 87.98 3.82 37.80
OK, so it seems that reclaiming pages in memcg reclaim acted as natural
throttling, similar to what balance_dirty_pages() does in the global case.
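(Roughly, the two throttling points I have in mind - just a sketch of the
call chains, not the exact paths:

    global case:  write()
                    -> generic_perform_write()
                    -> balance_dirty_pages_ratelimited()
                    -> balance_dirty_pages()   /* sleeps when over the dirty limits */

    memcg case:   write() or page fault
                    -> memcg charge hits memory.limit_in_bytes
                    -> try_to_free_mem_cgroup_pages()
                    -> shrink_page_list() runs into dirty/writeback pages
                    -> pageout() / waiting for writeback acts as the throttle
)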
> Interestingly, ext4 shows comparable throughput, but reports nearly
> 100% disk utilization:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.76 0.00 9.02 0.00 0.00 90.23
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 317.00 0.00 20956.00 132.21 28.57 82.71 3.16 100.10
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.82 0.00 8.95 0.00 0.00 90.23
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 402.00 0.00 24388.00 121.33 21.09 58.55 2.42 97.40
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.82 0.00 8.99 0.00 0.00 90.19
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 409.00 0.00 21996.00 107.56 15.25 36.74 2.30 94.10
The average request size is smaller, so maybe ext4 does more seeking.
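(Assuming avgrq-sz is in 512-byte sectors as usual for iostat, XFS averaged
roughly 395-920 sectors, i.e. ~200-460 KB per request, while ext4 is at
107-132 sectors, i.e. ~55-68 KB, so ext4 issues several times more, and
smaller, requests for about the same throughput.)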
> > > > Another common case to test - run 'slapadd' command in each cgroup to
> > > > create big LDAP database. That does pretty much random IO on a big mmaped
> > > > DB file.
> > >
> > > I've not used this. Will it need some configuration and data feed?
> > > fio looks more handy to me for emulating mmap random IO.
> > Yes, fio can generate random mmap IO. It's just that this is a real-life
> > workload, so it is not completely random: it happens on several files and
> > is also interleaved with other memory allocations from the DB. I can send
> > you the config files and data feed if you are interested.
>
> I'm very interested, thank you!
OK, I'll send it in a private email...
> > > > > +/*
> > > > > + * schedule writeback on a range of inode pages.
> > > > > + */
> > > > > +static struct wb_writeback_work *
> > > > > +bdi_flush_inode_range(struct backing_dev_info *bdi,
> > > > > +                      struct inode *inode,
> > > > > +                      pgoff_t offset,
> > > > > +                      pgoff_t len,
> > > > > +                      bool wait)
> > > > > +{
> > > > > +        struct wb_writeback_work *work;
> > > > > +
> > > > > +        if (!igrab(inode))
> > > > > +                return ERR_PTR(-ENOENT);
> > > > One technical note here: If the inode is deleted while it is queued, this
> > > > reference will keep it alive until the flusher thread gets to it. Then,
> > > > when the flusher thread puts its reference, the inode will get deleted in
> > > > flusher thread context. I don't see an immediate problem with that, but it
> > > > might be surprising sometimes. Another problem I see is that if you try to
> > > > unmount the filesystem while the work item is queued, you'll get EBUSY for
> > > > no apparent reason (for userspace).
> > >
> > > Yeah, we need to make umount work.
> > The positive thing is that if the inode is reaped while the work item is
> > queued, we know all that needed to be done is done. So we don't really need
> > to pin the inode.
>
> But I do need to make sure the *inode pointer does not point to some
> invalid memory at work exec time. Is this possible without raising
> ->i_count?
I was thinking about it, and what should work is that we keep the inode
reference in the work item, but in generic_shutdown_super() we go through
the work list and drop all work items for the superblock before calling
evict_inodes()...
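Something like this, just an untested sketch (work->inode being the field
your patch adds; works that have waiters on ->done would need extra care):

    static void bdi_remove_sb_writeback_work(struct backing_dev_info *bdi,
                                             struct super_block *sb)
    {
            struct wb_writeback_work *work, *tmp;
            LIST_HEAD(dispose);

            /* Move all queued range-writeback works for this sb aside */
            spin_lock_bh(&bdi->wb_lock);
            list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
                    if (work->inode && work->inode->i_sb == sb)
                            list_move(&work->list, &dispose);
            }
            spin_unlock_bh(&bdi->wb_lock);

            /* Drop the inode references and free the works outside the lock */
            while (!list_empty(&dispose)) {
                    work = list_first_entry(&dispose,
                                            struct wb_writeback_work, list);
                    list_del(&work->list);
                    iput(work->inode);      /* the reference from igrab() */
                    kfree(work);            /* or mempool_free(), depending on
                                             * how the work was allocated */
            }
    }

called from generic_shutdown_super() before evict_inodes().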
Honza