From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>, Rik van Riel <riel@redhat.com>,
Greg Thelen <gthelen@google.com>,
"bsingharora@gmail.com" <bsingharora@gmail.com>,
Hugh Dickins <hughd@google.com>, Michal Hocko <mhocko@suse.cz>,
linux-mm@kvack.org, Mel Gorman <mgorman@suse.de>,
Ying Han <yinghan@google.com>,
"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Minchan Kim <minchan.kim@gmail.com>
Subject: Re: reclaim the LRU lists full of dirty/writeback pages
Date: Thu, 16 Feb 2012 13:44:45 +0100
Message-ID: <20120216124445.GB18613@quack.suse.cz>
In-Reply-To: <20120216040019.GB17597@localhost>

On Thu 16-02-12 12:00:19, Wu Fengguang wrote:
> On Tue, Feb 14, 2012 at 02:29:50PM +0100, Jan Kara wrote:
> > > > I wonder what happens if you run:
> > > > mkdir /cgroup/x
> > > > echo 100M > /cgroup/x/memory.limit_in_bytes
> > > > echo $$ > /cgroup/x/tasks
> > > >
> > > > for (( i = 0; i < 2; i++ )); do
> > > >   mkdir /fs/d$i
> > > >   for (( j = 0; j < 5000; j++ )); do
> > > >     dd if=/dev/zero of=/fs/d$i/f$j bs=1k count=50
> > > >   done &
> > > > done
> > >
> > > That's a very good case, thanks!
> > >
> > > > Because for small files the writearound logic won't help much...
> > >
> > > Right, it also means the native background work cannot be more I/O
> > > efficient than the pageout works, except for the overhead of the
> > > extra work items...
> > Yes, that's true.
> >
> > > > Also the number of work items queued might become interesting.
> > >
> > > It turns out that the 1024 mempool reservations are not exhausted at
> > > all (the below patch adds a trace_printk on alloc failure and it didn't
> > > trigger at all).
> > >
> > > Here are the representative iostat lines on XFS (full "iostat -kx 1 20" log attached):
> > >
> > > avg-cpu: %user %nice %system %iowait %steal %idle
> > > 0.80 0.00 6.03 0.03 0.00 93.14
> > >
> > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> > > sda 0.00 205.00 0.00 163.00 0.00 16900.00 207.36 4.09 21.63 1.88 30.70
> > >
> > > The attached dirtied/written progress graph looks interesting.
> > > Although the iostat disk utilization is low, the "dirtied" progress
> > > line is pretty straight and there is not a single congestion_wait
> > > event in the trace log, which makes me wonder if there are some
> > > unknown blocking issues in the way.
> > Interesting. I'd also expect us to block in the reclaim path. How fast
> > can the dd threads progress when there is no cgroup involved?
>
> I tried running the dd tasks in global context with
>
> echo $((100<<20)) > /proc/sys/vm/dirty_bytes
>
> and got mostly the same results on XFS:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.85 0.00 8.88 0.00 0.00 90.26
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 50.00 0.00 23036.00 921.44 9.59 738.02 7.38 36.90
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.95 0.00 8.95 0.00 0.00 90.11
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 854.00 0.00 99.00 0.00 19552.00 394.99 34.14 87.98 3.82 37.80
OK, so it seems that reclaiming pages in memcg reclaim acted as natural
throttling, similar to what balance_dirty_pages() does in the global case.
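(Roughly, the two throttling points I have in mind - just a sketch of the
call chains, not the exact paths:

    global case:  write()
                    -> generic_perform_write()
                    -> balance_dirty_pages_ratelimited()
                    -> balance_dirty_pages()   /* sleeps when over the dirty limits */

    memcg case:   write() or page fault
                    -> memcg charge hits memory.limit_in_bytes
                    -> try_to_free_mem_cgroup_pages()
                    -> shrink_page_list() runs into dirty/writeback pages
                    -> pageout() / waiting for writeback acts as the throttle
)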
> Interestingly, ext4 shows comparable throughput, but reports nearly
> 100% disk utilization:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.76 0.00 9.02 0.00 0.00 90.23
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 317.00 0.00 20956.00 132.21 28.57 82.71 3.16 100.10
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.82 0.00 8.95 0.00 0.00 90.23
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 402.00 0.00 24388.00 121.33 21.09 58.55 2.42 97.40
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.82 0.00 8.99 0.00 0.00 90.19
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 409.00 0.00 21996.00 107.56 15.25 36.74 2.30 94.10
The average request size is smaller, so maybe ext4 does more seeking.
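(Assuming avgrq-sz is in 512-byte sectors as usual for iostat, XFS averaged
roughly 395-920 sectors, i.e. ~200-460 KB per request, while ext4 is at
107-132 sectors, i.e. ~55-68 KB, so ext4 issues several times more, and
smaller, requests for about the same throughput.)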
> > > > Another common case to test - run 'slapadd' command in each cgroup to
> > > > create big LDAP database. That does pretty much random IO on a big mmaped
> > > > DB file.
> > >
> > > I've not used this. Will it need some configuration and data feed?
> > > fio looks more handy to me for emulating mmap random IO.
> > Yes, fio can generate random mmap IO. It's just that this is a real-life
> > workload, so it is not completely random: it happens on several files and
> > is also interleaved with other memory allocations from the DB. I can send
> > you the config files and data feed if you are interested.
>
> I'm very interested, thank you!
OK, I'll send it in a private email...
> > > > > +/*
> > > > > + * schedule writeback on a range of inode pages.
> > > > > + */
> > > > > +static struct wb_writeback_work *
> > > > > +bdi_flush_inode_range(struct backing_dev_info *bdi,
> > > > > +                      struct inode *inode,
> > > > > +                      pgoff_t offset,
> > > > > +                      pgoff_t len,
> > > > > +                      bool wait)
> > > > > +{
> > > > > +        struct wb_writeback_work *work;
> > > > > +
> > > > > +        if (!igrab(inode))
> > > > > +                return ERR_PTR(-ENOENT);
> > > > One technical note here: If the inode is deleted while it is queued, this
> > > > reference will keep it alive until the flusher thread gets to it. Then,
> > > > when the flusher thread puts its reference, the inode will get deleted in
> > > > flusher thread context. I don't see an immediate problem with that, but it
> > > > might be surprising sometimes. Another problem I see is that if you try to
> > > > unmount the filesystem while the work item is queued, you'll get EBUSY for
> > > > no apparent reason (for userspace).
> > >
> > > Yeah, we need to make umount work.
> > The positive thing is that if the inode is reaped while the work item is
> > queued, we know all that needed to be done is done. So we don't really need
> > to pin the inode.
>
> But I do need to make sure the *inode pointer does not point to some
> invalid memory at work exec time. Is this possible without raising
> ->i_count?
I was thinking about it, and what should work is that we keep the inode
reference in the work item, but in generic_shutdown_super() we go through
the work list and drop all work items for the superblock before calling
evict_inodes()...
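Something like this, just an untested sketch (work->inode being the field
your patch adds; works that have waiters on ->done would need extra care):

    static void bdi_remove_sb_writeback_work(struct backing_dev_info *bdi,
                                             struct super_block *sb)
    {
            struct wb_writeback_work *work, *tmp;
            LIST_HEAD(dispose);

            /* Move all queued range-writeback works for this sb aside */
            spin_lock_bh(&bdi->wb_lock);
            list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
                    if (work->inode && work->inode->i_sb == sb)
                            list_move(&work->list, &dispose);
            }
            spin_unlock_bh(&bdi->wb_lock);

            /* Drop the inode references and free the works outside the lock */
            while (!list_empty(&dispose)) {
                    work = list_first_entry(&dispose,
                                            struct wb_writeback_work, list);
                    list_del(&work->list);
                    iput(work->inode);      /* the reference from igrab() */
                    kfree(work);            /* or mempool_free(), depending on
                                             * how the work was allocated */
            }
    }

called from generic_shutdown_super() before evict_inodes().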
Honza