From: Mel Gorman <mel@csn.ul.ie>
To: Ying Han <yinghan@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Christoph Lameter <cl@linux.com>,
Wu Fengguang <fengguang.wu@intel.com>,
Andi Kleen <ak@linux.intel.com>, Hugh Dickins <hughd@google.com>,
Rik van Riel <riel@redhat.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Tejun Heo <tj@kernel.org>,
linux-mm@kvack.org
Subject: Re: [PATCH 1/4] Add kswapd descriptor.
Date: Wed, 8 Dec 2010 12:19:51 +0000
Message-ID: <20101208121951.GK5422@csn.ul.ie>
In-Reply-To: <AANLkTin+p5WnLjMkr8Qntkt4fR1+fdY=t6hkvV6G8Mok@mail.gmail.com>
On Tue, Dec 07, 2010 at 05:24:12PM -0800, Ying Han wrote:
> On Tue, Dec 7, 2010 at 4:39 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 7 Dec 2010 09:28:01 -0800
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >> Potentially there will
> >> > also be a very large number of new IO sources. I confess I haven't read the
> >> > thread yet so maybe this has already been thought of but it might make sense
> >> > to have a 1:N relationship between kswapd and memcgroups and cycle between
> >> > containers. The difficulty will be a latency between when kswapd wakes up
> >> > and when a particular container is scanned. The closer the ratio is to 1:1,
> >> > the less the latency will be but the higher the contention on the LRU lock
> >> > and IO will be.
> >>
> >> No, we haven't talked about the mapping anywhere in the thread. Having
> >> many kswapd threads at the same time isn't a problem as long as there is
> >> no locking contention (e.g., 1k kswapd threads on a 1k fake NUMA node
> >> system). So breaking up the zone->lru_lock should work.
> >>
> >
> > I'm the one who made zone->lru_lock shared. A per-memcg lock would make
> > the maintenance of memcg very bad and would add many races.
> > Otherwise we need to make memcg's LRU not synchronized with the zone's LRU;
> > IOW, we need to have a completely independent LRU.
> >
> > I'd like to limit the number of kswapd-for-memcg if zone->lru lock contention
> > is problematic. memcg _can_ work without background reclaim.
>
> >
> > How about adding a per-node kswapd-for-memcg that reclaims pages at a
> > memcg's request? Something like:
> >
> > memcg_wake_kswapd(struct mem_cgroup *mem)
> > {
> >         do {
> >                 nid = select_victim_node(mem);
> >                 /* ask kswapd to reclaim memcg's memory */
> >                 ret = memcg_kswapd_queue_work(nid, mem); /* may return -EBUSY if very busy */
> >         } while()
> > }
> >
> > This will keep lock contention to a minimum. Anyway, using too much CPU for
> > this unnecessary_but_good_for_performance_function is bad. Throttling is required.
>
> I don't see the problem of one-kswapd-per-cgroup here since there will
> be no performance cost if they are not running.
>
*If* they are not running. There is potentially a massive cost here.
> I haven't measured the lock contention and cputime for each kswapd
> running. Theoretically it would be a problem
> if thousands of cgroups are configured on the host and all of them
> are under memory pressure.
>
It's not just the locking. If all of these kswapds are running and each
container has a small number of dirty pages, we potentially have tens or
hundreds of kswapd each queueing a small number of pages for IO. Granted,
if we reach the point where these IO sources are delegated to flusher threads
it would be less of a problem but it's not how things currently behave.
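
As a rough illustration of the IO concern above, here is a toy user-space model in C. It only counts IO submissions and says nothing about how the kernel's writeback actually behaves. The function names and the batch size are hypothetical and do not come from the patch set.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not kernel code. All names here are made up for illustration.
 *
 * Case 1: every container has its own kswapd, and each one queues its own
 * (small) set of dirty pages. Each container with dirty pages is counted
 * as one separate IO source.
 */
static size_t submissions_per_container(const unsigned *dirty, size_t nr)
{
	size_t n = 0, i;

	for (i = 0; i < nr; i++)
		if (dirty[i] > 0)
			n++;
	return n;
}

/*
 * Case 2: writeback is delegated to a single flusher that coalesces all
 * dirty pages into batches of `batch` pages (an assumed parameter).
 */
static size_t submissions_delegated(const unsigned *dirty, size_t nr,
				    unsigned batch)
{
	unsigned long total = 0;
	size_t i;

	assert(batch > 0);
	for (i = 0; i < nr; i++)
		total += dirty[i];
	/* Round up: a partial batch still costs one submission. */
	return (size_t)((total + batch - 1) / batch);
}
```

With 100 containers holding 8 dirty pages each and an assumed batch of 256 pages, the first case counts 100 small IO sources and the second counts 4 batched submissions.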
> We can either optimize the locking or make each kswapd smarter (hold
> the lock less time).
Holding the lock for less time might allow other kswapd instances to make small
amounts of progress but they'll still be wasting a lot of CPU spinning on
the lock. It's not a simple issue, which is why I think we need either a)
a means of telling kswapd which containers it should be reclaiming from
or b) a 1:N mapping of kswapd instances to containers from the outset.
Otherwise users with large numbers of containers will see severe slowdowns
under memory pressure, whereas previously they would have experienced stalls
in individual containers.
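
A minimal user-space sketch of what a 1:N mapping with round-robin cycling could look like. This is a model under stated assumptions, not the patch's implementation. NR_WORKERS, struct container, struct worker, owner_of() and next_victim() are all hypothetical names, and the static modulo assignment is just one possible mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in the kernel. */
#define NR_WORKERS 4	/* fixed pool of reclaim workers (the "1" side) */

struct container {
	int id;
	int under_pressure;	/* non-zero if this container needs reclaim */
};

struct worker {
	size_t cursor;		/* next owned slot to examine */
};

/* Static 1:N mapping: container i is owned by worker i % NR_WORKERS. */
static size_t owner_of(size_t container_idx)
{
	return container_idx % NR_WORKERS;
}

/*
 * Return the index of the next container under pressure that worker `w`
 * owns, scanning round-robin from the worker's cursor so that no owned
 * container is starved. Returns -1 if none of them needs reclaim.
 */
static long next_victim(struct worker *workers, size_t w,
			const struct container *c, size_t nr)
{
	size_t owned, i;

	assert(w < NR_WORKERS);
	/* Worker w owns indices w, w + NR_WORKERS, w + 2 * NR_WORKERS, ... */
	owned = nr > w ? (nr - w + NR_WORKERS - 1) / NR_WORKERS : 0;

	for (i = 0; i < owned; i++) {
		size_t slot = (workers[w].cursor + i) % owned;
		size_t idx = w + slot * NR_WORKERS;

		if (c[idx].under_pressure) {
			/* Resume after this container on the next call. */
			workers[w].cursor = (slot + 1) % owned;
			return (long)idx;
		}
	}
	return -1;
}
```

The trade-off is the one described earlier in the thread: a larger NR_WORKERS shortens the wait before a given container is scanned, but puts more threads in contention for the LRU lock and IO.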
> My current plan is to have the
> one-kswapd-per-cgroup in the V2 patch w/ select_victim_node, and the
> optimization for this will come in a following patchset.
>
Will read when they come out :)
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab