Subject: Re: memcg: fix fatal livelock in kswapd
From: Balbir Singh
To: Johannes Weiner
Cc: James Bottomley, Chris Mason, linux-fsdevel, linux-mm, linux-kernel,
    Paul Menage, Li Zefan, containers@lists.linux-foundation.org
Date: Sun, 8 May 2011 03:29:00 +0530
In-Reply-To: <20110502224838.GB10278@cmpxchg.org>

I missed this email in my inbox; I just saw it and am responding.

On Tue, May 3, 2011 at 4:18 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hi,
>
> On Mon, May 02, 2011 at 03:07:29PM -0500, James Bottomley wrote:
> > The fatal livelock in kswapd, reported in this thread:
> >
> > http://marc.info/?t=130392066000001
> >
> > is mitigable if we prevent the cgroups code from being so aggressive
> > in its zone shrinking (by reducing its default shrink from 0
> > [everything] to DEF_PRIORITY [some things]).  This will have an
> > obvious knock-on effect on cgroup accounting, but it's better than
> > hanging systems.
>
> Actually, it's not that obvious.  At least not to me.  I added Balbir,
> who added said comment and code in the first place, to CC.  Here is the
> comment in full quote:
>
>         /*
>          * NOTE: Although we can get the priority field, using it
>          * here is not a good idea, since it limits the pages we can scan.
>          * If we don't reclaim here, the shrink_zone from balance_pgdat
>          * will pick up pages from other mem cgroups as well.  We hack
>          * the priority and make it zero.
>          */
>
> The idea is that if one memcg is above its soft limit, we prefer
> reducing pages from this memcg over reclaiming random other pages,
> including those of other memcgs.

My comment and code were based on the observations from my tests.  With
DEF_PRIORITY we see scan >> priority in get_scan_count(); since we know
exactly how much we are over the soft limit, it makes sense to go after
those pages so that normal balancing can be restored.
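To put a number on that shift (a simplified illustration with a made-up
scan_target() helper, not the actual get_scan_count(), which also splits
the target between the anon and file LRUs):

        #define DEF_PRIORITY    12

        /*
         * Rough model of the scaling: each reclaim pass asks for about
         * lru_size >> priority pages from an LRU list.
         */
        static unsigned long scan_target(unsigned long lru_size, int priority)
        {
                return lru_size >> priority;
        }

With 1,000,000 LRU pages, DEF_PRIORITY asks for ~244 pages per pass and
works its way down, while the hacked priority of 0 asks for all 1,000,000
in a single call.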
> But the code flow looks like this:
>
>         balance_pgdat
>           mem_cgroup_soft_limit_reclaim
>             mem_cgroup_shrink_node_zone
>               shrink_zone(0, zone, &sc)
>           shrink_zone(prio, zone, &sc)
>
> so the success of the inner memcg shrink_zone does at least not
> explicitly result in the outer, global shrink_zone steering clear of
> other memcgs' pages.

Yes, but it allows soft reclaim to know what to target first for success.
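For reference, the inner call in that chain is where the zero gets
hard-coded.  From memory it looks roughly like the sketch below
(initializers trimmed, and the exact 2.6.38-era signature may differ
slightly); the commented-out line paraphrases James's suggestion and is
not an actual patch:

        unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
                                                  gfp_t gfp_mask, bool noswap,
                                                  unsigned int swappiness,
                                                  struct zone *zone)
        {
                struct scan_control sc = {
                        .nr_to_reclaim = SWAP_CLUSTER_MAX,
                        .may_swap      = !noswap,
                        .swappiness    = swappiness,
                        .mem_cgroup    = mem,  /* restrict the scan to this memcg */
                };

                /* priority 0: scan every LRU page of this memcg in this zone */
                shrink_zone(0, zone, &sc);

                /* shrink_zone(DEF_PRIORITY, zone, &sc);  -- James's suggestion */

                return sc.nr_reclaimed;
        }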

> It just tries to move the pressure of balancing the zones to the memcg
> with the biggest soft limit excess.  That can only really work if the
> memcg is a large enough contributor to the zone's total number of LRU
> pages, though, and looks very likely to hit the exceeding memcg too
> hard in other cases.
>
> I am very much for removing this hack.  There is still more scan
> pressure applied to memcgs in excess of their soft limit even if the
> extra scan is happening at a sane priority level.  And the fact that
> global reclaim operates completely unaware of memcgs is a different
> story.
>
> However, this code came into place with v2.6.31-8387-g4e41695.  Why is
> it only now showing up?
>
> You also wrote in that thread that this happens on a standard F15
> installation.  On the F15 I am running here, systemd does not
> configure memcgs, however.  Did you manually configure memcgs and set
> soft limits?  Because I wonder how it ended up in soft limit reclaim
> in the first place.

I am running F15 as well, but have never hit the problem so far.  I am
surprised to see the stack posted on that thread; it seems like you
never explicitly enabled anything to wake up the memcg beast :)

Balbir