Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Michal Hocko <mhocko@suse.cz>
To: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
Date: Fri, 17 Jan 2014 16:41:43 +0100	[thread overview]
Message-ID: <20140117154143.GF5356@dhcp22.suse.cz> (raw)
In-Reply-To: <alpine.LSU.2.11.1401161011110.1321@eggly.anvils>

On Thu 16-01-14 11:15:36, Hugh Dickins wrote:
> On Thu, 16 Jan 2014, Michal Hocko wrote:
> > From 543df5c82f6eec622f669ea322ba6ff03924fded Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Thu, 16 Jan 2014 16:17:13 +0100
> > Subject: [PATCH] memcg: fix css reference leak from mem_cgroup_iter
> > 
> > 19f39402864e (memcg: simplify mem_cgroup_iter) has introduced a css
> > refrence leak (thus memory leak) because mem_cgroup_iter makes sure it
> > doesn't put a css reference on the root of the tree walk. The mentioned
> > commit however dropped the root check when the css reference is taken
> > while it keept the css_put optimization fora the root in place.
> 
> I don't think that's quite right, actually - and I think it's all
> so confusing that we do need to be pedantic and set it down right.

You are right!

> I spent quite a while yesterday trying out my "cg m" on 3.10, 3.11,

I have done the same now (with a different test - simple mem_eater
with hard_limit really low to trigger reclaim and trace_printk in both
mem_cgroup_css_{alloc,free}) and you are right that 3.10 and 3.11 were
OK regarding the leak. Which is a relief...
3.12 resp. mmotm which I was testing on previously has the leak though.
So there must have been some other escape part which didn't allow
css_tryget on the root.

> 3.12 and 3.13-rc8 on this laptop: first just counting mem_cgroup_allocs
> and frees (if I could get that far without hanging or crashing), then
> also with your patch in (on 3.12 and 3.13-rc8) or the completely
> different patch appended at the bottom (on 3.10 and 3.11), checking
> for leftover mem_cgroups afterwards.
> 
> I saw no evidence of mem_cgroup leakage on 3.10 and 3.11, which had
> 	/*
> 	 * Root is not visited by cgroup iterators so it needs an
> 	 * explicit visit.
> 	 */
> 	if (!last_visited)
> 		return root;
> at the head of __mem_cgroup_iter_next(), removed around the same
> time as changeover from prev_cgroup etc to prev_css etc in 3.12.

Ohh, now I get it. Cgroup iterators originally didn't visit the root and
all the callers had to special case it. Then Tejun changed them to visit
root as well by bd8815a6d802 (cgroup: make css_for_each_descendant() and
friends include the origin css in the iteration) which was a good change
but I didn't realize it would be a problem when I reviewed it. Now it
makes sense again.

> I don't believe 19f39402864e was responsible for a reference leak,
> that came later.  But I think it was responsible for the original
> endless iteration (shrink_zone going around and around getting root
> again and again from mem_cgroup_iter).

So your hang is not within mem_cgroup_iter but you are getting root all
the time without any way out?

[3.10 code base]
shrink_zone
						[rmdir root]
  mem_cgroup_iter(root, NULL, reclaim)
    // prev = NULL
    rcu_read_lock()
    last_visited = iter->last_visited	// gets root || NULL
    css_tryget(last_visited) 		// failed
    last_visited = NULL			[1]
    memcg = root = __mem_cgroup_iter_next(root, NULL)
    iter->last_visited = root;
    reclaim->generation = iter->generation

 mem_cgroup_iter(root, root, reclaim)
   // prev = root
   rcu_read_lock
    last_visited = iter->last_visited	// gets root
    css_tryget(last_visited) 		// failed
    [1]

So we indeed can loop here without any progress. I just fail
to see how my patch could help. We even do not get down to
cgroup_next_descendant_pre.

Or am I missing something?

The following should fix this kind of endless loop:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 194721839cf5..168e5abcca92 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1221,7 +1221,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				smp_rmb();
 				last_visited = iter->last_visited;
 				if (last_visited &&
-				    !css_tryget(&last_visited->css))
+				    last_visited != root &&
+				     !css_tryget(&last_visited->css))
 					last_visited = NULL;
 			}
 		}
@@ -1229,7 +1230,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
-			if (last_visited)
+			if (last_visited && last_visited != root)
 				css_put(&last_visited->css);
 
 			iter->last_visited = memcg;

Not that I like it much :/

> But beware of my conclusion, please check for yourself: with my
> separate kbuilds in separate /cg/cg/? memcgs, what "cg m" is doing
> is very simple and segregated, can hardly be called testing reclaim
> iteration, so I hope you have something better to check it.  Plus
> I was testing on 3.10 and 3.11 vanilla, not latest stable versions.
> 
> (If I'm very honest, I'll admit that I still did not see that hang
> on 3.11 vanilla:

But I assume you can still reproduce it with 3.10, right?
I am sorry but I didn't get to run your script yet.

> what I hit was a crash in kfree instead, but the
> same patch got rid of that too. 

Care to post an oops?

> Of course I ought to investigate
> further, but at some point I just have to give up and move on,
> there's just too much breakage to chase all over the kernel...)
> 
> > This means that css_put is not called and so css along with mem_cgroup
> > and other cgroup internal object tied by css lifetime are never freed.
> > 
> > Fix the issue by reintroducing root check in __mem_cgroup_iter_next.
> > 
> > This patch also fixes issue reported by Hugh Dickins when
> > mem_cgroup_iter might end up in an endless loop because a group which is
> > under hard limit reclaim is removed in parallel with iteration.
> > __mem_cgroup_iter_next would always return NULL because css_tryget on
> > the root (reclaimed memcg) would fail and there are no other memcg in
> > the hierarchy. prev == NULL in mem_cgroup_iter would prevent break out
> > from the root and so the while (!memcg) loop would never terminate.
> > as css_tryget is no longer called for the root of the tree walk this
> > doesn't happen anymore.
> > 
> > [hughd@google.com: Fixed root vs. root->css fix]
> > [hughd@google.com: Get rid of else branch because it is ugly]
> 
> Thanks for your courtesy!  But let's not clutter it with those two.
> 
> > <Hugh's-selection>-by: Hugh Dickins <hughd@google.com>
> 
> You already credited me above, but "Reported-by:" here if you insist.
> 
> > Cc: stable@vger.kernel.org # 3.10+
> 
> Well, I'm okay with that, if we use that as a way to shoehorn in the
> patch at the bottom instead for 3.10 and 3.11 stables.

So far I do not see how it would make a change for those two kernels as
they have the special handling for root.

[...]
> "Equivalent" patch for 3.10 or 3.11: fixing similar hangs but no leakage.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> --- v3.10/mm/memcontrol.c	2013-06-30 15:13:29.000000000 -0700
> +++ linux/mm/memcontrol.c	2014-01-15 18:18:24.476566659 -0800
> @@ -1226,7 +1226,8 @@ struct mem_cgroup *mem_cgroup_iter(struc
>  			}
>  		}
>  
> -		memcg = __mem_cgroup_iter_next(root, last_visited);
> +		if (!prev || last_visited)
> +			memcg = __mem_cgroup_iter_next(root, last_visited);

I am confused. What would change between those two calls to change the
outcome? The function doesn't have any internal state.

>  
>  		if (reclaim) {
>  			if (last_visited)

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2014-01-17 15:41 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-01-14  1:50 [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Hugh Dickins
2014-01-14  1:52 ` [PATCH 2/3] mm/memcg: fix endless iteration in reclaim Hugh Dickins
2014-01-14 13:27   ` Michal Hocko
2014-01-14 13:34     ` Michal Hocko
2014-01-14 14:26     ` Michal Hocko
2014-01-14 20:42       ` Hugh Dickins
2014-01-15  9:58         ` Michal Hocko
2014-01-15 12:17           ` Michal Hocko
2014-01-15 21:24             ` Hugh Dickins
2014-01-16  8:17               ` Michal Hocko
2014-01-16 15:22                 ` Michal Hocko
2014-01-16 19:15                   ` Hugh Dickins
2014-01-17 15:41                     ` Michal Hocko [this message]
2014-01-21  5:16                       ` Hugh Dickins
2014-01-21  8:34                         ` Michal Hocko
2014-01-21 10:45                           ` [PATCH -mm 1/2] memcg: fix endless loop caused by mem_cgroup_iter Michal Hocko
2014-01-21 10:45                             ` [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter Michal Hocko
2014-01-21 19:42                               ` Andrew Morton
2014-01-21 21:18                                 ` Hugh Dickins
2014-01-22  8:27                                   ` Michal Hocko
2014-01-23 10:42                                     ` Hugh Dickins
2014-01-23 11:09                                       ` Michal Hocko
2014-01-23 12:53                                         ` Hugh Dickins
2014-01-22  8:12                                 ` Michal Hocko
2014-01-14  1:54 ` [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized Hugh Dickins
2014-01-14 13:30   ` Michal Hocko
2014-01-14 14:29     ` Tejun Heo
2014-01-15  8:20       ` Michal Hocko
2014-01-15  8:21   ` Michal Hocko
2014-01-14 13:03 ` [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140117154143.GF5356@dhcp22.suse.cz \
    --to=mhocko@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox