From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	stable@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: memcontrol: fix possible memcg leak due to interrupted reclaim
Date: Sat, 12 Dec 2015 22:18:55 +0300	[thread overview]
Message-ID: <20151212191855.GE28521@esperanza> (raw)
In-Reply-To: <20151212164540.GA7107@cmpxchg.org>

On Sat, Dec 12, 2015 at 11:45:40AM -0500, Johannes Weiner wrote:
> On Sat, Dec 12, 2015 at 04:34:02PM +0300, Vladimir Davydov wrote:
> > Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
> > once enough pages have been reclaimed, in which case, in contrast to a
> > full round-trip over a cgroup sub-tree, the current position stored in
> > mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
> > and so is left holding the reference to the last scanned cgroup. If the
> > target cgroup does not get scanned again (we might have just reclaimed
> > the last page or all processes might exit and free their memory
> > voluntarily), we will leak it, because there is nobody to put the
> > reference held by the iterator.
> > 
> > The problem is easy to reproduce by running the following command
> > sequence in a loop:
> > 
> >     mkdir /sys/fs/cgroup/memory/test
> >     echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> >     echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> >     memhog 150M
> >     echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> >     rmdir test
> > 
> > The cgroups generated by it will never get freed.
> > 
> > This patch fixes this issue by making mem_cgroup_iter_break() clear
> > mem_cgroup_reclaim_iter->position in case it points to the memory cgroup
> > we interrupted reclaim on.
> > 
> > Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
> > Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: <stable@vger.kernel.org> # 3.19+
> 
> Good catch!
> 
> The downside of not remembering the last position across incomplete
> reclaim cycles is that we always restart at the same position. If a
> cgroup has a certain number of children, it's conceivable that we
> might never get to the youngest cgroups in the subtree anymore.

That's true, scratch this patch.

> 
> So I'd be more comfortable removing incomplete reclaim walks. It was a
> nice little optimization we could do while supporting interleave walks,
> but it's not necessary: when global reclaim can walk the entire system,
> limit reclaim should be okay walking subtrees exhaustively.
> 
> How about this?
...
> @@ -2425,21 +2425,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  				   sc->nr_scanned - scanned,
>  				   sc->nr_reclaimed - reclaimed);
>  
> -			/*
> -			 * Direct reclaim and kswapd have to scan all memory
> -			 * cgroups to fulfill the overall scan target for the
> -			 * zone.
> -			 *
> -			 * Limit reclaim, on the other hand, only cares about
> -			 * nr_to_reclaim pages to be reclaimed and it will
> -			 * retry with decreasing priority if one round over the
> -			 * whole hierarchy is not sufficient.
> -			 */
> -			if (!global_reclaim(sc) &&
> -					sc->nr_reclaimed >= sc->nr_to_reclaim) {
> -				mem_cgroup_iter_break(root, memcg);
> -				break;
> -			}

Dunno. I like it, because it's simple and clean, but I'm unsure: can't
it result in noticeable lag when performing memcg reclaim for deep
hierarchies? For global reclaim we have kswapd, which tries to keep the
system within bounds so as to avoid direct reclaim altogether. Memcg
lacks such a thing, and interleaved walks look like a good compensation
for it.

Alternatively, we could avoid taking a reference to iter->position and
make use of the css_released cgroup callback to invalidate reclaim
iterators. With this approach, upper-level cgroups shouldn't receive
unfairly high pressure in comparison to their children. Something like
this, maybe?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 87af26a24491..fcc5133210a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -859,14 +859,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		if (prev && reclaim->generation != iter->generation)
 			goto out_unlock;
 
-		do {
+		while (1) {
 			pos = READ_ONCE(iter->position);
-			/*
-			 * A racing update may change the position and
-			 * put the last reference, hence css_tryget(),
-			 * or retry to see the updated position.
-			 */
-		} while (pos && !css_tryget(&pos->css));
+			if (!pos || css_tryget(&pos->css))
+				break;
+			cmpxchg(&iter->position, pos, NULL);
+		}
 	}
 
 	if (pos)
@@ -912,12 +910,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	}
 
 	if (reclaim) {
-		if (cmpxchg(&iter->position, pos, memcg) == pos) {
-			if (memcg)
-				css_get(&memcg->css);
-			if (pos)
-				css_put(&pos->css);
-		}
+		cmpxchg(&iter->position, pos, memcg);
 
 		/*
 		 * pairs with css_tryget when dereferencing iter->position
@@ -955,6 +948,28 @@ void mem_cgroup_iter_break(struct mem_cgroup *root,
 		css_put(&prev->css);
 }
 
+static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
+{
+	struct mem_cgroup *memcg = dead_memcg;
+	struct mem_cgroup_reclaim_iter *iter;
+	struct mem_cgroup_per_zone *mz;
+	int nid, zid;
+	int i;
+
+	while ((memcg = parent_mem_cgroup(memcg))) {
+		for_each_node(nid) {
+			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
+				for (i = 0; i <= DEF_PRIORITY; i++) {
+					iter = &mz->iter[i];
+					cmpxchg(&iter->position,
+						dead_memcg, NULL);
+				}
+			}
+		}
+	}
+}
+
 /*
  * Iteration constructs for visiting all cgroups (under a tree).  If
  * loops are exited prematurely (break), mem_cgroup_iter_break() must
@@ -4375,6 +4390,13 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	wb_memcg_offline(memcg);
 }
 
+static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	invalidate_reclaim_iterators(memcg);
+}
+
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -5229,6 +5251,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_alloc = mem_cgroup_css_alloc,
 	.css_online = mem_cgroup_css_online,
 	.css_offline = mem_cgroup_css_offline,
+	.css_released = mem_cgroup_css_released,
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
 	.can_attach = mem_cgroup_can_attach,
