linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alexander Fedorov <halcien@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	Shakeel Butt <shakeelb@google.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Possible race in obj_stock_flush_required() vs drain_obj_stock()
Date: Wed, 12 Oct 2022 13:23:11 -0400	[thread overview]
Message-ID: <Y0b3/wGDBL7GaNWJ@cmpxchg.org> (raw)
In-Reply-To: <Yzxc0jzOnAu667F8@P9FQF9L96D.lan>

On Tue, Oct 04, 2022 at 09:18:26AM -0700, Roman Gushchin wrote:
> On Mon, Oct 03, 2022 at 06:01:35PM +0300, Alexander Fedorov wrote:
> > On 03.10.2022 17:27, Michal Hocko wrote:
> > > On Mon 03-10-22 17:09:15, Alexander Fedorov wrote:
> > >> On 03.10.2022 16:32, Michal Hocko wrote:
> > >>> On Mon 03-10-22 15:47:10, Alexander Fedorov wrote:
> > >>>> @@ -3197,17 +3197,30 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> > >>>>  		stock->nr_bytes = 0;
> > >>>>  	}
> > >>>>  
> > >>>> -	obj_cgroup_put(old);
> > >>>> +	/*
> > >>>> +	 * Clear pointer before freeing memory so that
> > >>>> +	 * drain_all_stock() -> obj_stock_flush_required()
> > >>>> +	 * does not see a freed pointer.
> > >>>> +	 */
> > >>>>  	stock->cached_objcg = NULL;
> > >>>> +	obj_cgroup_put(old);
> > >>>
> > >>> Do we need barrier() or something else to ensure there is no reordering?
> > >>> I am not reallyu sure what kind of barriers are implied by the pcp ref
> > >>> counting.
> > >>
> > >> obj_cgroup_put() -> kfree_rcu() -> synchronize_rcu() should take care
> > >> of this:
> > > 
> > > This is a very subtle guarantee. Also it would only apply if this is the
> > > last reference, right?
> > 
> > Hmm, yes, for the last reference only, also not sure about pcp ref
> > counter ordering rules for previous references.
> > 
> > > Is there any reason to not use
> > > 	WRITE_ONCE(stock->cached_objcg, NULL);
> > > 	obj_cgroup_put(old);
> > > 
> > > IIRC this should prevent any reordering. 
> > 
> > Now that I think about it we actually must use WRITE_ONCE everywhere
> > when writing cached_objcg because otherwise compiler might split the
> > pointer-sized store into several smaller-sized ones (store tearing),
> > and obj_stock_flush_required() would read garbage instead of pointer.
> >
> > And thinking about memory barriers, maybe we need them too alongside
> > WRITE_ONCE when setting pointer to non-null value?  Otherwise
> > drain_all_stock() -> obj_stock_flush_required() might read old data.
> > Since that's exactly what rcu_assign_pointer() does, it seems
> > that we are going back to using rcu_*() primitives everywhere?
> 
> Hm, Idk, I'm still somewhat resistant to the idea of putting rcu primitives,
> but maybe it's the right thing. Maybe instead we should always schedule draining
> on all cpus instead and perform a cpu-local check and bail out if a flush is not
> required? Michal, Johannes, what do you think?

I agree it's overkill.

This is a speculative check, and we don't need any state coherency,
just basic lifetime. READ_ONCE should fully address this problem. That
said, I think the code could be a bit clearer and better documented.

How about the below?

(Nevermind the ifdef, I'm working on removing CONFIG_MEMCG_KMEM
altogether, as it's a really strange way to say !SLOB at this point)

---

From 22855af38b116ec030286975ed2aa06851680296 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 12 Oct 2022 12:59:07 -0400
Subject: [PATCH] mm: memcontrol: fix NULL deref race condition during cgroup
 deletion

Alexander Fedorov reports a race condition between two concurrent
stock draining operations, where the first one clears the stock's obj
pointer between the pointer test and deref of the second. Analysis:

1) First CPU:
   css_killed_work_fn() -> mem_cgroup_css_offline() ->
drain_all_stock() -> obj_stock_flush_required()
	if (stock->cached_objcg) {

This check sees a non-NULL pointer for *another* CPU's `memcg_stock` instance.

2) Second CPU:
  css_free_rwork_fn() -> __mem_cgroup_free() -> free_percpu() ->
obj_cgroup_uncharge() -> drain_obj_stock()
It frees `cached_objcg` pointer in its own `memcg_stock` instance:
	struct obj_cgroup *old = stock->cached_objcg;
	< ... >
	obj_cgroup_put(old);
	stock->cached_objcg = NULL;

3) First CPU continues after the 'if' check and re-reads the pointer
again, now it is NULL and dereferencing it leads to kernel panic:
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
				     struct mem_cgroup *root_memcg)
{
< ... >
	if (stock->cached_objcg) {
		memcg = obj_cgroup_memcg(stock->cached_objcg);

There is already RCU protection in place to ensure lifetime. Add the
missing READ_ONCE to the cgroup pointers to fix the TOCTOU, and
consolidate and document the speculative code.

Reported-by: Alexander Fedorov <halcien@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 44 ++++++++++++++++++++------------------------
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..09ac2f8991ee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2190,8 +2190,6 @@ static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
 static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg);
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
 
 #else
@@ -2199,11 +2197,6 @@ static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	return NULL;
 }
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg)
-{
-	return false;
-}
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
 {
 }
@@ -2339,13 +2332,30 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		struct mem_cgroup *memcg;
 		bool flush = false;
 
+		/*
+		 * Speculatively check up front if this CPU has any
+		 * cached charges that belong to the specified
+		 * root_memcg. The state may change from under us -
+		 * which is okay, because the draining itself is a
+		 * best-effort operation. Just ensure lifetime of
+		 * whatever we end up looking at.
+		 */
 		rcu_read_lock();
-		memcg = stock->cached;
+		memcg = READ_ONCE(stock->cached);
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
-		else if (obj_stock_flush_required(stock, root_memcg))
-			flush = true;
+#ifdef CONFIG_MEMCG_KMEM
+		else {
+			struct obj_cgroup *objcg;
+
+			objcg = READ_ONCE(stock->cached_objcg);
+			if (objcg && stock->nr_bytes &&
+			    mem_cgroup_is_descendant(obj_cgroup_memcg(objcg),
+						     root_memcg))
+				flush = true;
+		}
+#endif
 		rcu_read_unlock();
 
 		if (flush &&
@@ -3297,20 +3307,6 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 	return old;
 }
 
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg)
-{
-	struct mem_cgroup *memcg;
-
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
-		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
-			return true;
-	}
-
-	return false;
-}
-
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
-- 
2.37.3



  reply	other threads:[~2022-10-12 17:23 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-30 14:06 Alexander Fedorov
2022-09-30 18:26 ` Roman Gushchin
2022-10-01 12:38   ` Alexander Fedorov
2022-10-02 16:16     ` Roman Gushchin
2022-10-03 12:47       ` Alexander Fedorov
2022-10-03 13:32         ` Michal Hocko
2022-10-03 14:09           ` Alexander Fedorov
2022-10-03 14:27             ` Michal Hocko
2022-10-03 15:01               ` Alexander Fedorov
2022-10-04 16:18                 ` Roman Gushchin
2022-10-12 17:23                   ` Johannes Weiner [this message]
2022-10-12 18:49                     ` Roman Gushchin
2022-10-12 19:18                       ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y0b3/wGDBL7GaNWJ@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=bigeasy@linutronix.de \
    --cc=cgroups@vger.kernel.org \
    --cc=halcien@gmail.com \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeelb@google.com \
    --cc=songmuchun@bytedance.com \
    --cc=vdavydov.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox