From: Johannes Weiner <hannes@cmpxchg.org>
To: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alexander Fedorov <halcien@gmail.com>,
Michal Hocko <mhocko@suse.com>,
Shakeel Butt <shakeelb@google.com>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Muchun Song <songmuchun@bytedance.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Possible race in obj_stock_flush_required() vs drain_obj_stock()
Date: Wed, 12 Oct 2022 15:18:25 -0400
Message-ID: <Y0cTAdntxrn8zFbX@cmpxchg.org>
In-Reply-To: <Y0cMMPwE4aus3P9c@P9FQF9L96D.corp.robot.car>

On Wed, Oct 12, 2022 at 11:49:20AM -0700, Roman Gushchin wrote:
> On Wed, Oct 12, 2022 at 01:23:11PM -0400, Johannes Weiner wrote:
> > On Tue, Oct 04, 2022 at 09:18:26AM -0700, Roman Gushchin wrote:
> > > On Mon, Oct 03, 2022 at 06:01:35PM +0300, Alexander Fedorov wrote:
> > > > On 03.10.2022 17:27, Michal Hocko wrote:
> > > > > On Mon 03-10-22 17:09:15, Alexander Fedorov wrote:
> > > > >> On 03.10.2022 16:32, Michal Hocko wrote:
> > > > >>> On Mon 03-10-22 15:47:10, Alexander Fedorov wrote:
> > > > >>>> @@ -3197,17 +3197,30 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> > > > >>>> stock->nr_bytes = 0;
> > > > >>>> }
> > > > >>>>
> > > > >>>> - obj_cgroup_put(old);
> > > > >>>> + /*
> > > > >>>> + * Clear pointer before freeing memory so that
> > > > >>>> + * drain_all_stock() -> obj_stock_flush_required()
> > > > >>>> + * does not see a freed pointer.
> > > > >>>> + */
> > > > >>>> stock->cached_objcg = NULL;
> > > > >>>> + obj_cgroup_put(old);
> > > > >>>
> > > > > >>> Do we need barrier() or something else to ensure there is no reordering?
> > > > > >>> I am not really sure what kind of barriers are implied by the pcp ref
> > > > > >>> counting.
> > > > >>
> > > > >> obj_cgroup_put() -> kfree_rcu() -> synchronize_rcu() should take care
> > > > >> of this:
> > > > >
> > > > > This is a very subtle guarantee. Also it would only apply if this is the
> > > > > last reference, right?
> > > >
> > > > Hmm, yes, for the last reference only, also not sure about pcp ref
> > > > counter ordering rules for previous references.
> > > >
> > > > > Is there any reason to not use
> > > > > WRITE_ONCE(stock->cached_objcg, NULL);
> > > > > obj_cgroup_put(old);
> > > > >
> > > > > IIRC this should prevent any reordering.
> > > >
> > > > Now that I think about it we actually must use WRITE_ONCE everywhere
> > > > when writing cached_objcg because otherwise compiler might split the
> > > > pointer-sized store into several smaller-sized ones (store tearing),
> > > > and obj_stock_flush_required() would read garbage instead of pointer.
> > > >
> > > > And thinking about memory barriers, maybe we need them too alongside
> > > > WRITE_ONCE when setting pointer to non-null value? Otherwise
> > > > drain_all_stock() -> obj_stock_flush_required() might read old data.
> > > > Since that's exactly what rcu_assign_pointer() does, it seems
> > > > that we are going back to using rcu_*() primitives everywhere?
> > >
> > > Hm, Idk, I'm still somewhat resistant to the idea of putting rcu primitives,
> > > but maybe it's the right thing. Maybe instead we should always schedule draining
> > > on all cpus instead and perform a cpu-local check and bail out if a flush is not
> > > required? Michal, Johannes, what do you think?
> >
> > I agree it's overkill.
> >
> > This is a speculative check, and we don't need any state coherency,
> > just basic lifetime. READ_ONCE should fully address this problem. That
> > said, I think the code could be a bit clearer and better documented.
> >
> > How about the below?
>
> I'm fine with using READ_ONCE() to fix this immediate issue (I suggested it
> in the thread above), please feel free to add my ack:
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> .

Thanks!

> We might need a barrier() between zeroing stock->cached and dropping the last
> reference, as discussed above, however I don't think this issue can be
> realistically triggered in real life.

Hm, plus the store tearing.

We can do WRITE_ONCE() just for ->cached and ->cached_objcg. That will
take care of both: store tearing, as well as the compile-time order
with the RCU free call. RCU will then handle the SMP effects.

I still prefer it over rcuifying the pointers completely just for that
one (questionable) optimization.
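
For illustration only, here is a rough userspace sketch of the two
sides. It is not the kernel code: READ_ONCE()/WRITE_ONCE() are
approximated with volatile casts (assuming GCC/Clang __typeof__), and
struct objcg, struct stock, objcg_put(), drain() and flush_required()
are made-up stand-ins. The memcg_id comparison stands in for the
mem_cgroup_is_descendant() check, and the RCU-deferred free is not
modeled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace approximations of the kernel macros (assumption: GCC/Clang). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct objcg {
	int refcnt;
	int memcg_id;
};

struct stock {
	struct objcg *cached_objcg;
	unsigned int nr_bytes;
};

/* Drop one reference. The kernel would free via RCU once it hits zero. */
static void objcg_put(struct objcg *o)
{
	assert(o->refcnt > 0);
	o->refcnt--;
}

/* Writer side: unpublish the pointer first, then drop the reference. */
static void drain(struct stock *s)
{
	struct objcg *old = s->cached_objcg;

	if (!old)
		return;
	s->nr_bytes = 0;
	WRITE_ONCE(s->cached_objcg, NULL);
	objcg_put(old);
}

/* Reader side: load the pointer once, then test and use that same copy. */
static bool flush_required(struct stock *s, int root_id)
{
	struct objcg *o = READ_ONCE(s->cached_objcg);

	return o && s->nr_bytes && o->memcg_id == root_id;
}
```

The point of the local copy in flush_required() is that the NULL test
and the dereference can no longer observe two different values.
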
Updated patch below.
> However I think our overall approach to flushing is questionable:
> 1) we often don't flush when it's necessary: if there is a concurrent flushing
> we just bail out, even if that flushing is related to a completely different
> part of the cgroup tree (e.g. a leaf node belonging to a distant branch).

Right.

> 2) we can race and flush when it's not necessary: if another cpu is busy,
> then by the time the work is executed, another memcg will likely already be
> cached. So IMO we need to move this check into the flushing thread.

We might just be able to remove all the speculative
checks. drain_all_stock() is slowpath after all...

> I'm working on a different approach, but it will take time and also likely be
> too invasive for @stable, so fixing the crash discovered by Alexander with
> READ_ONCE() is a good idea.

Sounds good, I'm looking forward to those changes.

---
From c9b940db5f75160b5e80c4ae83ea760ad29e8ef9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 12 Oct 2022 12:59:07 -0400
Subject: [PATCH] mm: memcontrol: fix NULL deref race condition during cgroup
 deletion

Alexander Fedorov reports a race condition between two concurrent
stock draining operations, where the first one clears the stock's obj
pointer between the pointer test and deref of the second. Analysis:

1) First CPU:
   css_killed_work_fn() -> mem_cgroup_css_offline() ->
   drain_all_stock() -> obj_stock_flush_required()
    if (stock->cached_objcg) {

This check sees a non-NULL pointer for *another* CPU's `memcg_stock` instance.

2) Second CPU:
   css_free_rwork_fn() -> __mem_cgroup_free() -> free_percpu() ->
   obj_cgroup_uncharge() -> drain_obj_stock()

It frees the `cached_objcg` pointer in its own `memcg_stock` instance:
    struct obj_cgroup *old = stock->cached_objcg;
    < ... >
    obj_cgroup_put(old);
    stock->cached_objcg = NULL;

3) First CPU continues after the 'if' check and re-reads the pointer.
It is now NULL, and dereferencing it leads to a kernel panic:
    static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
                                         struct mem_cgroup *root_memcg)
    {
    < ... >
        if (stock->cached_objcg) {
            memcg = obj_cgroup_memcg(stock->cached_objcg);

There is already RCU protection in place to ensure lifetime. Add the
missing READ_ONCE to the cgroup pointers to fix the TOCTOU, and
consolidate and document the speculative code.

Reported-by: Alexander Fedorov <halcien@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol.c | 54 +++++++++++++++++++++++--------------------------
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..4357dadae95d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2190,8 +2190,6 @@ static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
 static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg);
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
 
 #else
@@ -2199,11 +2197,6 @@ static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	return NULL;
 }
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg)
-{
-	return false;
-}
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
 {
 }
@@ -2259,8 +2252,8 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 		stock->nr_pages = 0;
 	}
 
+	WRITE_ONCE(stock->cached, NULL);
 	css_put(&old->css);
-	stock->cached = NULL;
 }
 
 static void drain_local_stock(struct work_struct *dummy)
@@ -2298,7 +2291,7 @@ static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
 		css_get(&memcg->css);
-		stock->cached = memcg;
+		WRITE_ONCE(stock->cached, memcg);
 	}
 	stock->nr_pages += nr_pages;
 
@@ -2339,13 +2332,30 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		struct mem_cgroup *memcg;
 		bool flush = false;
 
+		/*
+		 * Speculatively check up front if this CPU has any
+		 * cached charges that belong to the specified
+		 * root_memcg. The state may change from under us -
+		 * which is okay, because the draining itself is a
+		 * best-effort operation. Just ensure lifetime of
+		 * whatever we end up looking at.
+		 */
 		rcu_read_lock();
-		memcg = stock->cached;
+		memcg = READ_ONCE(stock->cached);
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
-		else if (obj_stock_flush_required(stock, root_memcg))
-			flush = true;
+#ifdef CONFIG_MEMCG_KMEM
+		else {
+			struct obj_cgroup *objcg;
+
+			objcg = READ_ONCE(stock->cached_objcg);
+			if (objcg && stock->nr_bytes &&
+			    mem_cgroup_is_descendant(obj_cgroup_memcg(objcg),
+						     root_memcg))
+				flush = true;
+		}
+#endif
 		rcu_read_unlock();
 
 		if (flush &&
@@ -3170,7 +3180,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
-		stock->cached_objcg = objcg;
+		WRITE_ONCE(stock->cached_objcg, objcg);
 		stock->cached_pgdat = pgdat;
 	} else if (stock->cached_pgdat != pgdat) {
 		/* Flush the existing cached vmstat data */
@@ -3289,7 +3299,7 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->cached_pgdat = NULL;
 	}
 
-	stock->cached_objcg = NULL;
+	WRITE_ONCE(stock->cached_objcg, NULL);
 	/*
 	 * The `old' objects needs to be released by the caller via
 	 * obj_cgroup_put() outside of memcg_stock_pcp::stock_lock.
@@ -3297,20 +3307,6 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 	return old;
 }
 
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg)
-{
-	struct mem_cgroup *memcg;
-
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
-		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
-			return true;
-	}
-
-	return false;
-}
-
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
@@ -3325,7 +3321,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		old = drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
-		stock->cached_objcg = objcg;
+		WRITE_ONCE(stock->cached_objcg, objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
 		allow_uncharge = true;	/* Allow uncharge when objcg changes */
--
2.37.3