From: Roman Gushchin <guro@fb.com>
To: Karsten Graul <kgraul@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
David Rientjes <rientjes@google.com>,
",Christoph Lameter" <cl@linux.com>,
Pekka Enberg <penberg@kernel.org>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Andrew Morton <akpm@linux-foundation.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU
Date: Thu, 3 Oct 2019 17:34:26 +0000 [thread overview]
Message-ID: <20191003173421.GA30875@castle.DHCP.thefacebook.com> (raw)
In-Reply-To: <e3786b32-d6b3-98f8-6d8f-b6db08725a7d@linux.ibm.com>
On Thu, Oct 03, 2019 at 06:24:27PM +0200, Karsten Graul wrote:
> On 03/10/2019 18:11, Roman Gushchin wrote:
> > On Thu, Oct 03, 2019 at 10:48:15AM +0200, Karsten Graul wrote:
> >> Hello, Roman!
> >>
> >> On 03/10/2019 05:35, Roman Gushchin wrote:
> >>> On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
> >>>> Hello, Karsten!
> >>>>
> >>>> Thank you for the report!
> >>>>
> >>>> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
> >>>>>
> >>>>> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
> >>>>> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which
> >>>>> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
> >>>>> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
> >>>>> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> >>>>>
> >>>>> I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
> >>>>> but I don't know if this is the correct way to fix it.
> >>>>>
> >>>>> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
> >>>>>
> >>>>>
> >>>>> kmem_cache_destroy()
> >>>>> -> shutdown_memcg_caches()
> >>>>> -> shutdown_cache()
> >>>>> -> __kmem_cache_shutdown() (slub.c)
> >>>>> -> free_partial()
> >>>>> -> discard_slab()
> >>>>> -> free_slab() -- call to __free_slab() is delayed
> >>>>> -> call_rcu(rcu_free_slab)
> >>>>> -> memcg_unlink_cache()
> >>>>> -> WRITE_ONCE(s->memcg_params.memcg, NULL); -- !!!
> >>>>> -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
> >>>>> -> schedule_work(&slab_caches_to_rcu_destroy_work); -> work_fn uses rcu_barrier() to wait for rcu_batch,
> >>>>> so work_fn is not further involved here...
> >>>>> ... rcu grace period ...
> >>>>> rcu_batch()
> >>>>> ...
> >>>>> -> rcu_free_slab() (slub.c)
> >>>>> -> __free_slab()
> >>>>> -> uncharge_slab_page()
> >>>>> -> memcg_uncharge_slab()
> >>>>> -> memcg = READ_ONCE(s->memcg_params.memcg); -- !!! memcg == NULL
> >>>>> -> mem_cgroup_lruvec(memcg)
> >>>>> -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
> >>>>> -> lruvec = &mz->lruvec; -- lruvec == NULL
> >>>>> -> lruvec->pgdat = pgdat; -- *crash*
> >>>>>
> >>>>> The crash log:
> >>>>
> >>>> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
> >>>> called from the rcu path races with kmem_cache_destroy(), which is supposed
> >>>> to be called when there are no outstanding allocations (and corresponding pages).
> >>>> Any charged slab page actually holds a reference to the kmem_cache, which prevents
> >>>> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
> >>>> it.
> >>
> >> I don't see a race between kmem_cache_destroy() and __fre_slab(). kmem_cache_destroy()
> >> is already done when __free_slab() is called. Maybe the below trace shows you the order of calls on my
> >> system: kmem_cache_destroy() unlinks the memcg caches, sets up the rcu callbacks, then
> >> it starts the slab_caches_to_rcu_destroy_workfn() worker and then kmem_cache_destroy() is done.
> >
> > Right, and this is the problem. The question when call_rcu() in free_slab() has
> > been called: if it did happen before the kmem_cache_destroy(), it's clearly
> > a bug inside the slab allocator. Otherwise it's probably an incorrect API
> > invocation.
> >
> >> You see that the smc code is getting control again after that.
> >> The worker starts in between (before the smc_exit trace), but keeps waiting on the rcu_barrier().
> >> Ages later (see time difference) the rcu grace period ends and rcu_free_slab() is called, and it
> >> crashes.
> >> I hope that helps!
> >>
> >> [ 145.539917] kmem_cache_destroy before shutdown_memcg_caches() for 0000000068106f00
> >> [ 145.539929] free_slab call_rcu() for 00000000392c2e00, page is 000003d080e45000
> >> [ 145.539961] memcg_unlink_cache clearing memcg for 00000000392c2e00
> >> [ 145.539970] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2e00
> >
> > Does it mean that call_rcu() has been called after kmem_cache_destroy()?
> > In this case, do you know who called it?
> >
>
> Let me show the 'call graph' again, call_rcu() is called by free_slab()
> as part of kmem_cache_destroy(), and just before memcg_unlink_cache() clears
> the memcg reference.
>
> kmem_cache_destroy()
> -> shutdown_memcg_caches()
> -> shutdown_cache()
> -> __kmem_cache_shutdown() (slub.c)
> -> free_partial()
> -> discard_slab()
> -> free_slab() -- call to __free_slab() is delayed
> -> call_rcu(rcu_free_slab)
> -> memcg_unlink_cache()
> -> WRITE_ONCE(s->memcg_params.memcg, NULL); -- !!!
Ah, got it, thank you!
Then something like this should work. Can you, please, confirm that is solves
the problem?
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 807490fe217a..d916e986f094 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -180,8 +180,11 @@ static void destroy_memcg_params(struct kmem_cache *s)
{
if (is_root_cache(s))
kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
- else
+ else {
+ mem_cgroup_put(s->memcg_params.memcg);
+ WRITE_ONCE(s->memcg_params.memcg, NULL);
percpu_ref_exit(&s->memcg_params.refcnt);
+ }
}
static void free_memcg_params(struct rcu_head *rcu)
@@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct kmem_cache *s)
} else {
list_del(&s->memcg_params.children_node);
list_del(&s->memcg_params.kmem_caches_node);
- mem_cgroup_put(s->memcg_params.memcg);
- WRITE_ONCE(s->memcg_params.memcg, NULL);
}
}
#else
--
Thank you!
Roman
>
>
> > I'd add an atomic flag to the root kmem_cache, set it at the beginning of the
> > kmem_cache_destroy() and check it in free_slab(). If set, dump the stacktrace.
> > Just please make sure you're looking at the root kmem_cache flag, not the memcg
> > one.
> >
> > Thank you!
> >
> > Roman
> >
> >>
> >> [ 145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
> >> [ 145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
> >> [ 145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900
> >>
> >> [ 145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00
> >>
> >> [ 145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
> >> [ 145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
> >> [ 145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00
> >>
> >> [ 145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
> >> kmem_cache_destroy is done
> >> [ 145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
> >> slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
> >> [ 145.540619] smc.0698ae: smc_exit before smc_pnet_exit
> >> smc module exit code gets back control ...
> >> [ 145.540699] smc.616283: smc_exit before unregister_pernet_subsys
> >> [ 145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
> >> much later the rcu callbacks are invoked, and will crash
> >>
> >>>>
> >>>> If my thoughts are correct, the commit you've mentioned didn't introduced this
> >>>> issue, it just made it easier to reproduce.
> >>>>
> >>>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
> >>>> (it's just a luck that it crashes on it), and it seems incorrect to not decrease
> >>>> the slab statistics of the original memory cgroup.
> >>
> >> I was quite sure that my approach is way to simple, it's better when the mm experts
> >> work on that.
> >>
> >>>>
> >>>> What we probably need to do instead is to extend flush_memcg_workqueue() to
> >>>> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
> >>>> way to fix it. How easy is to reproduce the problem?
> >>
> >> I can reproduce this at will and I am happy to test any fixes you provide.
> >>
> >>>
> >>> After a second thought, flush_memcg_workqueue() already contains
> >>> a rcu_barrier() call, so now first suspicion is that the last free() call
> >>> occurs after the kmem_cache_destroy() call. Can you, please, check if it's not
> >>> a case?
> >>>
> >>
> >> In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and after
> >> that shutdown_memcg_caches() is called which setup the rcu callbacks.
> >
> > These are callbacks to destroy kmem_caches, not pages.
> >
> >> So flush_memcg_workqueue() can not wait for them. If you follow the 'call graph' above
> >> using the RCU path in slub.c you can see when the callbacks are set up and why no warning
> >> is printed.
> >>
> >>
> >> Second thought after I wrote all of the above: when flush_memcg_workqueue() already contains
> >> an rcu_barrier(), whats the point of delaying the slab freeing in the rcu case? All rcu
> >> readers should be done now, so the rcu callbacks and the worker are not needed?
> >> What am I missing here (I am sure I miss something, I am completely new in the mm area)?
> >>
> >>> Thanks!
> >>>
> >>>>
> >>>>>
> >>>>> 349.361168¨ Unable to handle kernel pointer dereference in virtual kernel address space
> >>>>
> >>>> Btw, haven't you noticed anything suspicious in dmesg before this line?
> >>
> >> There is no error or warning line in dmesg before this line. Actually, I think that
> >> all pages are no longer in use so no warning is printed. Anyway, the slab freeing is
> >> delayed in any case when RCU is in use, right?
> >>
> >>
> >> Karsten
> >>
> >>>>
> >>>> Thank you!
> >>>>
> >>>> Roman
> >>>>
> >>>>> 349.361210¨ Failing address: 0000000000000000 TEID: 0000000000000483
> >>>>> 349.361223¨ Fault in home space mode while using kernel ASCE.
> >>>>> 349.361240¨ AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> >>>>> 349.361340¨ Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
> >>>>> 349.361349¨ Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
> >>>>> 349.361436¨ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> >>>>> 349.361445¨ Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> >>>>> 349.361450¨ Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> >>>>> 349.361464¨ R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> >>>>> 349.361470¨ Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> >>>>> 349.361475¨ 0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> >>>>> 349.361481¨ 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> >>>>> 349.361481¨ 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> >>>>> 349.361486¨ 000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> >>>>> 349.361500¨ Krnl Code: 00000000003cada6: e310a1400004 lg %r1,320(%r10)
> >>>>> 349.361500¨ 00000000003cadac: c0e50046c286 brasl %r14,ca32b8
> >>>>> 349.361500¨ #00000000003cadb2: a7f4fe36 brc 15,3caa1e
> >>>>> 349.361500¨ >00000000003cadb6: e32060800024 stg %r2,128(%r6)
> >>>>> 349.361500¨ 00000000003cadbc: a7f4fd9e brc 15,3ca8f8
> >>>>> 349.361500¨ 00000000003cadc0: c0e50046790c brasl %r14,c99fd8
> >>>>> 349.361500¨ 00000000003cadc6: a7f4fe2c brc 15,3caa
> >>>>> 349.361500¨ 00000000003cadc6: a7f4fe2c brc 15,3caa1e
> >>>>> 349.361500¨ 00000000003cadca: ecb1ffff00d9 aghik %r11,%r1,-1
> >>>>> 349.361619¨ Call Trace:
> >>>>> 349.361627¨ (Ý<00000000003cabcc>¨ __free_slab+0x49c/0x6b0)
> >>>>> 349.361634¨ Ý<00000000001f5886>¨ rcu_core+0x5a6/0x7e0
> >>>>> 349.361643¨ Ý<0000000000ca2dea>¨ __do_softirq+0xf2/0x5c0
> >>>>> 349.361652¨ Ý<0000000000152644>¨ irq_exit+0x104/0x130
> >>>>> 349.361659¨ Ý<000000000010d222>¨ do_IRQ+0x9a/0xf0
> >>>>> 349.361667¨ Ý<0000000000ca2344>¨ ext_int_handler+0x130/0x134
> >>>>> 349.361674¨ Ý<0000000000103648>¨ enabled_wait+0x58/0x128
> >>>>> 349.361681¨ (Ý<0000000000103634>¨ enabled_wait+0x44/0x128)
> >>>>> 349.361688¨ Ý<0000000000103b00>¨ arch_cpu_idle+0x40/0x58
> >>>>> 349.361695¨ Ý<0000000000ca0544>¨ default_idle_call+0x3c/0x68
> >>>>> 349.361704¨ Ý<000000000018eaa4>¨ do_idle+0xec/0x1c0
> >>>>> 349.361748¨ Ý<000000000018ee0e>¨ cpu_startup_entry+0x36/0x40
> >>>>> 349.361756¨ Ý<000000000122df34>¨ arch_call_rest_init+0x5c/0x88
> >>>>> 349.361761¨ Ý<0000000000000000>¨ 0x0
> >>>>> 349.361765¨ INFO: lockdep is turned off.
> >>>>> 349.361769¨ Last Breaking-Event-Address:
> >>>>> 349.361774¨ Ý<00000000003ca8f4>¨ __free_slab+0x1c4/0x6b0
> >>>>> 349.361781¨ Kernel panic - not syncing: Fatal exception in interrupt
> >>>>>
> >>>>>
> >>>>> A fix that works for me (RFC):
> >>>>>
> >>>>> diff --git a/mm/slab.h b/mm/slab.h
> >>>>> index a62372d0f271..b19a3f940338 100644
> >>>>> --- a/mm/slab.h
> >>>>> +++ b/mm/slab.h
> >>>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> >>>>>
> >>>>> rcu_read_lock();
> >>>>> memcg = READ_ONCE(s->memcg_params.memcg);
> >>>>> - if (likely(!mem_cgroup_is_root(memcg))) {
> >>>>> + if (likely(memcg && !mem_cgroup_is_root(memcg))) {
> >>>>> lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> >>>>> mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
> >>>>> memcg_kmem_uncharge_memcg(page, order, memcg);
> >>>>>
> >>>>> --
> >>>>> Karsten
> >>>>>
> >>>>> (I'm a dude)
> >>>>>
> >>>>>
> >>
>
> --
> Karsten
>
> (I'm a dude)
>
next prev parent reply other threads:[~2019-10-03 17:34 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-10-02 14:50 Karsten Graul
2019-10-02 19:41 ` Roman Gushchin
2019-10-03 3:35 ` Roman Gushchin
2019-10-03 8:48 ` Karsten Graul
2019-10-03 16:11 ` Roman Gushchin
2019-10-03 16:24 ` Karsten Graul
2019-10-03 17:34 ` Roman Gushchin [this message]
2019-10-04 8:59 ` Karsten Graul
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191003173421.GA30875@castle.DHCP.thefacebook.com \
--to=guro@fb.com \
--cc=akpm@linux-foundation.org \
--cc=cl@linux.com \
--cc=iamjoonsoo.kim@lge.com \
--cc=kgraul@linux.ibm.com \
--cc=linux-mm@kvack.org \
--cc=penberg@kernel.org \
--cc=rientjes@google.com \
--cc=shakeelb@google.com \
--cc=vdavydov.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox