linux-mm.kvack.org archive mirror
From: Roman Gushchin <guro@fb.com>
To: Karsten Graul <kgraul@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	David Rientjes <rientjes@google.com>,
	"Christoph Lameter" <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU
Date: Thu, 3 Oct 2019 03:35:46 +0000	[thread overview]
Message-ID: <20191003033540.GA10017@castle.DHCP.thefacebook.com> (raw)
In-Reply-To: <20191002194121.GA9033@castle.DHCP.thefacebook.com>

On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
> Hello, Karsten!
> 
> Thank you for the report!
> 
> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
> > 
> > net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
> > Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
> > calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
> > Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
> > fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> > 
> > I added a 'call graph'; below it are the crash log and a (simple) patch that works for me,
> > but I don't know if this is the correct way to fix it.
> > 
> > (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
> > 
> > 
> > kmem_cache_destroy() 
> >   -> shutdown_memcg_caches()
> >     -> shutdown_cache()
> >       -> __kmem_cache_shutdown()  (slub.c)
> >         -> free_partial()
> >           -> discard_slab()
> > 	    -> free_slab()                                      -- call to __free_slab() is delayed
> > 	      -> call_rcu(rcu_free_slab)
> >     -> memcg_unlink_cache()
> >       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
> >     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
> >     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
> >                                                             so work_fn is not further involved here...
> > ... rcu grace period ...
> > rcu_batch()
> >   ...
> >   -> rcu_free_slab()   (slub.c)
> >     -> __free_slab()
> >       -> uncharge_slab_page()
> >         -> memcg_uncharge_slab()
> > 	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
> > 	  -> mem_cgroup_lruvec(memcg)
> > 	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
> > 	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
> > 	    -> lruvec->pgdat = pgdat;                           -- *crash*
> > 
> > The crash log:
> 
> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
> called from the rcu path races with kmem_cache_destroy(), which is supposed
> to be called when there are no outstanding allocations (and corresponding pages).
> Any charged slab page actually holds a reference to the kmem_cache, which prevents
> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
> it.
> 
> If my thoughts are correct, the commit you've mentioned didn't introduce this
> issue; it just made it easier to reproduce.
> 
> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
> (it's just luck that the crash happens there), and it seems incorrect not to
> decrease the slab statistics of the original memory cgroup.
> 
> What we probably need to do instead is to extend flush_memcg_workqueue() to
> wait for all outstanding rcu free callbacks. I have to think a bit about the
> best way to fix it. How easy is it to reproduce the problem?

After a second thought, flush_memcg_workqueue() already contains
a rcu_barrier() call, so now my first suspicion is that the last free() call
occurs after the kmem_cache_destroy() call. Can you please check whether that's
the case?

Thanks!

> 
> > 
> > [  349.361168] Unable to handle kernel pointer dereference in virtual kernel address space
> 
> Btw, haven't you noticed anything suspicious in dmesg before this line?
> 
> Thank you!
> 
> Roman
> 
> > [  349.361210] Failing address: 0000000000000000 TEID: 0000000000000483
> > [  349.361223] Fault in home space mode while using kernel ASCE.
> > [  349.361240] AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> > [  349.361340] Oops: 0004 ilc:3 [#1] PREEMPT SMP
> > [  349.361349] Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
> > [  349.361436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> > [  349.361445] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> > [  349.361450] Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> > [  349.361464]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> > [  349.361470] Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> > [  349.361475]            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> > [  349.361481]            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> > [  349.361486]            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> > [  349.361500] Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
> > [  349.361500]            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
> > [  349.361500]           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
> > [  349.361500]           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
> > [  349.361500]            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
> > [  349.361500]            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
> > [  349.361500]            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
> > [  349.361500]            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
> > [  349.361619] Call Trace:
> > [  349.361627] ([<00000000003cabcc>] __free_slab+0x49c/0x6b0)
> > [  349.361634]  [<00000000001f5886>] rcu_core+0x5a6/0x7e0
> > [  349.361643]  [<0000000000ca2dea>] __do_softirq+0xf2/0x5c0
> > [  349.361652]  [<0000000000152644>] irq_exit+0x104/0x130
> > [  349.361659]  [<000000000010d222>] do_IRQ+0x9a/0xf0
> > [  349.361667]  [<0000000000ca2344>] ext_int_handler+0x130/0x134
> > [  349.361674]  [<0000000000103648>] enabled_wait+0x58/0x128
> > [  349.361681] ([<0000000000103634>] enabled_wait+0x44/0x128)
> > [  349.361688]  [<0000000000103b00>] arch_cpu_idle+0x40/0x58
> > [  349.361695]  [<0000000000ca0544>] default_idle_call+0x3c/0x68
> > [  349.361704]  [<000000000018eaa4>] do_idle+0xec/0x1c0
> > [  349.361748]  [<000000000018ee0e>] cpu_startup_entry+0x36/0x40
> > [  349.361756]  [<000000000122df34>] arch_call_rest_init+0x5c/0x88
> > [  349.361761]  [<0000000000000000>] 0x0
> > [  349.361765] INFO: lockdep is turned off.
> > [  349.361769] Last Breaking-Event-Address:
> > [  349.361774]  [<00000000003ca8f4>] __free_slab+0x1c4/0x6b0
> > [  349.361781] Kernel panic - not syncing: Fatal exception in interrupt
> > 
> > 
> > A fix that works for me (RFC):
> > 
> > diff --git a/mm/slab.h b/mm/slab.h
> > index a62372d0f271..b19a3f940338 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> > 
> >         rcu_read_lock();
> >         memcg = READ_ONCE(s->memcg_params.memcg);
> > -       if (likely(!mem_cgroup_is_root(memcg))) {
> > +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
> >                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> >                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
> >                 memcg_kmem_uncharge_memcg(page, order, memcg);
> > 
> > -- 
> > Karsten
> > 
> > 


