Subject: Re: BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU
To: Roman Gushchin
Cc: Shakeel Butt, Vladimir Davydov, David Rientjes, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Andrew Morton, linux-mm@kvack.org
References: <4a5108b4-5a2f-f83c-e6a8-5e0c9074ac69@linux.ibm.com> <20191002194121.GA9033@castle.DHCP.thefacebook.com>
 <20191003033540.GA10017@castle.DHCP.thefacebook.com> <20191003161149.GB13950@castle.DHCP.thefacebook.com>
From: Karsten Graul
Organization: IBM Deutschland Research & Development GmbH
Date: Thu, 3 Oct 2019 18:24:27 +0200
In-Reply-To: <20191003161149.GB13950@castle.DHCP.thefacebook.com>

On 03/10/2019 18:11, Roman Gushchin wrote:
> On Thu, Oct 03, 2019 at 10:48:15AM +0200, Karsten Graul wrote:
>> Hello, Roman!
>>
>> On 03/10/2019 05:35, Roman Gushchin wrote:
>>> On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
>>>> Hello, Karsten!
>>>>
>>>> Thank you for the report!
>>>>
>>>> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
>>>>>
>>>>> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
>>>>> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which
>>>>> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
>>>>> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
>>>>> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
>>>>>
>>>>> I added a 'call graph'; below that are the crash log and a (simple) patch that works for me,
>>>>> but I don't know if this is the correct way to fix it.
>>>>>
>>>>> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
>>>>>
>>>>>
>>>>> kmem_cache_destroy()
>>>>>  -> shutdown_memcg_caches()
>>>>>   -> shutdown_cache()
>>>>>    -> __kmem_cache_shutdown() (slub.c)
>>>>>     -> free_partial()
>>>>>      -> discard_slab()
>>>>>       -> free_slab()                       -- call to __free_slab() is delayed
>>>>>        -> call_rcu(rcu_free_slab)
>>>>>   -> memcg_unlink_cache()
>>>>>    -> WRITE_ONCE(s->memcg_params.memcg, NULL);            -- !!!
>>>>>  -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
>>>>>  -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch,
>>>>>                                                          so work_fn is not further involved here...
>>>>> ... rcu grace period ...
>>>>> rcu_batch()
>>>>> ...
>>>>>  -> rcu_free_slab() (slub.c)
>>>>>   -> __free_slab()
>>>>>    -> uncharge_slab_page()
>>>>>     -> memcg_uncharge_slab()
>>>>>      -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
>>>>>      -> mem_cgroup_lruvec(memcg)
>>>>>       -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);  -- mz == NULL
>>>>>       -> lruvec = &mz->lruvec;                             -- lruvec == NULL
>>>>>       -> lruvec->pgdat = pgdat;                            -- *crash*
>>>>>
>>>>> The crash log:
>>>>
>>>> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
>>>> called from the rcu path races with kmem_cache_destroy(), which is supposed
>>>> to be called when there are no outstanding allocations (and corresponding pages).
>>>> Any charged slab page actually holds a reference to the kmem_cache, which prevents
>>>> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
>>>> it.
>>
>> I don't see a race between kmem_cache_destroy() and __free_slab(). kmem_cache_destroy()
>> is already done when __free_slab() is called. Maybe the trace below shows you the order of calls on my
>> system: kmem_cache_destroy() unlinks the memcg caches, sets up the rcu callbacks, then
>> it starts the slab_caches_to_rcu_destroy_workfn() worker, and then kmem_cache_destroy() is done.
>
> Right, and this is the problem. The question is when call_rcu() in free_slab() was
> called: if it happened before kmem_cache_destroy(), it's clearly
> a bug inside the slab allocator. Otherwise it's probably an incorrect API
> invocation.
>
>> You see that the smc code is getting control again after that.
>> The worker starts in between (before the smc_exit trace), but keeps waiting on the rcu_barrier().
>> Ages later (see the time difference) the rcu grace period ends and rcu_free_slab() is called, and it
>> crashes.
>> I hope that helps!
>>
>> [ 145.539917] kmem_cache_destroy before shutdown_memcg_caches() for 0000000068106f00
>> [ 145.539929] free_slab call_rcu() for 00000000392c2e00, page is 000003d080e45000
>> [ 145.539961] memcg_unlink_cache clearing memcg for 00000000392c2e00
>> [ 145.539970] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2e00
>
> Does it mean that call_rcu() has been called after kmem_cache_destroy()?
> In this case, do you know who called it?
>

Let me show the 'call graph' again: call_rcu() is called by free_slab()
as part of kmem_cache_destroy(), just before memcg_unlink_cache() clears
the memcg reference.
kmem_cache_destroy()
 -> shutdown_memcg_caches()
  -> shutdown_cache()
   -> __kmem_cache_shutdown() (slub.c)
    -> free_partial()
     -> discard_slab()
      -> free_slab()                    -- call to __free_slab() is delayed
       -> call_rcu(rcu_free_slab)
  -> memcg_unlink_cache()
   -> WRITE_ONCE(s->memcg_params.memcg, NULL);  -- !!!

> I'd add an atomic flag to the root kmem_cache, set it at the beginning of
> kmem_cache_destroy() and check it in free_slab(). If set, dump the stacktrace.
> Just please make sure you're looking at the root kmem_cache flag, not the memcg
> one.
>
> Thank you!
>
> Roman
>
>>
>> [ 145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
>> [ 145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
>> [ 145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900
>>
>> [ 145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00
>>
>> [ 145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
>> [ 145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
>> [ 145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00
>>
>> [ 145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
>>     kmem_cache_destroy is done
>> [ 145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
>>     slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
>> [ 145.540619] smc.0698ae: smc_exit before smc_pnet_exit
>>     smc module exit code gets back control ...
>> [ 145.540699] smc.616283: smc_exit before unregister_pernet_subsys
>> [ 145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
>>     much later the rcu callbacks are invoked, and will crash
>>
>>>>
>>>> If my thoughts are correct, the commit you've mentioned didn't introduce this
>>>> issue, it just made it easier to reproduce.
>>>>
>>>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
>>>> (it's just luck that it crashes on it), and it seems incorrect to not decrease
>>>> the slab statistics of the original memory cgroup.
>>
>> I was quite sure that my approach is way too simple; it's better when the mm experts
>> work on that.
>>
>>>>
>>>> What we probably need to do instead is to extend flush_memcg_workqueue() to
>>>> wait for all outstanding rcu free callbacks. I have to think a bit about what's the best
>>>> way to fix it. How easy is it to reproduce the problem?
>>
>> I can reproduce this at will and I am happy to test any fixes you provide.
>>
>>>
>>> After a second thought, flush_memcg_workqueue() already contains
>>> a rcu_barrier() call, so now my first suspicion is that the last free() call
>>> occurs after the kmem_cache_destroy() call. Can you, please, check if that's not
>>> the case?
>>>
>>
>> In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and after
>> that shutdown_memcg_caches() is called, which sets up the rcu callbacks.
>
> These are callbacks to destroy kmem_caches, not pages.
>
>> So flush_memcg_workqueue() can not wait for them. If you follow the 'call graph' above
>> along the RCU path in slub.c you can see when the callbacks are set up and why no warning
>> is printed.
>>
>>
>> Second thought after I wrote all of the above: when flush_memcg_workqueue() already contains
>> an rcu_barrier(), what's the point of delaying the slab freeing in the rcu case? All rcu
>> readers should be done by now, so the rcu callbacks and the worker are not needed?
>> What am I missing here (I am sure I miss something, I am completely new in the mm area)?
>>
>>> Thanks!
>>>
>>>>
>>>>>
>>>>> [ 349.361168] Unable to handle kernel pointer dereference in virtual kernel address space
>>>>
>>>> Btw, haven't you noticed anything suspicious in dmesg before this line?
>>
>> There is no error or warning line in dmesg before this line. Actually, I think that
>> all pages are no longer in use, so no warning is printed. Anyway, the slab freeing is
>> delayed in any case when RCU is in use, right?
>>
>>
>> Karsten
>>
>>>>
>>>> Thank you!
>>>>
>>>> Roman
>>>>
>>>>> [ 349.361210] Failing address: 0000000000000000 TEID: 0000000000000483
>>>>> [ 349.361223] Fault in home space mode while using kernel ASCE.
>>>>> [ 349.361240] AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
>>>>> [ 349.361340] Oops: 0004 ilc:3 [#1] PREEMPT SMP
>>>>> [ 349.361349] Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
>>>>> [ 349.361436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
>>>>> [ 349.361445] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
>>>>> [ 349.361450] Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
>>>>> [ 349.361464]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>>>>> [ 349.361470] Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
>>>>> [ 349.361475]            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
>>>>> [ 349.361481]            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>>>> [ 349.361486]            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
>>>>> [ 349.361500] Krnl Code: 00000000003cada6: e310a1400004	lg	%r1,320(%r10)
>>>>> [ 349.361500]            00000000003cadac: c0e50046c286	brasl	%r14,ca32b8
>>>>> [ 349.361500]           #00000000003cadb2: a7f4fe36	brc	15,3caa1e
>>>>>
[ 349.361500]           >00000000003cadb6: e32060800024	stg	%r2,128(%r6)
>>>>> [ 349.361500]            00000000003cadbc: a7f4fd9e	brc	15,3ca8f8
>>>>> [ 349.361500]            00000000003cadc0: c0e50046790c	brasl	%r14,c99fd8
>>>>> [ 349.361500]            00000000003cadc6: a7f4fe2c	brc	15,3caa1e
>>>>> [ 349.361500]            00000000003cadca: ecb1ffff00d9	aghik	%r11,%r1,-1
>>>>> [ 349.361619] Call Trace:
>>>>> [ 349.361627] ([<00000000003cabcc>] __free_slab+0x49c/0x6b0)
>>>>> [ 349.361634]  [<00000000001f5886>] rcu_core+0x5a6/0x7e0
>>>>> [ 349.361643]  [<0000000000ca2dea>] __do_softirq+0xf2/0x5c0
>>>>> [ 349.361652]  [<0000000000152644>] irq_exit+0x104/0x130
>>>>> [ 349.361659]  [<000000000010d222>] do_IRQ+0x9a/0xf0
>>>>> [ 349.361667]  [<0000000000ca2344>] ext_int_handler+0x130/0x134
>>>>> [ 349.361674]  [<0000000000103648>] enabled_wait+0x58/0x128
>>>>> [ 349.361681] ([<0000000000103634>] enabled_wait+0x44/0x128)
>>>>> [ 349.361688]  [<0000000000103b00>] arch_cpu_idle+0x40/0x58
>>>>> [ 349.361695]  [<0000000000ca0544>] default_idle_call+0x3c/0x68
>>>>> [ 349.361704]  [<000000000018eaa4>] do_idle+0xec/0x1c0
>>>>> [ 349.361748]  [<000000000018ee0e>] cpu_startup_entry+0x36/0x40
>>>>> [ 349.361756]  [<000000000122df34>] arch_call_rest_init+0x5c/0x88
>>>>> [ 349.361761]  [<0000000000000000>] 0x0
>>>>> [ 349.361765] INFO: lockdep is turned off.
>>>>> [ 349.361769] Last Breaking-Event-Address:
>>>>> [ 349.361774]  [<00000000003ca8f4>] __free_slab+0x1c4/0x6b0
>>>>> [ 349.361781] Kernel panic - not syncing: Fatal exception in interrupt
>>>>>
>>>>>
>>>>> A fix that works for me (RFC):
>>>>>
>>>>> diff --git a/mm/slab.h b/mm/slab.h
>>>>> index a62372d0f271..b19a3f940338 100644
>>>>> --- a/mm/slab.h
>>>>> +++ b/mm/slab.h
>>>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>>>>>
>>>>>  	rcu_read_lock();
>>>>>  	memcg = READ_ONCE(s->memcg_params.memcg);
>>>>> -	if (likely(!mem_cgroup_is_root(memcg))) {
>>>>> +	if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>>>>>  		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>>>>>  		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>>>>>  		memcg_kmem_uncharge_memcg(page, order, memcg);
>>>>>
>>>>> --
>>>>> Karsten
>>>>>
>>>>> (I'm a dude)
>>>>>

-- 
Karsten

(I'm a dude)