From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=POAe=X5=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-5.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1
	autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 3EC7AC4360C
	for <linux-mm@archiver.kernel.org>; Fri,  4 Oct 2019 08:59:19 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id A1D852070B
	for <linux-mm@archiver.kernel.org>; Fri,  4 Oct 2019 08:59:18 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A1D852070B
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.ibm.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id DD28B8E0003; Fri,  4 Oct 2019 04:59:17 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id D83A26B0005; Fri,  4 Oct 2019 04:59:17 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id C73718E0003; Fri,  4 Oct 2019 04:59:17 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0237.hostedemail.com [216.40.44.237])
	by kanga.kvack.org (Postfix) with ESMTP id 9D9C56B0003
	for <linux-mm@kvack.org>; Fri,  4 Oct 2019 04:59:17 -0400 (EDT)
Received: from smtpin27.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with SMTP id 42243BF06
	for <linux-mm@kvack.org>; Fri,  4 Oct 2019 08:59:17 +0000 (UTC)
X-FDA: 76005503154.27.pigs44_50fa358bf8a37
X-HE-Tag: pigs44_50fa358bf8a37
X-Filterd-Recvd-Size: 14451
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1])
	by imf36.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Fri,  4 Oct 2019 08:59:16 +0000 (UTC)
Received: from pps.filterd (m0098394.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x948bXAj123019
	for <linux-mm@kvack.org>; Fri, 4 Oct 2019 04:59:14 -0400
Received: from e06smtp02.uk.ibm.com (e06smtp02.uk.ibm.com [195.75.94.98])
	by mx0a-001b2d01.pphosted.com with ESMTP id 2ve1b5m2u8-1
	(version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT)
	for <linux-mm@kvack.org>; Fri, 04 Oct 2019 04:59:14 -0400
Received: from localhost
	by e06smtp02.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
	for <linux-mm@kvack.org> from <kgraul@linux.ibm.com>;
	Fri, 4 Oct 2019 09:59:12 +0100
Received: from b06avi18626390.portsmouth.uk.ibm.com (9.149.26.192)
	by e06smtp02.uk.ibm.com (192.168.101.132) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted;
	(version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256)
	Fri, 4 Oct 2019 09:59:07 +0100
Received: from d06av26.portsmouth.uk.ibm.com (d06av26.portsmouth.uk.ibm.com [9.149.105.62])
	by b06avi18626390.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id x948wbsl37421446
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Fri, 4 Oct 2019 08:58:37 GMT
Received: from d06av26.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 88EA5AE045;
	Fri,  4 Oct 2019 08:59:06 +0000 (GMT)
Received: from d06av26.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 94B51AE059;
	Fri,  4 Oct 2019 08:59:05 +0000 (GMT)
Received: from [9.145.18.69] (unknown [9.145.18.69])
	by d06av26.portsmouth.uk.ibm.com (Postfix) with ESMTP;
	Fri,  4 Oct 2019 08:59:05 +0000 (GMT)
Subject: Re: BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU
To: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>,
        Vladimir Davydov <vdavydov.dev@gmail.com>,
        David Rientjes <rientjes@google.com>,
        "Christoph Lameter" <cl@linux.com>, Pekka Enberg <penberg@kernel.org>,
        Joonsoo Kim <iamjoonsoo.kim@lge.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        "linux-mm@kvack.org" <linux-mm@kvack.org>
References: <4a5108b4-5a2f-f83c-e6a8-5e0c9074ac69@linux.ibm.com>
 <20191002194121.GA9033@castle.DHCP.thefacebook.com>
 <20191003033540.GA10017@castle.DHCP.thefacebook.com>
 <da3c67e7-781c-e145-5c6e-c9f3ed4e57fb@linux.ibm.com>
 <20191003161149.GB13950@castle.DHCP.thefacebook.com>
 <e3786b32-d6b3-98f8-6d8f-b6db08725a7d@linux.ibm.com>
 <20191003173421.GA30875@castle.DHCP.thefacebook.com>
From: Karsten Graul <kgraul@linux.ibm.com>
Organization: IBM Deutschland Research & Development GmbH
Date: Fri, 4 Oct 2019 10:59:06 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101
 Thunderbird/60.9.0
MIME-Version: 1.0
In-Reply-To: <20191003173421.GA30875@castle.DHCP.thefacebook.com>
Content-Type: text/plain; charset=utf-8
Content-Language: de-DE
X-TM-AS-GCONF: 00
x-cbid: 19100408-0008-0000-0000-0000031DEEA6
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 19100408-0009-0000-0000-00004A3CF92E
Message-Id: <3bab4ca6-626a-e110-1f38-3b134d492590@linux.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2019-10-04_05:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501
 malwarescore=0 suspectscore=2 phishscore=0 bulkscore=0 spamscore=0
 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0
 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.0.1-1908290000 definitions=main-1910040081
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>


>> Let me show the 'call graph' again, call_rcu() is called by free_slab()
>> as part of kmem_cache_destroy(), and just before memcg_unlink_cache() clears
>> the memcg reference.
>>
>> kmem_cache_destroy()
>>   -> shutdown_memcg_caches()
>>     -> shutdown_cache()
>>       -> __kmem_cache_shutdown()  (slub.c)
>>         -> free_partial()
>>           -> discard_slab()
>>             -> free_slab()                              -- call to __free_slab() is delayed
>>               -> call_rcu(rcu_free_slab)
>>     -> memcg_unlink_cache()
>>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
>
> Ah, got it, thank you!
>
> Then something like this should work. Can you please confirm that it
> solves the problem?
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 807490fe217a..d916e986f094 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -180,8 +180,11 @@ static void destroy_memcg_params(struct kmem_cache *s)
>  {
>         if (is_root_cache(s))
>                 kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
> -       else
> +       else {
> +               mem_cgroup_put(s->memcg_params.memcg);
> +               WRITE_ONCE(s->memcg_params.memcg, NULL);
>                 percpu_ref_exit(&s->memcg_params.refcnt);
> +       }
>  }
>  
>  static void free_memcg_params(struct rcu_head *rcu)
> @@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>         } else {
>                 list_del(&s->memcg_params.children_node);
>                 list_del(&s->memcg_params.kmem_caches_node);
> -               mem_cgroup_put(s->memcg_params.memcg);
> -               WRITE_ONCE(s->memcg_params.memcg, NULL);
>         }
>  }
>  #else
>

I tested this fix and can confirm that it solved the problem!
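
For anyone following the thread later, here is a minimal userspace sketch of
the ordering problem and of why moving the pointer reset helps. It is only a
model under my own assumptions: all model_* names are hypothetical and are not
kernel APIs, the "deferred" step merely stands in for the RCU callback
(rcu_free_slab -> __free_slab -> memcg_uncharge_slab), and the patched order
assumes that destroy_memcg_params() runs only after the pending callbacks
have finished.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_memcg {
	long slab_pages;		/* stands in for the memcg slab statistics */
};

struct model_cache {
	struct model_memcg *memcg;	/* stands in for s->memcg_params.memcg */
	int pending;			/* 1 while a deferred free is queued */
	int order;			/* page order of the queued slab */
};

/* Stands in for free_slab() queuing rcu_free_slab() via call_rcu(). */
static void model_queue_free(struct model_cache *s, int order)
{
	assert(s != NULL);
	s->pending = 1;
	s->order = order;
}

/*
 * Stands in for the deferred __free_slab() -> memcg_uncharge_slab().
 * Returns 0 on success and -1 where the real code would have
 * dereferenced a NULL memcg pointer.
 */
static int model_run_deferred(struct model_cache *s)
{
	if (!s->pending)
		return 0;
	s->pending = 0;
	if (s->memcg == NULL)
		return -1;
	s->memcg->slab_pages -= 1L << s->order;
	return 0;
}

/* Old order: the unlink step clears the pointer before the callback runs. */
static int model_destroy_old(struct model_cache *s)
{
	model_queue_free(s, 0);
	s->memcg = NULL;		/* done in memcg_unlink_cache() */
	return model_run_deferred(s);	/* callback fires later */
}

/* Patched order: the callback runs first, the pointer is cleared last. */
static int model_destroy_patched(struct model_cache *s)
{
	int ret;

	model_queue_free(s, 0);
	ret = model_run_deferred(s);	/* callback still sees a valid memcg */
	s->memcg = NULL;		/* now done in destroy_memcg_params() */
	return ret;
}
```

In this model, model_destroy_old() reports the failure and leaves the
statistics untouched, while model_destroy_patched() uncharges the page and
only then drops the pointer. That also addresses the earlier objection to my
RFC, which skipped the statistics update for the original memcg.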

>
> --
>
> Thank you!
>
> Roman
>
>
>>
>>
>>> I'd add an atomic flag to the root kmem_cache, set it at the beginning of
>>> kmem_cache_destroy() and check it in free_slab(). If set, dump the stacktrace.
>>> Just please make sure you're looking at the root kmem_cache flag, not the memcg
>>> one.
>>>
>>> Thank you!
>>>
>>> Roman
>>>
>>>>
>>>> [  145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
>>>> [  145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
>>>> [  145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900
>>>>
>>>> [  145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00
>>>>
>>>> [  145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
>>>> [  145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
>>>> [  145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00
>>>>
>>>> [  145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
>>>>    kmem_cache_destroy is done
>>>> [  145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
>>>>    slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
>>>> [  145.540619] smc.0698ae: smc_exit before smc_pnet_exit
>>>>    smc module exit code gets back control ...
>>>> [  145.540699] smc.616283: smc_exit before unregister_pernet_subsys
>>>> [  145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
>>>>    much later the rcu callbacks are invoked, and will crash
>>>>
>>>>>>
>>>>>> If my thoughts are correct, the commit you've mentioned didn't introduce this
>>>>>> issue, it just made it easier to reproduce.
>>>>>>
>>>>>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
>>>>>> (it's just luck that it crashes on it), and it seems incorrect to not decrease
>>>>>> the slab statistics of the original memory cgroup.
>>>>
>>>> I was quite sure that my approach is way too simple, it's better when the mm experts
>>>> work on that.
>>>>
>>>>>>
>>>>>> What we probably need to do instead is to extend flush_memcg_workqueue() to
>>>>>> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
>>>>>> way to fix it. How easy is it to reproduce the problem?
>>>>
>>>> I can reproduce this at will and I am happy to test any fixes you provide.
>>>>
>>>>>
>>>>> After a second thought, flush_memcg_workqueue() already contains
>>>>> an rcu_barrier() call, so now my first suspicion is that the last free() call
>>>>> occurs after the kmem_cache_destroy() call. Can you please check that this is
>>>>> not the case?
>>>>>
>>>>
>>>> In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and after
>>>> that shutdown_memcg_caches() is called, which sets up the rcu callbacks.
>>>
>>> These are callbacks to destroy kmem_caches, not pages.
>>>
>>>> So flush_memcg_workqueue() cannot wait for them. If you follow the 'call graph' above
>>>> using the RCU path in slub.c you can see when the callbacks are set up and why no warning
>>>> is printed.
>>>>
>>>>
>>>> Second thought after I wrote all of the above: when flush_memcg_workqueue() already contains
>>>> an rcu_barrier(), what's the point of delaying the slab freeing in the rcu case? All rcu
>>>> readers should be done now, so the rcu callbacks and the worker are not needed?
>>>> What am I missing here (I am sure I miss something, I am completely new in the mm area)?
>>>>
>>>>> Thanks!
>>>>>
>>>>>>
>>>>>>>
>>>>>>> [  349.361168] Unable to handle kernel pointer dereference in virtual kernel address space
>>>>>>
>>>>>> Btw, haven't you noticed anything suspicious in dmesg before this line?
>>>>
>>>> There is no error or warning line in dmesg before this line. Actually, I think that
>>>> all pages are no longer in use so no warning is printed. Anyway, the slab freeing is
>>>> delayed in any case when RCU is in use, right?
>>>>
>>>>
>>>> Karsten
>>>>
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Roman
>>>>>>
>>>>>>> [  349.361210] Failing address: 0000000000000000 TEID: 0000000000000483
>>>>>>> [  349.361223] Fault in home space mode while using kernel ASCE.
>>>>>>> [  349.361240] AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
>>>>>>> [  349.361340] Oops: 0004 ilc:3 [#1] PREEMPT SMP
>>>>>>> [  349.361349] Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
>>>>>>> [  349.361436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
>>>>>>> [  349.361445] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
>>>>>>> [  349.361450] Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
>>>>>>> [  349.361464]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>>>>>>> [  349.361470] Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
>>>>>>> [  349.361475]            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
>>>>>>> [  349.361481]            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>>>>>> [  349.361486]            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
>>>>>>> [  349.361500] Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
>>>>>>> [  349.361500]            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
>>>>>>> [  349.361500]           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
>>>>>>> [  349.361500]           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
>>>>>>> [  349.361500]            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
>>>>>>> [  349.361500]            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
>>>>>>> [  349.361500]            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
>>>>>>> [  349.361500]            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
>>>>>>> [  349.361619] Call Trace:
>>>>>>> [  349.361627] ([<00000000003cabcc>] __free_slab+0x49c/0x6b0)
>>>>>>> [  349.361634]  [<00000000001f5886>] rcu_core+0x5a6/0x7e0
>>>>>>> [  349.361643]  [<0000000000ca2dea>] __do_softirq+0xf2/0x5c0
>>>>>>> [  349.361652]  [<0000000000152644>] irq_exit+0x104/0x130
>>>>>>> [  349.361659]  [<000000000010d222>] do_IRQ+0x9a/0xf0
>>>>>>> [  349.361667]  [<0000000000ca2344>] ext_int_handler+0x130/0x134
>>>>>>> [  349.361674]  [<0000000000103648>] enabled_wait+0x58/0x128
>>>>>>> [  349.361681] ([<0000000000103634>] enabled_wait+0x44/0x128)
>>>>>>> [  349.361688]  [<0000000000103b00>] arch_cpu_idle+0x40/0x58
>>>>>>> [  349.361695]  [<0000000000ca0544>] default_idle_call+0x3c/0x68
>>>>>>> [  349.361704]  [<000000000018eaa4>] do_idle+0xec/0x1c0
>>>>>>> [  349.361748]  [<000000000018ee0e>] cpu_startup_entry+0x36/0x40
>>>>>>> [  349.361756]  [<000000000122df34>] arch_call_rest_init+0x5c/0x88
>>>>>>> [  349.361761]  [<0000000000000000>] 0x0
>>>>>>> [  349.361765] INFO: lockdep is turned off.
>>>>>>> [  349.361769] Last Breaking-Event-Address:
>>>>>>> [  349.361774]  [<00000000003ca8f4>] __free_slab+0x1c4/0x6b0
>>>>>>> [  349.361781] Kernel panic - not syncing: Fatal exception in interrupt
>>>>>>>
>>>>>>>
>>>>>>> A fix that works for me (RFC):
>>>>>>>
>>>>>>> diff --git a/mm/slab.h b/mm/slab.h
>>>>>>> index a62372d0f271..b19a3f940338 100644
>>>>>>> --- a/mm/slab.h
>>>>>>> +++ b/mm/slab.h
>>>>>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>>>>>>>
>>>>>>>         rcu_read_lock();
>>>>>>>         memcg = READ_ONCE(s->memcg_params.memcg);
>>>>>>> -       if (likely(!mem_cgroup_is_root(memcg))) {
>>>>>>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>>>>>>>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>>>>>>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>>>>>>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
>>>>>>>
>>>>>>> -- 
>>>>>>> Karsten
>>>>>>>
>>>>>>> (I'm a dude)
>>>>>>>
>>>>>>>
>>>>
>>
>> -- 
>> Karsten
>>
>> (I'm a dude)
>>

-- 
Karsten

(I'm a dude)