From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=cKhE=X4=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-5.3 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1
	autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 22F63C32792
	for <linux-mm@archiver.kernel.org>; Thu,  3 Oct 2019 08:48:36 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id C72A521A4C
	for <linux-mm@archiver.kernel.org>; Thu,  3 Oct 2019 08:48:35 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C72A521A4C
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.ibm.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 53F216B0005; Thu,  3 Oct 2019 04:48:35 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 4C81D8E0003; Thu,  3 Oct 2019 04:48:35 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 369C96B0007; Thu,  3 Oct 2019 04:48:35 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0099.hostedemail.com [216.40.44.99])
	by kanga.kvack.org (Postfix) with ESMTP id EE9226B0005
	for <linux-mm@kvack.org>; Thu,  3 Oct 2019 04:48:34 -0400 (EDT)
Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay03.hostedemail.com (Postfix) with SMTP id 7723F824CA35
	for <linux-mm@kvack.org>; Thu,  3 Oct 2019 08:48:34 +0000 (UTC)
X-FDA: 76001847348.08.boys16_4ac5e597f3127
X-HE-Tag: boys16_4ac5e597f3127
X-Filterd-Recvd-Size: 15035
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1])
	by imf37.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Thu,  3 Oct 2019 08:48:33 +0000 (UTC)
Received: from pps.filterd (m0187473.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x938lVHq101806
	for <linux-mm@kvack.org>; Thu, 3 Oct 2019 04:48:32 -0400
Received: from e06smtp01.uk.ibm.com (e06smtp01.uk.ibm.com [195.75.94.97])
	by mx0a-001b2d01.pphosted.com with ESMTP id 2vd1gy94nd-1
	(version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT)
	for <linux-mm@kvack.org>; Thu, 03 Oct 2019 04:48:29 -0400
Received: from localhost
	by e06smtp01.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
	for <linux-mm@kvack.org> from <kgraul@linux.ibm.com>;
	Thu, 3 Oct 2019 09:48:17 +0100
Received: from b06avi18626390.portsmouth.uk.ibm.com (9.149.26.192)
	by e06smtp01.uk.ibm.com (192.168.101.131) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted;
	(version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256)
	Thu, 3 Oct 2019 09:48:14 +0100
Received: from d06av24.portsmouth.uk.ibm.com (d06av24.portsmouth.uk.ibm.com [9.149.105.60])
	by b06avi18626390.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id x938lifx35979772
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Thu, 3 Oct 2019 08:47:44 GMT
Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 2ABF842049;
	Thu,  3 Oct 2019 08:48:13 +0000 (GMT)
Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id DC53E42042;
	Thu,  3 Oct 2019 08:48:11 +0000 (GMT)
Received: from [9.145.47.111] (unknown [9.145.47.111])
	by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP;
	Thu,  3 Oct 2019 08:48:11 +0000 (GMT)
Subject: Re: BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU
To: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>,
        Vladimir Davydov <vdavydov.dev@gmail.com>,
        David Rientjes <rientjes@google.com>,
        ",Christoph Lameter" <cl@linux.com>, Pekka Enberg <penberg@kernel.org>,
        Joonsoo Kim <iamjoonsoo.kim@lge.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        "linux-mm@kvack.org" <linux-mm@kvack.org>
References: <4a5108b4-5a2f-f83c-e6a8-5e0c9074ac69@linux.ibm.com>
 <20191002194121.GA9033@castle.DHCP.thefacebook.com>
 <20191003033540.GA10017@castle.DHCP.thefacebook.com>
From: Karsten Graul <kgraul@linux.ibm.com>
Organization: IBM Deutschland Research & Development GmbH
Date: Thu, 3 Oct 2019 10:48:15 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101
 Thunderbird/60.9.0
MIME-Version: 1.0
In-Reply-To: <20191003033540.GA10017@castle.DHCP.thefacebook.com>
Content-Type: text/plain; charset=utf-8
Content-Language: de-DE
X-TM-AS-GCONF: 00
x-cbid: 19100308-4275-0000-0000-0000036D9A1C
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 19100308-4276-0000-0000-000038809F6B
Message-Id: <da3c67e7-781c-e145-5c6e-c9f3ed4e57fb@linux.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2019-10-03_04:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501
 malwarescore=0 suspectscore=2 phishscore=0 bulkscore=0 spamscore=0
 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0
 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.0.1-1908290000 definitions=main-1910030085
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Hello, Roman!

On 03/10/2019 05:35, Roman Gushchin wrote:
> On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
>> Hello, Karsten!
>>
>> Thank you for the report!
>>
>> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
>>>
>>> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_=
flags =3D SLAB_TYPESAFE_BY_RCU.
>>> Right after the last SMC socket is destroyed, proto_unregister(&smc_p=
roto) is called, which=20
>>> calls kmem_cache_destroy(prot->slab). This results in a kernel crash =
in __free_slab().
>>> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was intr=
oduced by commit
>>> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup r=
emoval")
>>>
>>> I added a 'call graph', below of that is the crash log and a (simple)=
 patch that works for me,
>>> but I don't know if this is the correct way to fix it.
>>>
>>> (Please keep me on CC of this thread because I do not follow the mm m=
ailing list, thank you)
>>>
>>>
>>> kmem_cache_destroy()=20
>>>   -> shutdown_memcg_caches()
>>>     -> shutdown_cache()
>>>       -> __kmem_cache_shutdown()  (slub.c)
>>>         -> free_partial()
>>>           -> discard_slab()
>>> 	    -> free_slab()                                      -- call to _=
_free_slab() is delayed
>>> 	      -> call_rcu(rcu_free_slab)
>>>     -> memcg_unlink_cache()
>>>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!=
!
>>>     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
>>>     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn u=
ses rcu_barrier() to wait for rcu_batch,=20
>>>                                                             so work_f=
n is not further involved here...
>>> ... rcu grace period ...
>>> rcu_batch()
>>>   ...
>>>   -> rcu_free_slab()   (slub.c)
>>>     -> __free_slab()
>>>       -> uncharge_slab_page()
>>>         -> memcg_uncharge_slab()
>>> 	  -> memcg =3D READ_ONCE(s->memcg_params.memcg);          -- !!! mem=
cg =3D=3D NULL
>>> 	  -> mem_cgroup_lruvec(memcg)
>>> 	    -> mz =3D mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz =3D=3D=
 NULL
>>> 	    -> lruvec =3D &mz->lruvec;                            -- lruvec =
=3D=3D NULL
>>> 	    -> lruvec->pgdat =3D pgdat;                           -- *crash*
>>>
>>> The crash log:
>>
>> Hm, I might be wrong, but it seems that the problem is deeper: __free_=
slab()
>> called from the rcu path races with kmem_cache_destroy(), which is sup=
posed
>> to be called when there are no outstanding allocations (and correspond=
ing pages).
>> Any charged slab page actually holds a reference to the kmem_cache, wh=
ich prevents
>> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destr=
oy() ignores
>> it.

I don't see a race between kmem_cache_destroy() and __fre_slab(). kmem_ca=
che_destroy()
is already done when __free_slab() is called. Maybe the below trace shows=
 you the order of calls on my
system: kmem_cache_destroy() unlinks the memcg caches, sets up the rcu ca=
llbacks, then=20
it starts the slab_caches_to_rcu_destroy_workfn() worker and then kmem_ca=
che_destroy() is done.
You see that the smc code is getting control again after that.
The worker starts in between (before the smc_exit trace), but keeps waiti=
ng on the rcu_barrier().
Ages later (see time difference) the rcu grace period ends and rcu_free_s=
lab() is called, and it
crashes.=20
I hope that helps!

[  145.539917] kmem_cache_destroy before shutdown_memcg_caches() for 0000=
000068106f00
[  145.539929] free_slab call_rcu() for 00000000392c2e00, page is 000003d=
080e45000
[  145.539961] memcg_unlink_cache clearing memcg for 00000000392c2e00
[  145.539970] shutdown_cache adding to slab_caches_to_rcu_destroy queue =
for work: 00000000392c2e00

[  145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d=
080e4a200
[  145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
[  145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue =
for work: 00000000392c2900

[  145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 00000=
00068106f00

[  145.540075] kmem_cache_destroy before final shutdown_cache() for 00000=
00068106f00
[  145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d=
080e0a800
[  145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue =
for work: 0000000068106f00

[  145.540548] kmem_cache_destroy after final shutdown_cache() for 000000=
0068106f00
   kmem_cache_destroy is done
[  145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in =
workfunc
   slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() n=
ow
[  145.540619] smc.0698ae: smc_exit before smc_pnet_exit
   smc module exit code gets back control ...
[  145.540699] smc.616283: smc_exit before unregister_pernet_subsys
[  145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d=
080e45000
   much later the rcu callbacks are invoked, and will crash

>>
>> If my thoughts are correct, the commit you've mentioned didn't introdu=
ced this
>> issue, it just made it easier to reproduce.
>>
>> The proposed fix looks dubious to me: the problem isn't in the memcg p=
ointer
>> (it's just a luck that it crashes on it), and it seems incorrect to no=
t decrease
>> the slab statistics of the original memory cgroup.

I was quite sure that my approach is way to simple, it's better when the =
mm experts
work on that.

>>
>> What we probably need to do instead is to extend flush_memcg_workqueue=
() to
>> wait for all outstanding rcu free callbacks. I have to think a bit wha=
t's the best
>> way to fix it. How easy is to reproduce the problem?

I can reproduce this at will and I am happy to test any fixes you provide=
.

>=20
> After a second thought, flush_memcg_workqueue() already contains
> a rcu_barrier() call, so now first suspicion is that the last free() ca=
ll
> occurs after the kmem_cache_destroy() call. Can you, please, check if i=
t's not
> a case?
>=20

In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first on=
e, and after
that shutdown_memcg_caches() is called which setup the rcu callbacks.=20
So flush_memcg_workqueue() can not wait for them. If you follow the 'call=
 graph' above=20
using the RCU path in slub.c you can see when the callbacks are set up an=
d why no warning=20
is printed.


Second thought after I wrote all of the above: when flush_memcg_workqueue=
() already contains
an rcu_barrier(), whats the point of delaying the slab freeing in the rcu=
 case? All rcu
readers should be done now, so the rcu callbacks and the worker are not n=
eeded?
What am I missing here (I am sure I miss something, I am completely new i=
n the mm area)?

> Thanks!
>=20
>>
>>>
>>> 349.361168=C2=A8 Unable to handle kernel pointer dereference in virtu=
al kernel address space
>>
>> Btw, haven't you noticed anything suspicious in dmesg before this line=
?

There is no error or warning line in dmesg before this line. Actually, I =
think that
all pages are no longer in use so no warning is printed. Anyway, the slab=
 freeing is
delayed in any case when RCU is in use, right?


Karsten

>>
>> Thank you!
>>
>> Roman
>>
>>> 349.361210=C2=A8 Failing address: 0000000000000000 TEID: 000000000000=
0483
>>> 349.361223=C2=A8 Fault in home space mode while using kernel ASCE.
>>> 349.361240=C2=A8 AS:00000000017d4007 R3:000000007fbd0007 S:000000007f=
bff000 P:000000000000003d
>>> 349.361340=C2=A8 Oops: 0004 ilc:3 =C3=9D#1=C2=A8 PREEMPT SMP
>>> 349.361349=C2=A8 Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t=
_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntra=
ck ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at=
 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defra=
g_ipv6 nf_de
>>> 349.361436=C2=A8 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-0587=
2-g6133e3e4bada-dirty #14
>>> 349.361445=C2=A8 Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
>>> 349.361450=C2=A8 Krnl PSW : 0704d00180000000 00000000003cadb6 (__free=
_slab+0x686/0x6b0)
>>> 349.361464=C2=A8            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 =
CC:1 PM:0 RI:0 EA:3
>>> 349.361470=C2=A8 Krnl GPRS: 00000000f3a32928 0000000000000000 0000000=
07fbf5d00 000000000117c4b8
>>> 349.361475=C2=A8            0000000000000000 000000009e3291c1 0000000=
000000000 0000000000000000
>>> 349.361481=C2=A8            0000000000000003 0000000000000008 0000000=
02b478b00 000003d080a97600
>>> 349.361481=C2=A8            0000000000000003 0000000000000008 0000000=
02b478b00 000003d080a97600
>>> 349.361486=C2=A8            000000000117ba00 000003e000057db0 0000000=
0003cabcc 000003e000057c78
>>> 349.361500=C2=A8 Krnl Code: 00000000003cada6: e310a1400004        lg =
     %r1,320(%r10)
>>> 349.361500=C2=A8            00000000003cadac: c0e50046c286        bra=
sl   %r14,ca32b8
>>> 349.361500=C2=A8           #00000000003cadb2: a7f4fe36            brc=
     15,3caa1e
>>> 349.361500=C2=A8           >00000000003cadb6: e32060800024        stg=
     %r2,128(%r6)
>>> 349.361500=C2=A8            00000000003cadbc: a7f4fd9e            brc=
     15,3ca8f8
>>> 349.361500=C2=A8            00000000003cadc0: c0e50046790c        bra=
sl   %r14,c99fd8
>>> 349.361500=C2=A8            00000000003cadc6: a7f4fe2c            brc=
     15,3caa
>>> 349.361500=C2=A8            00000000003cadc6: a7f4fe2c            brc=
     15,3caa1e
>>> 349.361500=C2=A8            00000000003cadca: ecb1ffff00d9        agh=
ik   %r11,%r1,-1
>>> 349.361619=C2=A8 Call Trace:
>>> 349.361627=C2=A8 (=C3=9D<00000000003cabcc>=C2=A8 __free_slab+0x49c/0x=
6b0)
>>> 349.361634=C2=A8  =C3=9D<00000000001f5886>=C2=A8 rcu_core+0x5a6/0x7e0
>>> 349.361643=C2=A8  =C3=9D<0000000000ca2dea>=C2=A8 __do_softirq+0xf2/0x=
5c0
>>> 349.361652=C2=A8  =C3=9D<0000000000152644>=C2=A8 irq_exit+0x104/0x130
>>> 349.361659=C2=A8  =C3=9D<000000000010d222>=C2=A8 do_IRQ+0x9a/0xf0
>>> 349.361667=C2=A8  =C3=9D<0000000000ca2344>=C2=A8 ext_int_handler+0x13=
0/0x134
>>> 349.361674=C2=A8  =C3=9D<0000000000103648>=C2=A8 enabled_wait+0x58/0x=
128
>>> 349.361681=C2=A8 (=C3=9D<0000000000103634>=C2=A8 enabled_wait+0x44/0x=
128)
>>> 349.361688=C2=A8  =C3=9D<0000000000103b00>=C2=A8 arch_cpu_idle+0x40/0=
x58
>>> 349.361695=C2=A8  =C3=9D<0000000000ca0544>=C2=A8 default_idle_call+0x=
3c/0x68
>>> 349.361704=C2=A8  =C3=9D<000000000018eaa4>=C2=A8 do_idle+0xec/0x1c0
>>> 349.361748=C2=A8  =C3=9D<000000000018ee0e>=C2=A8 cpu_startup_entry+0x=
36/0x40
>>> 349.361756=C2=A8  =C3=9D<000000000122df34>=C2=A8 arch_call_rest_init+=
0x5c/0x88
>>> 349.361761=C2=A8  =C3=9D<0000000000000000>=C2=A8 0x0
>>> 349.361765=C2=A8 INFO: lockdep is turned off.
>>> 349.361769=C2=A8 Last Breaking-Event-Address:
>>> 349.361774=C2=A8  =C3=9D<00000000003ca8f4>=C2=A8 __free_slab+0x1c4/0x=
6b0
>>> 349.361781=C2=A8 Kernel panic - not syncing: Fatal exception in inter=
rupt
>>>
>>>
>>> A fix that works for me (RFC):
>>>
>>> diff --git a/mm/slab.h b/mm/slab.h
>>> index a62372d0f271..b19a3f940338 100644
>>> --- a/mm/slab.h
>>> +++ b/mm/slab.h
>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(s=
truct page *page, int order,
>>>
>>>         rcu_read_lock();
>>>         memcg =3D READ_ONCE(s->memcg_params.memcg);
>>> -       if (likely(!mem_cgroup_is_root(memcg))) {
>>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>>>                 lruvec =3D mem_cgroup_lruvec(page_pgdat(page), memcg)=
;
>>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << =
order));
>>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
>>>
>>> --=20
>>> Karsten
>>>
>>> (I'm a dude)
>>>
>>>