From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 13 Jun 2022 21:49:07 +0900
From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
To: Jann Horn <jannh@google.com>
Cc: Christoph Lameter <cl@linux.com>, Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
Message-ID: <YqcyQwCzSuFKkIpr@hyeyoo>
References: <20220608182205.2945720-1-jannh@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20220608182205.2945720-1-jannh@google.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Wed, Jun 08, 2022 at 08:22:05PM +0200, Jann Horn wrote:
> The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> the TID stays the same. However, two places in __slab_alloc() currently
> don't update the TID when deactivating the CPU slab.
> 
> If multiple operations race the right way, this could lead to an object
> getting lost; or, in an even more unlikely situation, it could even lead to
> an object being freed onto the wrong slab's freelist, messing up the
> `inuse` counter and eventually causing a page to be freed to the page
> allocator while it still contains slab objects.
> 
> (I haven't actually tested these cases though, this is just based on
> looking at the code. Writing testcases for this stuff seems like it'd be
> a pain...)
> 
> The race leading to state inconsistency is (all operations on the same CPU
> and kmem_cache):
> 
>  - task A: begin do_slab_free():
>     - read TID
>     - read pcpu freelist (==NULL)
>     - check `slab == c->slab` (true)
>  - [PREEMPT A->B]
>  - task B: begin slab_alloc_node():
>     - fastpath fails (`c->freelist` is NULL)
>     - enter __slab_alloc()
>     - slub_get_cpu_ptr() (disables preemption)
>     - enter ___slab_alloc()
>     - take local_lock_irqsave()
>     - read c->freelist as NULL
>     - get_freelist() returns NULL
>     - write `c->slab = NULL`
>     - drop local_unlock_irqrestore()
>     - goto new_slab
>     - slub_percpu_partial() is NULL
>     - get_partial() returns NULL
>     - slub_put_cpu_ptr() (enables preemption)
>  - [PREEMPT B->A]
>  - task A: finish do_slab_free():
>     - this_cpu_cmpxchg_double() succeeds()
>     - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]

I can reproduce this state (!c->slab && c->freelist becoming true)
when I artificially add scheduling points to the code:

diff --git a/mm/slub.c b/mm/slub.c
index b97fa5e21046..b8012fdf2607 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3001,6 +3001,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto check_new_slab;

 	slub_put_cpu_ptr(s->cpu_slab);
+
+	if (!in_atomic())
+		schedule();
+
 	slab = new_slab(s, gfpflags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);

@@ -3456,9 +3460,13 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	if (likely(slab == c->slab)) {
 #ifndef CONFIG_PREEMPT_RT
 		void **freelist = READ_ONCE(c->freelist);
+		unsigned long flags;

 		set_freepointer(s, tail_obj, freelist);

+		if (!in_atomic())
+			schedule();
+
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				freelist, tid,
@@ -3467,6 +3475,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
+
+		local_irq_save(flags);
+		WARN_ON(!READ_ONCE(c->slab) && READ_ONCE(c->freelist));
+		local_irq_restore(flags);
 #else /* CONFIG_PREEMPT_RT */
 		/*
 		 * We cannot use the lockless fastpath on PREEMPT_RT because if


> From there, the object on c->freelist will get lost if task B is allowed to
> continue from here: It will proceed to the retry_load_slab label,
> set c->slab, then jump to load_freelist, which clobbers c->freelist.
>
> But if we instead continue as follows, we get worse corruption:
> 
>  - task A: run __slab_free() on object from other struct slab:
>     - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
>  - task A: run slab_alloc_node() with NUMA node constraint:
>     - fastpath fails (c->slab is NULL)
>     - call __slab_alloc()
>     - slub_get_cpu_ptr() (disables preemption)
>     - enter ___slab_alloc()
>     - c->slab is NULL: goto new_slab
>     - slub_percpu_partial() is non-NULL
>     - set c->slab to slub_percpu_partial(c)
>     - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
>       from slab-2]
>     - goto redo
>     - node_match() fails
>     - goto deactivate_slab
>     - existing c->freelist is passed into deactivate_slab()
>     - inuse count of slab-1 is decremented to account for object from
>       slab-2

I didn't try to reproduce this, but I agree that SLUB can be misled
once the state (!c->slab && c->freelist) is reachable.

> At this point, the inuse count of slab-1 is 1 lower than it should be.
> This means that if we free all allocated objects in slab-1 except for one,
> SLUB will think that slab-1 is completely unused, and may free its page,
> leading to use-after-free.
> 
> Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
> Fixes: 03e404af26dc2 ("slub: fast release on full slab")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>  mm/slub.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index e5535020e0fdf..b97fa5e210469 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2936,6 +2936,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  
>  	if (!freelist) {
>  		c->slab = NULL;
> +		c->tid = next_tid(c->tid);
>  		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>  		stat(s, DEACTIVATE_BYPASS);
>  		goto new_slab;
> @@ -2968,6 +2969,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	freelist = c->freelist;
>  	c->slab = NULL;
>  	c->freelist = NULL;
> +	c->tid = next_tid(c->tid);
>  	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>  	deactivate_slab(s, slab, freelist);
>  
> 
> base-commit: 9886142c7a2226439c1e3f7d9b69f9c7094c3ef6
> -- 
> 2.36.1.476.g0c4daa206d-goog

With this patch applied, I could no longer trigger the WARN_ON() from
the debug diff above. This work is really nice. Thanks!
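
For anyone who wants to convince themselves why the TID bump closes the
window, here is a minimal userspace sketch that replays the interleaving
from the changelog in a single thread. It is only a model: struct
cpu_slab_model, cmpxchg_double_model() and replay_race() are hypothetical
names, the struct is not the real kmem_cache_cpu layout, and `tid + 1`
merely stands in for next_tid().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kmem_cache_cpu; not the kernel layout. */
struct cpu_slab_model {
	void *freelist;		/* first free object of the CPU slab */
	unsigned long tid;	/* transaction ID guarding the fastpath */
	void *slab;		/* currently active slab, or NULL */
};

/* Models this_cpu_cmpxchg_double(): update only if both fields still match. */
static bool cmpxchg_double_model(struct cpu_slab_model *c,
				 void *old_fl, unsigned long old_tid,
				 void *new_fl, unsigned long new_tid)
{
	if (c->freelist != old_fl || c->tid != old_tid)
		return false;
	c->freelist = new_fl;
	c->tid = new_tid;
	return true;
}

/*
 * Replays the race sequentially. Returns true if the model ends up in the
 * inconsistent state (c->slab == NULL && c->freelist != NULL).
 */
static bool replay_race(bool bump_tid_on_deactivate)
{
	int slab_storage, object;	/* dummy slab and dummy freed object */
	struct cpu_slab_model c = {
		.freelist = NULL, .tid = 0, .slab = &slab_storage,
	};

	/* task A: first half of do_slab_free() */
	unsigned long tid = c.tid;
	void *freelist = c.freelist;
	assert(c.slab == &slab_storage);	/* `slab == c->slab` holds */

	/* task B: ___slab_alloc() deactivates the CPU slab */
	c.slab = NULL;
	if (bump_tid_on_deactivate)
		c.tid = c.tid + 1;		/* stands in for next_tid() */

	/*
	 * task A resumes: the cmpxchg only succeeds if the TID is unchanged.
	 * On failure the real code would `goto redo`; the model just stops.
	 */
	(void)cmpxchg_double_model(&c, freelist, tid, &object, tid + 1);

	return c.slab == NULL && c.freelist != NULL;
}
```

In this model replay_race(false) ends in the inconsistent state, while
replay_race(true) does not, because the stale TID makes the cmpxchg fail.
It says nothing about how likely the interleaving is on real hardware.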

Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

BTW, I wonder how much this race affects machines in the real world.
Maybe it just shows up as a rare and undetectable memory leak?

-- 
Thanks,
Hyeonggon