From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3D76DC433F5 for ; Fri, 7 Jan 2022 00:14:04 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A45526B0074; Thu, 6 Jan 2022 19:14:03 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9F54B6B0075; Thu, 6 Jan 2022 19:14:03 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 896016B0078; Thu, 6 Jan 2022 19:14:03 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 744D56B0074 for ; Thu, 6 Jan 2022 19:14:03 -0500 (EST) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 2C75F180C3C46 for ; Fri, 7 Jan 2022 00:14:03 +0000 (UTC) X-FDA: 79001568366.09.DFD0C8B Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf19.hostedemail.com (Postfix) with ESMTP id 988CE1A0004 for ; Fri, 7 Jan 2022 00:14:02 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 9444861E87; Fri, 7 Jan 2022 00:14:01 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 95676C36AE0; Fri, 7 Jan 2022 00:14:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641514441; bh=3FrIFZaXJ02RwBEVOOXJcx4x1yw60ylJkU2mZqQjAMw=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=iNTeTWT/rpj0nfHUCszPJ4rTQzWujIzb6kDqSzNbIAMrPaSuJQ5TCWYUlXg7J58t3 rUhZpTY2Dt8G93NuLuL74WJ1QABK1FSthhFBfPKFHGdT9Mkm0U1QhbwBIdWLlQq23o aTWLXrwJJgmi9G6oqhxp4/fk750HCaa9fI+8zx6GGyAPbw7MeI31PBlzI2YUJMyzTw TqP/FDqyxNP3FgB+z89NywaUrQnMh5agO5V/FLUozA+Yc2jsuL/FnYtBt02d2ECpwt ekcd6/JtC5rC5ngXACrcqoZrBxJkR0I9jXkYbVlQWSeMIkCu9HiNrKATagE2yjkgu8 6r6/KQgCj4/Qg== Date: Thu, 6 Jan 2022 16:13:59 -0800 From: Eric Biggers To: Linus Torvalds Cc: Tejun Heo , Zefan Li , Johannes Weiner , Peter Zijlstra , Juri Lelli , Vincent Guittot , Ingo Molnar , Hillf Danton , syzbot , linux-fsdevel , Linux Kernel Mailing List , syzkaller-bugs , Linux-MM Subject: Re: psi_trigger_poll() is completely broken Message-ID: References: <000000000000e8f8f505d0e479a5@google.com> <20211211015620.1793-1-hdanton@sina.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="iNTeTWT/"; spf=pass (imf19.hostedemail.com: domain of ebiggers@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=ebiggers@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 988CE1A0004 X-Stat-Signature: ymbh3y4acax58nntdy9kpwsnbe7zhkpu X-HE-Tag: 1641514442-617385 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Jan 06, 2022 at 02:59:36PM -0800, Linus Torvalds wrote: > > So here's a COMPLETELY UNTESTED patch to try to fix the lifetime and locking. > > The locking was completely broken, in that psi_trigger_replace() > expected that the caller would hold some exclusive lock so that it > would release the correct previous trigger. The cgroup code doesn't > seem to have any such exclusion. > > This (UNTESTED!) patch fixes that breakage by just using a cmpxchg loop. > > And the lifetime was completely broken (and that's Eric's email) > because psi_trigger_replace() would drop the refcount to the old > trigger - assuming it got the right one - even though the old trigger > could still have active waiters on the waitqueue due to poll() or > select(). > > This (UNTESTED!) patch fixes _that_ breakage by making > psi_trigger_replace() instead just put the previous trigger on the > "stale_trigger" linked list, and never release it at all. > > It now gets released by "psi_trigger_release()" instead, which walks > the list at file release time. Doing "psi_trigger_replace(.., NULL)" > is not valid any more. > > And because the reference cannot go away, we now can throw away all > the incorrect temporary kref_get/put games from psi_trigger_poll(), > which didn't actually fix the race at all, only limited it to the poll > waitqueue. > > That also means we can remove the "synchronize_rcu()" from > psi_trigger_destroy(), since that was trying to hide all the problems > with the "take rcu lock and then do kref_get()" thing not having > locking. The locking still doesn't exist, but since we don't release > the old one when replacing it, the issue is moot. > > NOTE NOTE NOTE! Not only is this patch entirely untested, there are > optimizations you could do if there was some sane synchronization > between psi_trigger_poll() and psi_trigger_replace(). I put comments > about it in the code, but right now the code just assumes that > replacing a trigger is fairly rare (and since it requires write > permissions, it's not something random users can do). > > I'm not proud of this patch, but I think it might fix the fundamental > bugs in the code for now. > > It's not lovely, it has room for improvement, and I wish we didn't > need this kind of thing, but it looks superficially sane as a fix to > me. > > Comments? > > And once again: this is UNTESTED. I've compiled-tested it, it looks > kind of sane to me, but honestly, I don't know the code very well. > > Also, I'm not super-happy with how that 'psi_disabled' static branch > works. If somebody switches it off after it has been on, that will > also disable the freeing code, so now you'll be leaking memory. > > I couldn't find it in myself to care. I had to make the following changes to Linus's patch: diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 10430f75f21a..7d5afa89db44 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1255,7 +1255,7 @@ void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new) struct psi_trigger *old = *trigger_ptr; new->stale_trigger = old; - if (try_cmpxchg(trigger_ptr, old, new)) + if (try_cmpxchg(trigger_ptr, &old, new)) break; } @@ -1270,7 +1270,7 @@ void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new) /* No locking needed for final release */ void psi_trigger_release(void **trigger_ptr) { - struct psi_trigger *trigger; + struct psi_trigger *trigger = *trigger_ptr; if (static_branch_likely(&psi_disabled)) return; After that, the two reproducers I gave in https://lore.kernel.org/r/YbQUSlq76Iv5L4cC@sol.localdomain (the ones at the end of my mail, not the syzbot-generated ones which I didn't try) no longer crash the kernel. This is one way to fix the use-after-free, but the fact that it allows anyone who can write to a /proc/pressure/* file to cause the kernel to allocate an unbounded number of 'struct psi_trigger' structs is still really broken. I think we really need an answer to Linus' question: > What are the users? Can we make the rule for -EBUSY simply be that you > can _install_ a trigger, but you can't replace an existing one (except > with NULL, when you close it). ... since that would be a much better fix. The example in Documentation/accounting/psi.rst only does a single write; that case wouldn't be broken if we made multiple writes not work. - Eric