Date: Fri, 16 Feb 2024 11:15:46 +0100
From: Jan Kara
To: "Liam R. Howlett"
Cc: Jan Kara, Chuck Lever, viro@zeniv.linux.org.uk, brauner@kernel.org,
	hughd@google.com, akpm@linux-foundation.org, oliver.sang@intel.com,
	feng.tang@intel.com, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, maple-tree@lists.infradead.org,
	linux-mm@kvack.org, lkp@intel.com
Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
Message-ID: <20240216101546.xjcpzyb3pgf2eqm4@quack3>
References: <170785993027.11135.8830043889278631735.stgit@91.116.238.104.host.secureserver.net>
	<170786028847.11135.14775608389430603086.stgit@91.116.238.104.host.secureserver.net>
	<20240215131638.cxipaxanhidb3pev@quack3>
	<20240215170008.22eisfyzumn5pw3f@revolver>
	<20240215171622.gsbjbjz6vau3emkh@quack3>
	<20240215210742.grjwdqdypvgrpwih@revolver>
In-Reply-To: <20240215210742.grjwdqdypvgrpwih@revolver>

On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> * Jan Kara [240215 12:16]:
> > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > * Jan Kara [240215 08:16]:
> > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > From: Chuck Lever
> > > > >
> > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > that dput() can be called safely.
> > > > >
> > > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > > directory entry. mt_find() can do this for us, though it might be a
> > > > > little less efficient than maintaining ma_state locally.
> > > > >
> > > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > > rolling these two helpers together.
> > > > >
> > > > > Signed-off-by: Chuck Lever
> > > >
> > > > Well, in general I think even xas_next_entry() is not safe to use how
> > > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > > using offset_find_next() you should be protected from concurrent
> > > > modifications of the mapping (whatever the underlying data structure is) -
> > > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > > maple tree? Am I missing something?
> > >
> > > If you are stopping, you should be pausing the iteration. Although this
> > > works today, it's not how it should be used because if we make changes
> > > (ie: compaction requires movement of data), then you may end up with a
> > > UAF issue. We'd have no way of knowing you are depending on the tree
> > > structure to remain consistent.
> >
> > I see. But we have versions of these structures that have locking external
> > to the structure itself, don't we?
>
> Ah, I do have them - but I don't want to propagate its use as the dream
> is that it can be removed.
>
>
> > Then how do you imagine serializing the
> > background operations like compaction? As much as I agree your argument is
> > "theoretically clean", it seems a bit like a trap and there are definitely
> > xarray users that are going to be broken by this (e.g.
> > tag_pages_for_writeback())...
>
> I'm not sure I follow the trap logic. There are locks for the data
> structure that need to be followed for reading (rcu) and writing
> (spinlock for the maple tree). If you don't correctly lock the data
> structure then you really are setting yourself up for potential issues
> in the future.
>
> The limitations are outlined in the documentation as to how and when to
> lock. I'm not familiar with the xarray users, but it does check for
> locking with lockdep, but the way this is written bypasses the lockdep
> checking as the locks are taken and dropped without the proper scope.
>
> If you feel like this is a trap, then maybe we need to figure out a new
> plan to detect incorrect use?

OK, I was a bit imprecise. What I wanted to say is that this is a shift in
the paradigm in the sense that previously, we mostly had (and still have)
data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
guaranteeing that unless you call into the function to mutate the data
structure it stays intact. Now maple trees are shifting more in a direction
of black-box API where you cannot assume what happens inside. Which is fine
but then we have e.g. these iterators which do not quite follow this
black-box design and you have to remember subtle details like calling
"mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
the black-box API shouldn't be exposed to the details of the internal
locking at all (but then the performance suffers so I understand why you do
things this way). Second to this ideal variant would be if we could detect
we unlocked the lock without calling xas_pause() and warn on that. Or maybe
xas_unlock*() should be calling xas_pause() automagically and we'd have
similar helpers for RCU to do the magic for you?

> Looking through tag_pages_for_writeback(), it does what is necessary to
> keep a safe state - before it unlocks it calls xas_pause(). We have the
> same on maple tree; mas_pause(). This will restart the next operation
> from the root of the tree (the root can also change), to ensure that it
> is safe.

OK, I've missed the xas_pause(). Thanks for correcting me.

> If you have other examples you think are unsafe then I can have a look
> at them as well.

I'm currently not aware of any but I'll let you know if I find some.
Missing xas/mas_pause() seems really easy.

								Honza
-- 
Jan Kara
SUSE Labs, CR
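
For illustration, the two ideas floated above (warn when the lock is dropped
without a pause, or have the unlock helper do the pause itself) can be
modelled in plain userspace C. This is only a rough sketch under loud
assumptions: toy_tree, toy_cursor and every toy_*() function are hypothetical
names made up here, not maple tree or xarray APIs, and a plain boolean stands
in for the real lock.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: all names below are hypothetical, not kernel APIs. */
struct toy_tree {
	int *slots;	/* backing storage; may be moved while unlocked */
	size_t nr;	/* number of valid entries */
	bool locked;	/* stand-in for the real lock */
};

struct toy_cursor {
	struct toy_tree *tree;
	size_t next_index;	/* logical position, valid across unlock */
	const int *cached;	/* pointer into internals, NULL when paused */
};

/* Forget internal pointers; the next step re-walks from the "root". */
static void toy_pause(struct toy_cursor *c)
{
	c->cached = NULL;
}

static void toy_lock(struct toy_tree *t)
{
	t->locked = true;
}

/* Checked unlock: returns false if the cursor was not paused first. */
static bool toy_unlock_checked(struct toy_tree *t, const struct toy_cursor *c)
{
	bool paused = (c->cached == NULL);

	t->locked = false;
	return paused;
}

/* Unlock helper that pauses for the caller, so it cannot be forgotten. */
static void toy_unlock_and_pause(struct toy_tree *t, struct toy_cursor *c)
{
	toy_pause(c);
	t->locked = false;
}

/* Return the next entry, or NULL at the end or when called unlocked. */
static const int *toy_next(struct toy_cursor *c)
{
	struct toy_tree *t = c->tree;

	if (!t->locked || c->next_index >= t->nr)
		return NULL;
	if (c->cached == NULL)
		c->cached = &t->slots[c->next_index];	/* restart from root */
	else
		c->cached++;				/* cached fast path */
	c->next_index++;
	return c->cached;
}

/* Simulate a reorganisation that moves entries while nobody holds the lock. */
static bool toy_compact(struct toy_tree *t)
{
	int *fresh;

	if (t->locked)
		return false;
	fresh = malloc(t->nr * sizeof(*fresh));
	if (!fresh)
		return false;
	memcpy(fresh, t->slots, t->nr * sizeof(*fresh));
	free(t->slots);
	t->slots = fresh;
	return true;
}
```

A caller that always goes through toy_unlock_and_pause() keeps working after
toy_compact() has moved the storage, because only next_index survives the
unlock. toy_unlock_checked() models the "detect and warn" variant: it reports
that the cursor still held a cached pointer when the lock was dropped.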