From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Chuck Lever <cel@kernel.org>,
viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
akpm@linux-foundation.org, oliver.sang@intel.com,
feng.tang@intel.com, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, maple-tree@lists.infradead.org,
linux-mm@kvack.org, lkp@intel.com
Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
Date: Thu, 15 Feb 2024 16:07:42 -0500 [thread overview]
Message-ID: <20240215210742.grjwdqdypvgrpwih@revolver> (raw)
In-Reply-To: <20240215171622.gsbjbjz6vau3emkh@quack3>
* Jan Kara <jack@suse.cz> [240215 12:16]:
> On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > >
> > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > But the RCU read lock has to be released on each loop iteration so
> > > > that dput() can be called safely.
> > > >
> > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > directory entry. mt_find() can do this for us, though it might be a
> > > > little less efficient than maintaining ma_state locally.
> > > >
> > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > rolling these two helpers together.
> > > >
> > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > >
> > > Well, in general I think even xas_next_entry() is not safe to use how
> > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > using offset_find_next() you should be protected from concurrent
> > > modifications of the mapping (whatever the underlying data structure is) -
> > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > maple tree? Am I missing something?
> >
> > If you are stopping, you should be pausing the iteration. Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue. We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
>
> I see. But we have versions of these structures that have locking external
> to the structure itself, don't we?
Ah, I do have them - but I don't want to propagate its use as the dream
is that it can be removed.
> Then how do you imagine serializing the
> background operations like compaction? As much as I agree your argument is
> "theoretically clean", it seems a bit like a trap and there are definitely
> xarray users that are going to be broken by this (e.g.
> tag_pages_for_writeback())...
I'm not sure I follow the trap logic. There are locks for the data
structure that need to be followed for reading (rcu) and writing
(spinlock for the maple tree). If you don't correctly lock the data
structure then you really are setting yourself up for potential issues
in the future.
The limitations are outlined in the documentation as to how and when to
lock. I'm not familiar with the xarray users, but it does check for
locking with lockdep, but the way this is written bypasses the lockdep
checking as the locks are taken and dropped without the proper scope.
If you feel like this is a trap, then maybe we need to figure out a new
plan to detect incorrect use?
Looking through tag_pages_for_writeback(), it does what is necessary to
keep a safe state - before it unlocks it calls xas_pause(). We have the
same on maple tree; mas_pause(). This will restart the next operation
from the root of the tree (the root can also change), to ensure that it
is safe.
If you have other examples you think are unsafe then I can have a look
at them as well.
You can make the existing code safe by also calling xas_pause() before
the rcu lock is dropped, but that is essentially what Chuck has done in
the maple tree conversion by using mt_find().
Regarding compaction, I would expect the write lock to be taken to avoid
any writes happening while compaction occurs. Readers would use rcu
locking to ensure they return either the old or new value. During the
write critical section, other writers would be in a
"mas_pause()/xas_pause()" state - so once they continue, they will
re-start the walk to the next element in the new tree.
If external locks are used, then compaction would be sub-optimal as it
may unnecessarily hold up readers, or partial write work (before the
point of a store into the maple tree).
Thanks,
Liam
next prev parent reply other threads:[~2024-02-15 21:08 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
2024-02-13 21:37 ` [PATCH RFC 1/7] libfs: Rename "so_ctx" Chuck Lever
2024-02-15 12:42 ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 2/7] libfs: Define a minimum directory offset Chuck Lever
2024-02-15 12:47 ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 3/7] libfs: Add simple_offset_empty() Chuck Lever
2024-02-15 12:53 ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 4/7] maple_tree: Add mtree_alloc_cyclic() Chuck Lever
2024-02-13 21:37 ` [PATCH RFC 5/7] test_maple_tree: testing the cyclic allocation Chuck Lever
2024-02-13 21:38 ` [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree Chuck Lever
2024-02-15 13:06 ` Jan Kara
2024-02-15 13:45 ` Chuck Lever
2024-02-15 14:02 ` Jan Kara
2024-02-16 15:15 ` Christian Brauner
2024-02-18 2:02 ` Oliver Sang
2024-02-18 15:57 ` Chuck Lever
2024-02-19 6:00 ` Oliver Sang
2024-02-13 21:38 ` [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir() Chuck Lever
2024-02-15 13:16 ` Jan Kara
2024-02-15 17:00 ` Liam R. Howlett
2024-02-15 17:16 ` Jan Kara
2024-02-15 21:07 ` Liam R. Howlett [this message]
2024-02-16 10:15 ` Jan Kara
2024-02-16 15:57 ` Matthew Wilcox
2024-02-16 16:33 ` Liam R. Howlett
2024-02-19 18:06 ` Jan Kara
2024-02-15 17:40 ` Chuck Lever
2024-02-15 21:08 ` Liam R. Howlett
2024-02-13 21:40 ` [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever III
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240215210742.grjwdqdypvgrpwih@revolver \
--to=liam.howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=cel@kernel.org \
--cc=feng.tang@intel.com \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lkp@intel.com \
--cc=maple-tree@lists.infradead.org \
--cc=oliver.sang@intel.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox