From: Jan Kara <jack@suse.cz>
To: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Cc: Jan Kara <jack@suse.cz>, Chuck Lever <cel@kernel.org>,
	viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
	akpm@linux-foundation.org, oliver.sang@intel.com,
	feng.tang@intel.com, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, maple-tree@lists.infradead.org,
	linux-mm@kvack.org, lkp@intel.com
Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
Date: Fri, 16 Feb 2024 11:15:46 +0100
Message-ID: <20240216101546.xjcpzyb3pgf2eqm4@quack3>
In-Reply-To: <20240215210742.grjwdqdypvgrpwih@revolver>

On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 12:16]:
> > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > >
> > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > that dput() can be called safely.
> > > > >
> > > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > > directory entry. mt_find() can do this for us, though it might be a
> > > > > little less efficient than maintaining ma_state locally.
> > > > >
> > > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > > rolling these two helpers together.
> > > > >
> > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > >
> > > > Well, in general I think even xas_next_entry() is not safe to use how
> > > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > > using offset_find_next() you should be protected from concurrent
> > > > modifications of the mapping (whatever the underlying data structure is) -
> > > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > > maple tree? Am I missing something?
> > >
> > > If you are stopping, you should be pausing the iteration. Although this
> > > works today, it's not how it should be used because if we make changes
> > > (ie: compaction requires movement of data), then you may end up with a
> > > UAF issue. We'd have no way of knowing you are depending on the tree
> > > structure to remain consistent.
> >
> > I see. But we have versions of these structures that have locking external
> > to the structure itself, don't we?
>
> Ah, I do have them - but I don't want to propagate its use as the dream
> is that it can be removed.
>
>
> > Then how do you imagine serializing the
> > background operations like compaction? As much as I agree your argument is
> > "theoretically clean", it seems a bit like a trap and there are definitely
> > xarray users that are going to be broken by this (e.g.
> > tag_pages_for_writeback())...
>
> I'm not sure I follow the trap logic. There are locks for the data
> structure that need to be followed for reading (rcu) and writing
> (spinlock for the maple tree). If you don't correctly lock the data
> structure then you really are setting yourself up for potential issues
> in the future.
>
> The limitations are outlined in the documentation as to how and when to
> lock. I'm not familiar with the xarray users, but it does check for
> locking with lockdep, but the way this is written bypasses the lockdep
> checking as the locks are taken and dropped without the proper scope.
>
> If you feel like this is a trap, then maybe we need to figure out a new
> plan to detect incorrect use?

OK, I was a bit imprecise. What I wanted to say is that this is a paradigm
shift in the sense that previously we mostly had (and still have) data
structure APIs (lists, rb-trees, radix-tree, now xarray) that guaranteed
that the data structure stays intact unless you call a function that
mutates it. Now maple trees are shifting more in the direction of a
black-box API where you cannot assume what happens inside. Which is fine,
but then we have e.g. these iterators which do not quite follow this
black-box design, and you have to remember subtle details like calling
mas_pause() before unlocking, which is IMHO error-prone. Ideally, users of
the black-box API shouldn't be exposed to the details of the internal
locking at all (but then the performance suffers, so I understand why you do
things this way). Second best would be if we could detect that the lock was
dropped without calling xas_pause() and warn about that. Or maybe
xas_unlock*() should call xas_pause() automagically and we'd have
similar helpers for RCU to do the magic for you?
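A minimal userspace sketch of that last idea, an unlock helper that pauses
the iterator on the caller's behalf. This is a toy model, not kernel code:
toy_tree, toy_state, toy_lock(), toy_unlock(), toy_pause(), toy_next() and
toy_compact() are all hypothetical names, a flat array stands in for tree
nodes, and a boolean stands in for the real lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, not a kernel API. */
struct toy_tree {
	const int *slots;	/* backing storage; may be replaced while unlocked */
	size_t nr;		/* number of valid slots */
	bool locked;		/* stand-in for the real lock */
};

struct toy_state {
	struct toy_tree *tree;
	size_t index;		/* next index to return */
	const int *node;	/* cached pointer into the tree; NULL when paused */
};

/* Forget cached internals so the next lookup restarts from the tree. */
static void toy_pause(struct toy_state *st)
{
	st->node = NULL;
}

static void toy_lock(struct toy_tree *t)
{
	assert(!t->locked);
	t->locked = true;
}

/* Unlock helper that pauses the iterator itself, so callers cannot forget. */
static void toy_unlock(struct toy_state *st)
{
	assert(st->tree->locked);
	toy_pause(st);
	st->tree->locked = false;
}

/* Store the next entry in *out and return true, or return false at the end. */
static bool toy_next(struct toy_state *st, int *out)
{
	assert(st->tree->locked);
	if (!st->node)
		st->node = st->tree->slots;	/* re-walk from the "root" */
	if (st->index >= st->tree->nr)
		return false;
	*out = st->node[st->index++];
	return true;
}

/* Stand-in for compaction: swap the backing storage while unlocked. */
static void toy_compact(struct toy_tree *t, const int *new_slots)
{
	assert(!t->locked);
	t->slots = new_slots;
}
```

With this shape, a caller that unlocks, has the storage replaced underneath
it, and relocks resumes from the new storage on the next toy_next() call,
because the cached pointer was dropped at unlock time. Whether something
similar is practical for xas_unlock*() or for the RCU read side is exactly
the open question.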

> Looking through tag_pages_for_writeback(), it does what is necessary to
> keep a safe state - before it unlocks it calls xas_pause(). We have the
> same on maple tree; mas_pause(). This will restart the next operation
> from the root of the tree (the root can also change), to ensure that it
> is safe.

OK, I've missed the xas_pause(). Thanks for correcting me.

> If you have other examples you think are unsafe then I can have a look
> at them as well.

I'm currently not aware of any but I'll let you know if I find some.
Forgetting to call xas_pause()/mas_pause() seems really easy.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR