From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Chuck Lever <cel@kernel.org>,
viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
akpm@linux-foundation.org, oliver.sang@intel.com,
feng.tang@intel.com, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, maple-tree@lists.infradead.org,
linux-mm@kvack.org, lkp@intel.com
Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
Date: Fri, 16 Feb 2024 11:33:18 -0500 [thread overview]
Message-ID: <20240216163318.w66ywrhpr5at46pi@revolver> (raw)
In-Reply-To: <20240216101546.xjcpzyb3pgf2eqm4@quack3>
* Jan Kara <jack@suse.cz> [240216 05:15]:
> On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> > * Jan Kara <jack@suse.cz> [240215 12:16]:
> > > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > > >
> > > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > > that dput() can be called safely.
> > > > > >
...
> >
> > > Then how do you imagine serializing the
> > > background operations like compaction? As much as I agree your argument is
> > > "theoretically clean", it seems a bit like a trap and there are definitely
> > > xarray users that are going to be broken by this (e.g.
> > > tag_pages_for_writeback())...
> >
> > I'm not sure I follow the trap logic. There are locks for the data
> > structure that need to be followed for reading (rcu) and writing
> > (spinlock for the maple tree). If you don't correctly lock the data
> > structure then you really are setting yourself up for potential issues
> > in the future.
> >
> > The limitations are outlined in the documentation as to how and when to
> > lock. I'm not familiar with the xarray users, but it does check for
> > locking with lockdep, but the way this is written bypasses the lockdep
> > checking as the locks are taken and dropped without the proper scope.
> >
> > If you feel like this is a trap, then maybe we need to figure out a new
> > plan to detect incorrect use?
>
> OK, I was a bit imprecise. What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact. Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?
>
...
Fair enough. You can think of the radix-tree and xarray as black boxes
as well; not everyone knows what's going on in there. The rbtree and
list are embedded in your own data structure and you have to do a lot of
work for your operations.
Right now, it is the way you have described; it's a data structure API.
It will work this way, but you lose the really neat and performant part
of the tree. If we do this right, we can compact at the same time as
reading data. We cannot do that when depending on the locks you use
today.
You don't have to use the mas_pause() functions, there are less
error-prone methods such as mt_find() that Chuck used here. If you want
to do really neat stuff though, you should be looking at the mas_*
interface. And yes, we totally hand you enough rope to hang yourself.
I'm not sure we can have an advanced interface without doing this.
Using the correct calls will allow for us to smoothly transition to a
model where you don't depend on the data remaining in the same place in
the tree (or xarray).
>
> > If you have other examples you think are unsafe then I can have a look
> > at them as well.
>
> I'm currently not aware of any but I'll let you know if I find some.
> Missing xas/mas_pause() seems really easy.
What if we convert the rcu_read_lock() to a mas_read_lock() or
xas_read_lock() and we can check to ensure the state isn't being locked
without being in the 'parked' state (paused or otherwise)?
mas_read_lock(struct ma_state *mas) {
assert(!mas_active(mas));
rcu_read_lock();
}
Would that be a reasonable resolution to your concern? Unfortunately,
what was done with the locking in this case would not be detected with
this change unless the rcu_read_lock() was replaced. IOW, people could
still use the rcu_read_lock() and skip the detection.
Doing the same in the mas_unlock() doesn't make as much sense since that
may be called without the intent to reuse the state at all. So we'd be
doing more work than necessary at the end of each loop or use.
Thanks,
Liam
next prev parent reply other threads:[~2024-02-16 16:35 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
2024-02-13 21:37 ` [PATCH RFC 1/7] libfs: Rename "so_ctx" Chuck Lever
2024-02-15 12:42 ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 2/7] libfs: Define a minimum directory offset Chuck Lever
2024-02-15 12:47 ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 3/7] libfs: Add simple_offset_empty() Chuck Lever
2024-02-15 12:53 ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 4/7] maple_tree: Add mtree_alloc_cyclic() Chuck Lever
2024-02-13 21:37 ` [PATCH RFC 5/7] test_maple_tree: testing the cyclic allocation Chuck Lever
2024-02-13 21:38 ` [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree Chuck Lever
2024-02-15 13:06 ` Jan Kara
2024-02-15 13:45 ` Chuck Lever
2024-02-15 14:02 ` Jan Kara
2024-02-16 15:15 ` Christian Brauner
2024-02-18 2:02 ` Oliver Sang
2024-02-18 15:57 ` Chuck Lever
2024-02-19 6:00 ` Oliver Sang
2024-02-13 21:38 ` [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir() Chuck Lever
2024-02-15 13:16 ` Jan Kara
2024-02-15 17:00 ` Liam R. Howlett
2024-02-15 17:16 ` Jan Kara
2024-02-15 21:07 ` Liam R. Howlett
2024-02-16 10:15 ` Jan Kara
2024-02-16 15:57 ` Matthew Wilcox
2024-02-16 16:33 ` Liam R. Howlett [this message]
2024-02-19 18:06 ` Jan Kara
2024-02-15 17:40 ` Chuck Lever
2024-02-15 21:08 ` Liam R. Howlett
2024-02-13 21:40 ` [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever III
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240216163318.w66ywrhpr5at46pi@revolver \
--to=liam.howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=cel@kernel.org \
--cc=feng.tang@intel.com \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lkp@intel.com \
--cc=maple-tree@lists.infradead.org \
--cc=oliver.sang@intel.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox