From: Mateusz Guzik <mjguzik@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christian Brauner <brauner@kernel.org>,
kernel test robot <oliver.sang@intel.com>,
oe-lkp@lists.linux.dev, lkp@intel.com,
Linux Memory Management List <linux-mm@kvack.org>,
linux-kernel@vger.kernel.org, ying.huang@intel.com,
feng.tang@intel.com, fengwei.yin@intel.com
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput -33.7% regression
Date: Wed, 3 Jul 2024 15:53:08 +0200 [thread overview]
Message-ID: <du5vnvwygzbtal6qogmxckawwwxgbppuq5qi5aeqcs5unrlpz3@3k5degdvflzq> (raw)
In-Reply-To: <CAHk-=whagTfq=EgwpuywUUu0ti7PPQuE_ftVjZOVrjnLwtS0Ng@mail.gmail.com>
On Tue, Jul 02, 2024 at 03:14:54PM -0700, Linus Torvalds wrote:
> On Tue, 2 Jul 2024 at 14:15, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > I asked you something in the previous e-mail though (with some nastiness
> > of the problem pointed out) concerning handling of slow vs fastpath,
> > here it is again:
> >
> > [..]for example did you know xfs does not honor rcu grace periods when
> > recycling inodes?
>
> I don't think it should matter.
>
> Yes, the vfs layer will access the inode, but then the sequence number
> checks should always verify that whatever access we did is then
> validated after-the-fact.
>
[..]
> Or do you see something I'm missing? I'm not loving how we deal with
> dentry->d_inode, but considering that the whole point of the RCU
> lookup is that we don't hold any locks, I think it's almost
> unavoidable.
>
Now I'm confused mate. Based on the convo so far I would expect you
would consider the xfs thing a no-go for the machinery.
You were rightfully pointing out the relationship dentry<->inode is not
stable and care is needed to grab the pointer, and even then the pointer
may be wrong by the time one finishes the work.
I presume you are also worried about callbacks not taking proper steps
when looking at the inode itself -- after all they can be racing with
teardown and have to handle it gracefully by returning an error.
This much comes with territory.
Inode changing identities adds potential trouble which does not need to
be there.
Certain things remain stable, for example the inode type (regular, fifo
etc.). This is where xfs not respecting grace periods comes in and
removes it.
It may happen to be for stat et al the way they can be implemented at
the moment this does not matter, but then it may start being a factor in
the future.
Say a type-dependent union showed up and needs to be looked at:
union {
fifo_pointer_t *fifo_data;
dev_t i_rdev;
...;
};
area pointed to by fifo_data is rcu-protected.
Then this:
if (IS_FIFO(inode)) {
fifo_pointer_t *f = rcu_dereference(inode->fifo_data);
if (!f)
return -ENODICE;
/* use f here */
}
while may look safe in terms of taking precatuions and bailing from the
fast path happens to be wrong.
Suppose the inode got reused and is now representing a device, i_rdev is
some funky value. This can race to read that int and trap dereferencing
what it found there. This would not happen if the grace period was
respected.
This is a bug class which does not need to be there and there may be
other possible bugs I did not think of.
There is also potential trouble with security modules as they
unfortunately have a hook for getattr. They presumably can be massaged
into handling a possibly dying inode (or failing with -ENODICE), I would
not put any stake into handling an inode which becomes something else as
they deal with it.
As a side note this is where I should also note the *current* LSM hooks
are racy as is. Suppose you can stat a particular file now, but a chown
to 1234 means you can't. Then your stat call can race against chown +
other ops. You can end up in a state where LSMs gave a green light and
only then the state changed to one you are not allowed to look at. This
would be solvable with per-inode sequence counters, but is arguably
beyond the scope of this patch and I'm not interested in working on it.
[I read the rest of the mail and have no immediate commentary, we will
see after I take a stab at implementing this]
next prev parent reply other threads:[~2024-07-03 13:53 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-27 2:41 kernel test robot
2024-06-27 6:25 ` Mateusz Guzik
2024-06-27 7:00 ` Mateusz Guzik
2024-06-27 16:32 ` Linus Torvalds
2024-06-27 16:55 ` Mateusz Guzik
2024-06-27 16:57 ` Linus Torvalds
2024-06-27 17:20 ` Mateusz Guzik
2024-06-27 17:23 ` Linus Torvalds
2024-07-02 7:19 ` Mateusz Guzik
2024-07-02 12:10 ` Mateusz Guzik
2024-07-02 16:47 ` Linus Torvalds
2024-07-02 17:02 ` Mateusz Guzik
2024-07-02 17:28 ` Linus Torvalds
2024-07-02 17:46 ` Mateusz Guzik
2024-07-02 17:58 ` Mateusz Guzik
2024-07-02 18:41 ` Linus Torvalds
2024-07-02 20:33 ` Mateusz Guzik
2024-07-02 20:42 ` Linus Torvalds
2024-07-02 21:15 ` Mateusz Guzik
2024-07-02 22:14 ` Linus Torvalds
2024-07-03 13:53 ` Mateusz Guzik [this message]
2024-07-03 14:08 ` Christian Brauner
2024-07-03 14:11 ` Mateusz Guzik
2024-07-03 16:47 ` Linus Torvalds
2024-07-03 8:34 ` Christian Brauner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=du5vnvwygzbtal6qogmxckawwwxgbppuq5qi5aeqcs5unrlpz3@3k5degdvflzq \
--to=mjguzik@gmail.com \
--cc=brauner@kernel.org \
--cc=feng.tang@intel.com \
--cc=fengwei.yin@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lkp@intel.com \
--cc=oe-lkp@lists.linux.dev \
--cc=oliver.sang@intel.com \
--cc=torvalds@linux-foundation.org \
--cc=ying.huang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox