linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>,
	kernel test robot <oliver.sang@intel.com>,
	oe-lkp@lists.linux.dev,  lkp@intel.com,
	Linux Memory Management List <linux-mm@kvack.org>,
	linux-kernel@vger.kernel.org,  ying.huang@intel.com,
	feng.tang@intel.com, fengwei.yin@intel.com
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput -33.7% regression
Date: Tue, 2 Jul 2024 11:41:44 -0700	[thread overview]
Message-ID: <CAHk-=wiBGSLNW6GGbnec-dCbn0kWvD+OXAa5VNXPBKLXYy5KOQ@mail.gmail.com> (raw)
In-Reply-To: <CAGudoHHg-T+ZOTm0fSpW0Hztfxn=fpfnksz5Q3=3YeCeEPo7LQ@mail.gmail.com>

On Tue, 2 Jul 2024 at 10:58, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Suppose the rcu fast path lookup reads the dentry seqc, then does all
> the legitimize_mnt and other work. Everything, except modifying the
> lockref. The caller is given a mnt to put (per-cpu scalable), dentry
> seqc read before any of the path validation and an indication this is
> rcu.

Yes.

> Then after whatever is done if the seqc still matches this is the same
> as if there was lockref get/put around it.

So this is partly why I was thinking of a callback. That "check
sequence number afterwards" is still important. And if it's a
callback, it can be done in the path walking code, and it can go on
and say "oh, I'll need to redo this without RCU".

If it's a "we returned a dentry under RCU", suddenly the caller has to
know this about the name lookup and do the repeating by hand.

And as long as we don't expose it to modules and only use it for
"stat()" and friends, I'm ok with it, but I'm just saying that it's
all a bit scary.

> The only worry is pointers suddenly going NULL or similar as
> dentry/inode is looked at. To be worked out on per-syscall basis.

We have subtle rules wrt dentry->d_inode. It can indeed become NULL at
any time during the RCU walk, since what protects it is the d_lock and
the dentry count.

The inode itself is then RCU-free'd, so it will *exist*, but you can't
just blindly use dentry->d_inode itself while under RCU.

Which is why it's cached in 'struct nameidata', and we validate it
with nd->seq when it's loaded. And why things like may_lookup() use
nd->inode, not the dentry.

And that's another rule that we probably should aim to not have escape
from the path walking as an interface.

Because it's much too easy to do

        struct inode *inode = d_backing_inode(path->dentry);

but that's just wrong during the RCU path walk.

Again, having this be a callback during the walk would avoid issues
like this. The callback can just pass in the separate inode pointer.

And then a sequence point failure will return -ECHILD and do the walk
again, while a callback success with all the sequence numbers matching
would return -ECALLBACK or whatever, so that the caller would know
"the stat information was already successfully completed by the
callback".

Anyway, that was my handwavy "this is why I was thinking of a
callback" thing. But it's also an example of just how nasty and subtle
this all is.

But I'm convinced this is all eminently *solvable*. There's nothing
fundamental here. Just a lot of small nasty details.

                    Linus


  reply	other threads:[~2024-07-02 18:42 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-27  2:41 kernel test robot
2024-06-27  6:25 ` Mateusz Guzik
2024-06-27  7:00   ` Mateusz Guzik
2024-06-27 16:32     ` Linus Torvalds
2024-06-27 16:55       ` Mateusz Guzik
2024-06-27 16:57       ` Linus Torvalds
2024-06-27 17:20         ` Mateusz Guzik
2024-06-27 17:23         ` Linus Torvalds
2024-07-02  7:19 ` Mateusz Guzik
2024-07-02 12:10   ` Mateusz Guzik
2024-07-02 16:47     ` Linus Torvalds
2024-07-02 17:02       ` Mateusz Guzik
2024-07-02 17:28         ` Linus Torvalds
2024-07-02 17:46           ` Mateusz Guzik
2024-07-02 17:58             ` Mateusz Guzik
2024-07-02 18:41               ` Linus Torvalds [this message]
2024-07-02 20:33                 ` Mateusz Guzik
2024-07-02 20:42                   ` Linus Torvalds
2024-07-02 21:15                     ` Mateusz Guzik
2024-07-02 22:14                       ` Linus Torvalds
2024-07-03 13:53                         ` Mateusz Guzik
2024-07-03 14:08                           ` Christian Brauner
2024-07-03 14:11                             ` Mateusz Guzik
2024-07-03 16:47                           ` Linus Torvalds
2024-07-03  8:34   ` Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wiBGSLNW6GGbnec-dCbn0kWvD+OXAa5VNXPBKLXYy5KOQ@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=feng.tang@intel.com \
    --cc=fengwei.yin@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lkp@intel.com \
    --cc=mjguzik@gmail.com \
    --cc=oe-lkp@lists.linux.dev \
    --cc=oliver.sang@intel.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox