Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Russell King (Oracle)" <linux@armlinux.org.uk>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Xie Yuanbin <xieyuanbin1@huawei.com>,
	brauner@kernel.org, jack@suse.cz, will@kernel.org,
	nico@fluxnic.net, akpm@linux-foundation.org, hch@lst.de,
	jack@suse.com, wozizhi@huaweicloud.com,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	lilinjie8@huawei.com, liaohua4@huawei.com,
	wangkefeng.wang@huawei.com, pangliyuan1@huawei.com
Subject: Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
Date: Wed, 26 Nov 2025 23:31:00 +0000	[thread overview]
Message-ID: <aSeNtFxD1WRjFaiR@shell.armlinux.org.uk> (raw)
In-Reply-To: <20251126200221.GE3538@ZenIV>

On Wed, Nov 26, 2025 at 08:02:21PM +0000, Al Viro wrote:
> On Wed, Nov 26, 2025 at 07:51:54PM +0000, Russell King (Oracle) wrote:
> 
> > I don't understand how that helps. Wasn't the report that the filename
> > crosses a page boundary in userspace, but the following page is
> > inaccessible which causes a fault to be taken (as it always would do).
> > Thus, wouldn't "addr" be a userspace address (that the kernel is
> > accessing) and thus be below TASK_SIZE ?
> > 
> > I'm also confused - if we can't take a fault and handle it while
> > reading the filename from userspace, how are pages that have been
> > swapped out or evicted from the page cache read back in from storage
> > which invariably results in sleeping - which we can't do here because
> > of the RCU context (not that I've ever understood RCU, which is why
> > I've always referred those bugs to Paul.)
> 
> No, the filename is already copied in kernel space *and* it's long enough
> to end right next to the end of page.  There's NUL before the end of page,
> at that, with '/' a couple of bytes prior.  We attempt to save on memory
> accesses, doing word-by-word fetches, starting from the beginning of
> component.  We *will* detect NUL and ignore all subsequent bytes; the
> problem is that the last 3 bytes of page might be '/', 'x' and '\0'.
> We call load_unaligned_zeropad() on page + PAGE_SIZE - 2.  And get
> a fetch that spans the end of page.
> 
> We don't care what's in the next page, if there is one mapped there
> to start with.  If there's nothing mapped, we want zeroes read from
> it, but all we really care about is having the bytes within *our*
> page read correctly - and no oops happening, obviously.
> 
> That fault is an extremely cold case on a fairly hot path.  We don't
> want to mess with disabling pagefaults, etc. - not for the sake
> of that.

I think, looking at the x86 handling, 32-bit ARM has missed a heck of
a lot of changes to the fault handling code, going all the way back to
pre-git history.

I seem to remember that I had updated it to match i386's implementation
at one point in the distant past, which is essentially what we have
today with a few tweaks. As code ages, it gets more difficult to
justify wholesale rewrites to bring it back up.

Relevant to this, looking at i386, that at some point added:

+       /*
+        * We fault-in kernel-space virtual memory on-demand. The
+        * 'reference' page table is init_mm.pgd.
+        *
+        * NOTE! We MUST NOT take any locks for this case. We may
+        * be in an interrupt or a critical region, and should
+        * only copy the information from the master page table,
+        * nothing more.
+        *
+        * This verifies that the fault happens in kernel space
+        * (error_code & 4) == 0, and that the fault was not a
+        * protection error (error_code & 1) == 0.
+        */
+       if (unlikely(address >= TASK_SIZE)) {
+               if (!(error_code & 5))
+                       goto vmalloc_fault;
+               /*
+                * Don't take the mm semaphore here. If we fixup a prefetch
+                * fault we could otherwise deadlock.
+                */
+               goto bad_area_nosemaphore;
+       }

which is after notify_die() and the test to see whether we need a
local_irq_enable(). This means we go straight to the fixing up etc
for these addresses.

In today's kernel, this has morphed into:

        /* Was the fault on kernel-controlled part of the address space? */
        if (unlikely(fault_in_kernel_space(address))) {
                do_kern_addr_fault(regs, error_code, address);
        } else {
                do_user_addr_fault(regs, error_code, address);

meaning any page fault for a kernel space address is handled entirely
separately from the normal page fault handling, and it looks like
this is entirely sensible.

Interestingly, however, I notice that x86 appears to no longer call
notify_die(DIE_PAGE_FAULT) in its page fault handling path, and I
wonder whether that's a regression on x86.

Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
suggestion is probably the easiest solution. However, whether it has
side effects, I couldn't say - the 32-bit ARM fault code has been
modified by quite a few people in ways I don't yet understand, so I
can't be certain at the moment whether it would cause problems.

I think the only thing to do is to try the solution and see what
breaks. I'm not in a position to be able to do that as, having not
had reason to touch 32-bit ARM for years, I don't have a hackable
platform nearby. Maybe Xie Yuanbin can test it?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

next prev parent reply	other threads:[~2025-11-26 23:31 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-26  9:05 [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
2025-11-26 10:19 ` [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held Xie Yuanbin
2025-11-26 18:10   ` Al Viro
2025-11-26 18:48     ` Al Viro
2025-11-26 19:05       ` Russell King (Oracle)
2025-11-26 19:26         ` Al Viro
2025-11-26 19:51           ` Russell King (Oracle)
2025-11-26 20:02             ` Al Viro
2025-11-26 22:25               ` david laight
2025-11-26 23:51                 ` Al Viro
2025-11-26 23:31               ` Russell King (Oracle) [this message]
2025-11-27  3:03                 ` Xie Yuanbin
2025-11-27  7:20                   ` Sebastian Andrzej Siewior
2025-11-27 11:20                     ` Xie Yuanbin
2025-11-28  1:39           ` Xie Yuanbin
2025-11-26 20:42   ` Al Viro
2025-11-26 10:27 ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
2025-11-26 21:12   ` Linus Torvalds
2025-11-27 10:27     ` Will Deacon
2025-11-27 10:57     ` Russell King (Oracle)
2025-11-28 17:06       ` Linus Torvalds
2025-11-29  1:01         ` Zizhi Wo
2025-11-29  1:35           ` Linus Torvalds
2025-11-29  4:08             ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
2025-11-29  9:08               ` Al Viro
2025-11-29  9:25                 ` Xie Yuanbin
2025-11-29  9:44                   ` Al Viro
2025-11-29 10:05                     ` Xie Yuanbin
2025-11-29 10:45                 ` david laight
2025-11-29  8:54             ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Al Viro
2025-12-01  2:08             ` Zizhi Wo
2025-11-29  2:18         ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
2025-12-01 13:28         ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Will Deacon
2025-12-02 12:43         ` Russell King (Oracle)
2025-12-02 13:02           ` Xie Yuanbin
2025-12-02 22:07           ` Linus Torvalds
2025-12-03  1:48             ` Xie Yuanbin
2025-12-05 12:08               ` Russell King (Oracle)
2025-11-26 18:55 ` Al Viro
2025-11-27  2:24   ` Zizhi Wo
2025-11-29  3:37     ` Al Viro
2025-11-30  3:01       ` [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context) Al Viro
2025-11-30 11:32         ` david laight
2025-11-30 16:43           ` Al Viro
2025-11-30 18:14             ` Magnus Lindholm
2025-11-30 19:03             ` david laight
2025-11-30 20:31               ` Al Viro
2025-11-30 20:32                 ` Al Viro
2025-11-30 22:16         ` Linus Torvalds
2025-11-30 23:37           ` Al Viro
2025-12-01  2:03       ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
2025-11-27 12:59 ` Will Deacon
2025-11-28  1:17   ` Zizhi Wo
2025-11-28  1:18     ` Zizhi Wo
2025-11-28  1:39       ` Zizhi Wo
2025-11-28 12:25         ` Will Deacon
2025-11-29  1:02           ` Zizhi Wo
2025-11-29  3:55             ` Al Viro
2025-12-01  2:38               ` Zizhi Wo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aSeNtFxD1WRjFaiR@shell.armlinux.org.uk \
    --to=linux@armlinux.org.uk \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.com \
    --cc=jack@suse.cz \
    --cc=liaohua4@huawei.com \
    --cc=lilinjie8@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nico@fluxnic.net \
    --cc=pangliyuan1@huawei.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wangkefeng.wang@huawei.com \
    --cc=will@kernel.org \
    --cc=wozizhi@huaweicloud.com \
    --cc=xieyuanbin1@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox