[Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
@ 2025-11-26  9:05 Zizhi Wo
  2025-11-26 10:19 ` [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held Xie Yuanbin
                   ` (3 more replies)
  0 siblings, 4 replies; 59+ messages in thread
From: Zizhi Wo @ 2025-11-26  9:05 UTC (permalink / raw)
  To: jack, brauner, hch, akpm, linux
  Cc: linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel, wozizhi,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

We're running into the following issue on an ARM32 platform with the linux
5.10 kernel:

[<c0300b78>] (__dabt_svc) from [<c0529cb8>] (link_path_walk.part.7+0x108/0x45c)
[<c0529cb8>] (link_path_walk.part.7) from [<c052a948>] (path_openat+0xc4/0x10ec)
[<c052a948>] (path_openat) from [<c052cf90>] (do_filp_open+0x9c/0x114)
[<c052cf90>] (do_filp_open) from [<c0511e4c>] (do_sys_openat2+0x418/0x528)
[<c0511e4c>] (do_sys_openat2) from [<c0513d98>] (do_sys_open+0x88/0xe4)
[<c0513d98>] (do_sys_open) from [<c03000c0>] (ret_fast_syscall+0x0/0x58)
...
[<c0315e34>] (unwind_backtrace) from [<c030f2b0>] (show_stack+0x20/0x24)
[<c030f2b0>] (show_stack) from [<c14239f4>] (dump_stack+0xd8/0xf8)
[<c14239f4>] (dump_stack) from [<c038d188>] (___might_sleep+0x19c/0x1e4)
[<c038d188>] (___might_sleep) from [<c031b6fc>] (do_page_fault+0x2f8/0x51c)
[<c031b6fc>] (do_page_fault) from [<c031bb44>] (do_DataAbort+0x90/0x118)
[<c031bb44>] (do_DataAbort) from [<c0300b78>] (__dabt_svc+0x58/0x80)
...

During the execution of hash_name()->load_unaligned_zeropad(), a potential
memory access beyond the PAGE boundary may occur. For example, when the
filename length is near the PAGE_SIZE boundary. This triggers a page fault,
which leads to a call to do_page_fault()->mmap_read_trylock(). If we can't
acquire the lock, we have to fall back to the mmap_read_lock() path, which
calls might_sleep(). This breaks RCU semantics because path lookup occurs
under an RCU read-side critical section. In linux-mainline, arm/arm64
do_page_fault() still has this problem:

lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.

And before commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for
dcache name comparison and hashing"), hash_name accessed the name byte by
byte.

To prevent load_unaligned_zeropad() from accessing beyond the valid memory
region, we would need to intercept such cases beforehand? But doing so
would require replicating the internal logic of load_unaligned_zeropad(),
including handling endianness and constructing the correct value manually.
Given that load_unaligned_zeropad() is used in many places across the
kernel, we currently haven't found a good solution to address this cleanly.

What would be the recommended way to handle this situation? Would
appreciate any feedback and guidance from the community. Thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26  9:05 [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
@ 2025-11-26 10:19 ` Xie Yuanbin
  2025-11-26 18:10   ` Al Viro
  2025-11-26 20:42   ` Al Viro
  2025-11-26 10:27 ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 59+ messages in thread
From: Xie Yuanbin @ 2025-11-26 10:19 UTC (permalink / raw)
  To: viro, brauner, jack, linux, will, nico, akpm, hch, jack, wozizhi
  Cc: linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1, Xie Yuanbin

When the path is initialized with LOOKUP_RCU flag in path_init(), the
rcu read lock will be acquired. Inside the rcu critical section,
load_unaligned_zeropad() may be called. According to the comments of
load_unaligned_zeropad(), when loading the memory, a page fault may be
triggered in the very unlikely case.

On arm32/arm64, a page fault may cause the current thread to sleep inside
mmap_read_lock_killable(). If CONFIG_DEBUG_ATOMIC_SLEEP=y, the following
warning will be triggered:
```log
[   16.243462] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1559
[   16.245271] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 68, name: test
[   16.246219] preempt_count: 0, expected: 0
[   16.246582] RCU nest depth: 1, expected: 0
[   16.247262] CPU: 0 UID: 0 PID: 68 Comm: test Not tainted 6.18.0-rc6-next-20251124 #28 PREEMPT
[   16.247432] Hardware name: Generic DT based system
[   16.247549] Call trace:
[   16.247618]  unwind_backtrace from show_stack+0x10/0x14
[   16.248442]  show_stack from dump_stack_lvl+0x50/0x5c
[   16.248458]  dump_stack_lvl from __might_resched+0x174/0x188
[   16.248475]  __might_resched from down_read_killable+0x18/0x10c
[   16.248490]  down_read_killable from mmap_read_lock_killable+0x24/0x84
[   16.248504]  mmap_read_lock_killable from lock_mm_and_find_vma+0x164/0x18c
[   16.248516]  lock_mm_and_find_vma from do_page_fault+0x1d4/0x4a0
[   16.248529]  do_page_fault from do_DataAbort+0x30/0xa8
[   16.248549]  do_DataAbort from __dabt_svc+0x44/0x60
[   16.248597] Exception stack(0xf0b41da0 to 0xf0b41de8)
[   16.248675] 1da0: c20b34f0 c3f23bf8 00000000 c389be50 f0b41e90 00000501 61c88647 00000000
[   16.248698] 1dc0: 80808080 fefefeff 2f2f2f2f eec51ffd c3219088 f0b41df0 c066d3e4 c066d218
[   16.248705] 1de0: 60000013 ffffffff
[   16.248736]  __dabt_svc from link_path_walk+0xa8/0x444
[   16.248752]  link_path_walk from path_openat+0xac/0xe18
[   16.248764]  path_openat from do_filp_open+0x94/0x134
[   16.248775]  do_filp_open from do_sys_openat2+0x9c/0xf0
[   16.248785]  do_sys_openat2 from sys_openat+0x80/0xa0
[   16.248806]  sys_openat from ret_fast_syscall+0x0/0x4c
[   16.248814] Exception stack(0xf0b41fa8 to 0xf0b41ff0)
[   16.248825] 1fa0:                   00000000 00000000 ffffff9c beb27d0c 00000242 000001b6
[   16.248834] 1fc0: 00000000 00000000 000c543c 00000142 00027e85 00000002 00000002 00000000
[   16.248841] 1fe0: beb27c20 beb27c0c 0006ea80 00072e78
[   16.923450] ------------[ cut here ]------------
[   16.923630] WARNING: kernel/rcu/tree_plugin.h:332 at rcu_note_context_switch+0x408/0x610, CPU#0: test/68
[   16.924780] Voluntary context switch within RCU read-side critical section!
[   16.924887] Modules linked in:
[   16.925670] CPU: 0 UID: 0 PID: 68 Comm: test Tainted: G        W           6.18.0-rc6-next-20251124 #28 PREEMPT
[   16.926120] Tainted: [W]=WARN
[   16.926257] Hardware name: Generic DT based system
[   16.926474] Call trace:
[   16.926487]  unwind_backtrace from show_stack+0x10/0x14
[   16.926899]  show_stack from dump_stack_lvl+0x50/0x5c
[   16.927318]  dump_stack_lvl from __warn+0xf8/0x200
[   16.927696]  __warn from warn_slowpath_fmt+0x180/0x208
[   16.928060]  warn_slowpath_fmt from rcu_note_context_switch+0x408/0x610
[   16.928768]  rcu_note_context_switch from __schedule+0xe4/0xa58
[   16.928917]  __schedule from schedule+0x70/0x124
[   16.929197]  schedule from schedule_preempt_disabled+0x14/0x20
[   16.929514]  schedule_preempt_disabled from rwsem_down_read_slowpath+0x26c/0x4e4
[   16.929875]  rwsem_down_read_slowpath from down_read_killable+0x58/0x10c
[   16.930320]  down_read_killable from mmap_read_lock_killable+0x24/0x84
[   16.930761]  mmap_read_lock_killable from lock_mm_and_find_vma+0x164/0x18c
[   16.931101]  lock_mm_and_find_vma from do_page_fault+0x1d4/0x4a0
[   16.931354]  do_page_fault from do_DataAbort+0x30/0xa8
[   16.931649]  do_DataAbort from __dabt_svc+0x44/0x60
[   16.931862] Exception stack(0xf0b41d88 to 0xf0b41dd0)
[   16.932063] 1d80:                   c3219088 eec5dffd f0b41ec0 00000002 c3219118 00000010
[   16.933732] 1da0: c321913c 00000002 00007878 c2da86c0 00000000 00000002 b8009440 f0b41ddc
[   16.934019] 1dc0: eec5dffd c0677300 60000013 ffffffff
[   16.934294]  __dabt_svc from __d_lookup_rcu+0xc4/0x10c
[   16.934468]  __d_lookup_rcu from lookup_fast+0xa0/0x190
[   16.934720]  lookup_fast from path_openat+0x154/0xe18
[   16.934953]  path_openat from do_filp_open+0x94/0x134
[   16.935141]  do_filp_open from do_sys_openat2+0x9c/0xf0
[   16.935384]  do_sys_openat2 from sys_openat+0x80/0xa0
[   16.935547]  sys_openat from ret_fast_syscall+0x0/0x4c
[   16.935799] Exception stack(0xf0b41fa8 to 0xf0b41ff0)
[   16.936007] 1fa0:                   00000000 00000000 ffffff9c beb27d0c 00000242 000001b6
[   16.936293] 1fc0: 00000000 00000000 000c543c 00000142 00027e85 00000002 00000002 00000000
[   16.936624] 1fe0: beb27c20 beb27c0c 0006ea80 00072e78
[   16.936780] ---[ end trace 0000000000000000 ]---
```

Add pagefault_disable() to handle this situation.

Fixes: b9a50f74905a ("ARM: 7450/1: dcache: select DCACHE_WORD_ACCESS for little-endian ARMv6+ CPUs")

Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Co-developed-by: Liyuan Pang <pangliyuan1@huawei.com>
---
On latest linux-next source, using arm32's multi_v7_defconfig, and
setting CONFIG_PREEMPT=y, CONFIG_DEBUG_ATOMIC_SLEEP=y, CONFIG_KFENCE=y,
CONFIG_ARM_PAN=n, then run the following testcase:
```c
static void *thread(void *arg)
{
	while (1) {
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);

		assert(p != (void *)-1);
		__asm__ volatile ("":"+r"(p)::"memory");

		munmap(p, 4096);
	}
}

int main()
{
	pthread_t th;
	int ret;
	char path[4096] = "/tmp";

	for (size_t i = 0; i < 2044; ++i) {
		strcat(path, "/x");
		ret = mkdir(path, 0755);
		assert(ret == 0 || errno == EEXIST);
	}
	strcat(path, "/xx");

	assert(strlen(path) == 4095);

	assert(pthread_create(&th, NULL, thread, NULL) == 0);

	while (1) {
		FILE *fp = fopen(path, "wb+");

		assert(fp);
		fclose(fp);
	}
	return 0;
}
```
The might sleep warning will be triggered immediately.

Another possible solution: call pagefault_disable() after rcu_read_lock()
and call pagefault_enable() before rcu_read_unlock(). Inside path_init()
and leave_rcu(). However, this solution has a relatively large scope of
page fault disabling.

 fs/dcache.c | 10 ++++++++--
 fs/namei.c  |  7 +++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 23d1752c29e6..154195909f07 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -264,23 +264,29 @@ fs_initcall(init_fs_dcache_sysctls);
  */
 static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char *ct, unsigned tcount)
 {
 	unsigned long a,b,mask;
 
+	pagefault_disable();
 	for (;;) {
 		a = read_word_at_a_time(cs);
 		b = load_unaligned_zeropad(ct);
 		if (tcount < sizeof(unsigned long))
 			break;
-		if (unlikely(a != b))
+		if (unlikely(a != b)) {
+			pagefault_enable();
 			return 1;
+		}
 		cs += sizeof(unsigned long);
 		ct += sizeof(unsigned long);
 		tcount -= sizeof(unsigned long);
-		if (!tcount)
+		if (!tcount) {
+			pagefault_enable();
 			return 0;
+		}
 	}
+	pagefault_enable();
 	mask = bytemask_from_count(tcount);
 	return unlikely(!!((a ^ b) & mask));
 }
 
 #else
diff --git a/fs/namei.c b/fs/namei.c
index 4ac7ff8e3a40..b04756e58ca3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2304,10 +2304,11 @@ static inline unsigned int fold_hash(unsigned long x, unsigned long y)
  */
 unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
 {
 	unsigned long a, x = 0, y = (unsigned long)salt;
 
+	pagefault_disable();
 	for (;;) {
 		if (!len)
 			goto done;
 		a = load_unaligned_zeropad(name);
 		if (len < sizeof(unsigned long))
@@ -2316,10 +2317,11 @@ unsigned int full_name_hash(const void *salt, const char *name, unsigned int len
 		name += sizeof(unsigned long);
 		len -= sizeof(unsigned long);
 	}
 	x ^= a & bytemask_from_count(len);
 done:
+	pagefault_enable();
 	return fold_hash(x, y);
 }
 EXPORT_SYMBOL(full_name_hash);
 
 /* Return the "hash_len" (hash and length) of a null-terminated string */
@@ -2328,18 +2330,20 @@ u64 hashlen_string(const void *salt, const char *name)
 	unsigned long a = 0, x = 0, y = (unsigned long)salt;
 	unsigned long adata, mask, len;
 	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
 
 	len = 0;
+	pagefault_disable();
 	goto inside;
 
 	do {
 		HASH_MIX(x, y, a);
 		len += sizeof(unsigned long);
 inside:
 		a = load_unaligned_zeropad(name+len);
 	} while (!has_zero(a, &adata, &constants));
+	pagefault_enable();
 
 	adata = prep_zero_mask(a, adata, &constants);
 	mask = create_zero_mask(adata);
 	x ^= a & zero_bytemask(mask);
 
@@ -2357,17 +2361,19 @@ static inline const char *hash_name(struct nameidata *nd,
 {
 	unsigned long a, b, x, y = (unsigned long)nd->path.dentry;
 	unsigned long adata, bdata, mask, len;
 	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
 
+	pagefault_disable();
 	/*
 	 * The first iteration is special, because it can result in
 	 * '.' and '..' and has no mixing other than the final fold.
 	 */
 	a = load_unaligned_zeropad(name);
 	b = a ^ REPEAT_BYTE('/');
 	if (has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)) {
+		pagefault_enable();
 		adata = prep_zero_mask(a, adata, &constants);
 		bdata = prep_zero_mask(b, bdata, &constants);
 		mask = create_zero_mask(adata | bdata);
 		a &= zero_bytemask(mask);
 		*lastword = a;
@@ -2383,10 +2389,11 @@ static inline const char *hash_name(struct nameidata *nd,
 		HASH_MIX(x, y, a);
 		len += sizeof(unsigned long);
 		a = load_unaligned_zeropad(name+len);
 		b = a ^ REPEAT_BYTE('/');
 	} while (!(has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)));
+	pagefault_enable();
 
 	adata = prep_zero_mask(a, adata, &constants);
 	bdata = prep_zero_mask(b, bdata, &constants);
 	mask = create_zero_mask(adata | bdata);
 	a &= zero_bytemask(mask);
-- 
2.51.0



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-26  9:05 [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
  2025-11-26 10:19 ` [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held Xie Yuanbin
@ 2025-11-26 10:27 ` Zizhi Wo
  2025-11-26 21:12   ` Linus Torvalds
  2025-11-26 18:55 ` Al Viro
  2025-11-27 12:59 ` Will Deacon
  3 siblings, 1 reply; 59+ messages in thread
From: Zizhi Wo @ 2025-11-26 10:27 UTC (permalink / raw)
  To: Zizhi Wo, jack, brauner, hch, akpm, linux, torvalds
  Cc: linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1



在 2025/11/26 17:05, Zizhi Wo 写道:
> We're running into the following issue on an ARM32 platform with the linux
> 5.10 kernel:
> 
> [<c0300b78>] (__dabt_svc) from [<c0529cb8>] (link_path_walk.part.7+0x108/0x45c)
> [<c0529cb8>] (link_path_walk.part.7) from [<c052a948>] (path_openat+0xc4/0x10ec)
> [<c052a948>] (path_openat) from [<c052cf90>] (do_filp_open+0x9c/0x114)
> [<c052cf90>] (do_filp_open) from [<c0511e4c>] (do_sys_openat2+0x418/0x528)
> [<c0511e4c>] (do_sys_openat2) from [<c0513d98>] (do_sys_open+0x88/0xe4)
> [<c0513d98>] (do_sys_open) from [<c03000c0>] (ret_fast_syscall+0x0/0x58)
> ...
> [<c0315e34>] (unwind_backtrace) from [<c030f2b0>] (show_stack+0x20/0x24)
> [<c030f2b0>] (show_stack) from [<c14239f4>] (dump_stack+0xd8/0xf8)
> [<c14239f4>] (dump_stack) from [<c038d188>] (___might_sleep+0x19c/0x1e4)
> [<c038d188>] (___might_sleep) from [<c031b6fc>] (do_page_fault+0x2f8/0x51c)
> [<c031b6fc>] (do_page_fault) from [<c031bb44>] (do_DataAbort+0x90/0x118)
> [<c031bb44>] (do_DataAbort) from [<c0300b78>] (__dabt_svc+0x58/0x80)
> ...
> 
> During the execution of hash_name()->load_unaligned_zeropad(), a potential
> memory access beyond the PAGE boundary may occur. For example, when the
> filename length is near the PAGE_SIZE boundary. This triggers a page fault,
> which leads to a call to do_page_fault()->mmap_read_trylock(). If we can't
> acquire the lock, we have to fall back to the mmap_read_lock() path, which
> calls might_sleep(). This breaks RCU semantics because path lookup occurs
> under an RCU read-side critical section. In linux-mainline, arm/arm64
> do_page_fault() still has this problem:
> 
> lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.
> 
> And before commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for
> dcache name comparison and hashing"), hash_name accessed the name byte by
> byte.
> 
> To prevent load_unaligned_zeropad() from accessing beyond the valid memory
> region, we would need to intercept such cases beforehand? But doing so
> would require replicating the internal logic of load_unaligned_zeropad(),
> including handling endianness and constructing the correct value manually.
> Given that load_unaligned_zeropad() is used in many places across the
> kernel, we currently haven't found a good solution to address this cleanly.
> 
> What would be the recommended way to handle this situation? Would
> appreciate any feedback and guidance from the community. Thanks!
> 

As a detail, we enabled CONFIG_DEBUG_PREEMPT and CONFIG_KFENCE, which
allowed us to catch the potential out-of-bounds access.

Thanks,
Zizhi Wo



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 10:19 ` [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held Xie Yuanbin
@ 2025-11-26 18:10   ` Al Viro
  2025-11-26 18:48     ` Al Viro
  2025-11-26 20:42   ` Al Viro
  1 sibling, 1 reply; 59+ messages in thread
From: Al Viro @ 2025-11-26 18:10 UTC (permalink / raw)
  To: Xie Yuanbin
  Cc: brauner, jack, linux, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 06:19:52PM +0800, Xie Yuanbin wrote:
> When the path is initialized with LOOKUP_RCU flag in path_init(), the
> rcu read lock will be acquired. Inside the rcu critical section,
> load_unaligned_zeropad() may be called. According to the comments of
> load_unaligned_zeropad(), when loading the memory, a page fault may be
> triggered in the very unlikely case.

> Add pagefault_disable() to handle this situation.

Way too costly, IMO.  That needs to be dealt with in page fault handler
and IIRC arm used to do that; did that get broken at some point?


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 18:10   ` Al Viro
@ 2025-11-26 18:48     ` Al Viro
  2025-11-26 19:05       ` Russell King (Oracle)
  0 siblings, 1 reply; 59+ messages in thread
From: Al Viro @ 2025-11-26 18:48 UTC (permalink / raw)
  To: Xie Yuanbin
  Cc: brauner, jack, linux, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 06:10:31PM +0000, Al Viro wrote:
> On Wed, Nov 26, 2025 at 06:19:52PM +0800, Xie Yuanbin wrote:
> > When the path is initialized with LOOKUP_RCU flag in path_init(), the
> > rcu read lock will be acquired. Inside the rcu critical section,
> > load_unaligned_zeropad() may be called. According to the comments of
> > load_unaligned_zeropad(), when loading the memory, a page fault may be
> > triggered in the very unlikely case.
> 
> > Add pagefault_disable() to handle this situation.
> 
> Way too costly, IMO.  That needs to be dealt with in page fault handler
> and IIRC arm used to do that; did that get broken at some point?

FWIW, on arm64 it's dealt with hitting do_translation_fault(), which
sees that address is kernel-space one, goes into do_bad_area(), sees
that it's from kernel mode and proceeds into __do_kernel_fault() and
from there - to fixup_exception().  No messing with VMA lookups, etc.
anywhere in process.

Had been that way since 760bfb47c36a "arm64: fault: Route pte translation
faults via do_translation_fault"...

Did that get broken?  Or is it actually arm32-specific?

In any case, making every architecture pay for that is insane - taking
a fault there is extremely rare to start with and callers sit on very
hot paths.  Deal with that in the fault handler.  Really.

We have no business even looking at VMAs when the fault is on kernel-mode
read access to kernel-space address.  And callers of load_unaligned_zeropad()
are not the place for dealing with that.

It's been years since I looked at 32bit arm exception handling, so I'd need
quite a bit of (re)RTF{S,M} before I'm comfortable with poking in
arch/arm/mm/fault.c; better let ARM folks deal with that.  But arch/* is
where it should be dealt with; as for papering over that in fs/*:
NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-26  9:05 [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
  2025-11-26 10:19 ` [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held Xie Yuanbin
  2025-11-26 10:27 ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
@ 2025-11-26 18:55 ` Al Viro
  2025-11-27  2:24   ` Zizhi Wo
  2025-11-27 12:59 ` Will Deacon
  3 siblings, 1 reply; 59+ messages in thread
From: Al Viro @ 2025-11-26 18:55 UTC (permalink / raw)
  To: Zizhi Wo
  Cc: jack, brauner, hch, akpm, linux, linux-fsdevel, linux-kernel,
	linux-mm, linux-arm-kernel, wozizhi, yangerkun, wangkefeng.wang,
	pangliyuan1, xieyuanbin1

On Wed, Nov 26, 2025 at 05:05:05PM +0800, Zizhi Wo wrote:

> under an RCU read-side critical section. In linux-mainline, arm/arm64
> do_page_fault() still has this problem:
> 
> lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.

	arm64 shouldn't hit do_page_fault() in the first place, and
do_translation_fault() there will see that address is beyond TASK_SIZE
and go straight to do_bad_area() -> __do_kernel_fault() -> fixup_exception(),
with no messing with mmap_lock.

	Can anybody confirm that problem exists on arm64 (ideally - with
reproducer)?


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 18:48     ` Al Viro
@ 2025-11-26 19:05       ` Russell King (Oracle)
  2025-11-26 19:26         ` Al Viro
  0 siblings, 1 reply; 59+ messages in thread
From: Russell King (Oracle) @ 2025-11-26 19:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Xie Yuanbin, brauner, jack, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 06:48:20PM +0000, Al Viro wrote:
> It's been years since I looked at 32bit arm exception handling, so I'd need
> quite a bit of (re)RTF{S,M} before I'm comfortable with poking in
> arch/arm/mm/fault.c; better let ARM folks deal with that.  But arch/* is
> where it should be dealt with; as for papering over that in fs/*:

Don't expect that to happen. I've not looked at it for over a decade,
I do very little 32-bit ARM stuff anymore. Others have modified the
fault handling, the VM has changed, I basically no longer have the
knowledge. Effectively, 32-bit ARM is unmaintained now, although it
still has many users.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 19:05       ` Russell King (Oracle)
@ 2025-11-26 19:26         ` Al Viro
  2025-11-26 19:51           ` Russell King (Oracle)
  2025-11-28  1:39           ` Xie Yuanbin
  0 siblings, 2 replies; 59+ messages in thread
From: Al Viro @ 2025-11-26 19:26 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Xie Yuanbin, brauner, jack, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 07:05:05PM +0000, Russell King (Oracle) wrote:
> On Wed, Nov 26, 2025 at 06:48:20PM +0000, Al Viro wrote:
> > It's been years since I looked at 32bit arm exception handling, so I'd need
> > quite a bit of (re)RTF{S,M} before I'm comfortable with poking in
> > arch/arm/mm/fault.c; better let ARM folks deal with that.  But arch/* is
> > where it should be dealt with; as for papering over that in fs/*:
> 
> Don't expect that to happen. I've not looked at it for over a decade,
> I do very little 32-bit ARM stuff anymore. Others have modified the
> fault handling, the VM has changed, I basically no longer have the
> knowledge. Effectively, 32-bit ARM is unmaintained now, although it
> still has many users.

Joy...  For quick and dirty variant (on current tree), how about
adding
	if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
		goto no_context;

right after

	if (!ttbr0_usermode_access_allowed(regs))
		goto no_context;

in do_page_fault() there?

NOTE: that might or might not break vdso; I don't think it would, but...


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 19:26         ` Al Viro
@ 2025-11-26 19:51           ` Russell King (Oracle)
  2025-11-26 20:02             ` Al Viro
  2025-11-28  1:39           ` Xie Yuanbin
  1 sibling, 1 reply; 59+ messages in thread
From: Russell King (Oracle) @ 2025-11-26 19:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Xie Yuanbin, brauner, jack, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 07:26:40PM +0000, Al Viro wrote:
> On Wed, Nov 26, 2025 at 07:05:05PM +0000, Russell King (Oracle) wrote:
> > On Wed, Nov 26, 2025 at 06:48:20PM +0000, Al Viro wrote:
> > > It's been years since I looked at 32bit arm exception handling, so I'd need
> > > quite a bit of (re)RTF{S,M} before I'm comfortable with poking in
> > > arch/arm/mm/fault.c; better let ARM folks deal with that.  But arch/* is
> > > where it should be dealt with; as for papering over that in fs/*:
> > 
> > Don't expect that to happen. I've not looked at it for over a decade,
> > I do very little 32-bit ARM stuff anymore. Others have modified the
> > fault handling, the VM has changed, I basically no longer have the
> > knowledge. Effectively, 32-bit ARM is unmaintained now, although it
> > still has many users.
> 
> Joy...  For quick and dirty variant (on current tree), how about
> adding
> 	if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
> 		goto no_context;
> 
> right after
> 
> 	if (!ttbr0_usermode_access_allowed(regs))
> 		goto no_context;
> 
> in do_page_fault() there?
> 
> NOTE: that might or might not break vdso; I don't think it would, but...

I don't understand how that helps. Wasn't the report that the filename
crosses a page boundary in userspace, but the following page is
inaccessible which causes a fault to be taken (as it always would do).
Thus, wouldn't "addr" be a userspace address (that the kernel is
accessing) and thus be below TASK_SIZE ?

I'm also confused - if we can't take a fault and handle it while
reading the filename from userspace, how are pages that have been
swapped out or evicted from the page cache read back in from storage
which invariably results in sleeping - which we can't do here because
of the RCU context (not that I've ever understood RCU, which is why
I've always referred those bugs to Paul.)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 19:51           ` Russell King (Oracle)
@ 2025-11-26 20:02             ` Al Viro
  2025-11-26 22:25               ` david laight
  2025-11-26 23:31               ` Russell King (Oracle)
  0 siblings, 2 replies; 59+ messages in thread
From: Al Viro @ 2025-11-26 20:02 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Xie Yuanbin, brauner, jack, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 07:51:54PM +0000, Russell King (Oracle) wrote:

> I don't understand how that helps. Wasn't the report that the filename
> crosses a page boundary in userspace, but the following page is
> inaccessible which causes a fault to be taken (as it always would do).
> Thus, wouldn't "addr" be a userspace address (that the kernel is
> accessing) and thus be below TASK_SIZE ?
> 
> I'm also confused - if we can't take a fault and handle it while
> reading the filename from userspace, how are pages that have been
> swapped out or evicted from the page cache read back in from storage
> which invariably results in sleeping - which we can't do here because
> of the RCU context (not that I've ever understood RCU, which is why
> I've always referred those bugs to Paul.)

No, the filename is already copied in kernel space *and* it's long enough
to end right next to the end of page.  There's NUL before the end of page,
at that, with '/' a couple of bytes prior.  We attempt to save on memory
accesses, doing word-by-word fetches, starting from the beginning of
component.  We *will* detect NUL and ignore all subsequent bytes; the
problem is that the last 3 bytes of page might be '/', 'x' and '\0'.
We call load_unaligned_zeropad() on page + PAGE_SIZE - 2.  And get
a fetch that spans the end of page.

We don't care what's in the next page, if there is one mapped there
to start with.  If there's nothing mapped, we want zeroes read from
it, but all we really care about is having the bytes within *our*
page read correctly - and no oops happening, obviously.

That fault is an extremely cold case on a fairly hot path.  We don't
want to mess with disabling pagefaults, etc. - not for the sake
of that.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 10:19 ` [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held Xie Yuanbin
  2025-11-26 18:10   ` Al Viro
@ 2025-11-26 20:42   ` Al Viro
  1 sibling, 0 replies; 59+ messages in thread
From: Al Viro @ 2025-11-26 20:42 UTC (permalink / raw)
  To: Xie Yuanbin
  Cc: brauner, jack, linux, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 06:19:52PM +0800, Xie Yuanbin wrote:

> On latest linux-next source, using arm32's multi_v7_defconfig, and
> setting CONFIG_PREEMPT=y, CONFIG_DEBUG_ATOMIC_SLEEP=y, CONFIG_KFENCE=y,
> CONFIG_ARM_PAN=n, then run the following testcase:
> ```c
> static void *thread(void *arg)
> {
> 	while (1) {
> 		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
> 
> 		assert(p != (void *)-1);
> 		__asm__ volatile ("":"+r"(p)::"memory");
> 
> 		munmap(p, 4096);
> 	}
> }
> 
> int main()
> {
> 	pthread_t th;
> 	int ret;
> 	char path[4096] = "/tmp";
> 
> 	for (size_t i = 0; i < 2044; ++i) {
> 		strcat(path, "/x");
> 		ret = mkdir(path, 0755);
> 		assert(ret == 0 || errno == EEXIST);
> 	}
> 	strcat(path, "/xx");
> 
> 	assert(strlen(path) == 4095);
> 
> 	assert(pthread_create(&th, NULL, thread, NULL) == 0);
> 
> 	while (1) {
> 		FILE *fp = fopen(path, "wb+");
> 
> 		assert(fp);
> 		fclose(fp);
> 	}
> 	return 0;
> }
> ```
> The might sleep warning will be triggered immediately.

"Immediately" part is interesting - presumably KFENCE is playing silly
buggers with PTEs in there.

Anyway, the underlying bug is that fault in this scenario should not
even look at VMAs - it should get to fixup_exception() and be done
with that, with minimal overhead for all other cause of faults.

We have an unaligned 32bit fetch from kernel address, spanning the
page boundary, with the second page unmapped or unreadable.  Access
comes from kernel mode.  All we want is to fail the fault without
an oops, blocking, etc.

AFAICS, on arm32 looks for VMA at address > TASK_SIZE won't find
a damn thing anyway, so skipping these attempts and going to
bad_area looks safe enough, if we do that after all early cases...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-26 10:27 ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
@ 2025-11-26 21:12   ` Linus Torvalds
  2025-11-27 10:27     ` Will Deacon
  2025-11-27 10:57     ` Russell King (Oracle)
  0 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-11-26 21:12 UTC (permalink / raw)
  To: Zizhi Wo, Russell King, Catalin Marinas, Will Deacon
  Cc: jack, brauner, hch, akpm, linux-fsdevel, linux-kernel, linux-mm,
	linux-arm-kernel, yangerkun, wangkefeng.wang, pangliyuan1,
	xieyuanbin1

On Wed, 26 Nov 2025 at 02:27, Zizhi Wo <wozizhi@huaweicloud.com> wrote:
>
> 在 2025/11/26 17:05, Zizhi Wo 写道:
> > We're running into the following issue on an ARM32 platform with the linux
> > 5.10 kernel:
> >
> > During the execution of hash_name()->load_unaligned_zeropad(), a potential
> > memory access beyond the PAGE boundary may occur.

That is correct.

However:

> >                This triggers a page fault,
> > which leads to a call to do_page_fault()->mmap_read_trylock().

That should *not* happen.  For kernel addresses, mmap_read_trylock()
should never trigger, much less the full mmap_read_lock().

See for example the x86 fault handling in  handle_page_fault():

        if (unlikely(fault_in_kernel_space(address))) {
                do_kern_addr_fault(regs, error_code, address);

and the kernel address case never triggers the mmap lock, because
while faults on kernel addresses can happen for various reasons, they
are never memory mappings.

I'm seeing similar logic in the arm tree, although the check is
different. do_translation_fault() checks for TASK_SIZE.

        if (addr < TASK_SIZE)
                return do_page_fault(addr, fsr, regs);

but it appears that there are paths to do_page_fault() that do not
have this check, ie that do_DataAbort() function does

        if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
                return;

and It's not immediately obvious, but that can call do_page_fault()
too though the fsr_info[] and ifsr_info[] arrays in
arch/arm/mm/fsr-2level.c.

The arm64 case looks like it might have similar issues, but while I'm
more familiar with arm than I _used_ to be, I do not know the
low-level exception handling code at all, so I'm just adding Russell,
Catalin and Will to the participants.

Catalin, Will - the arm64 case uses

        if (is_ttbr0_addr(addr))
                return do_page_fault(far, esr, regs);

instead, but like the 32-bit code that is only triggered for
do_translation_fault().  That may all be ok, because the other cases
seem to be "there is a TLB entry, but we lack privileges", so maybe
will never trigger for a kernel access to a kernel area because they
either do not exist, or we have permissions?

Anyway, possibly a few of those 'do_page_fault' entries should be
'do_translation_fault'? It certainly seems that way at least on 32-bit
arm.

Over to more competent people. Russell?

              Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 20:02             ` Al Viro
@ 2025-11-26 22:25               ` david laight
  2025-11-26 23:51                 ` Al Viro
  2025-11-26 23:31               ` Russell King (Oracle)
  1 sibling, 1 reply; 59+ messages in thread
From: david laight @ 2025-11-26 22:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Russell King (Oracle),
	Xie Yuanbin, brauner, jack, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, 26 Nov 2025 20:02:21 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Wed, Nov 26, 2025 at 07:51:54PM +0000, Russell King (Oracle) wrote:
> 
> > I don't understand how that helps. Wasn't the report that the filename
> > crosses a page boundary in userspace, but the following page is
> > inaccessible which causes a fault to be taken (as it always would do).
> > Thus, wouldn't "addr" be a userspace address (that the kernel is
> > accessing) and thus be below TASK_SIZE ?
> > 
> > I'm also confused - if we can't take a fault and handle it while
> > reading the filename from userspace, how are pages that have been
> > swapped out or evicted from the page cache read back in from storage
> > which invariably results in sleeping - which we can't do here because
> > of the RCU context (not that I've ever understood RCU, which is why
> > I've always referred those bugs to Paul.)  
> 
> No, the filename is already copied in kernel space *and* it's long enough
> to end right next to the end of page.  There's NUL before the end of page,
> at that, with '/' a couple of bytes prior.  We attempt to save on memory
> accesses, doing word-by-word fetches, starting from the beginning of
> component.  We *will* detect NUL and ignore all subsequent bytes; the
> problem is that the last 3 bytes of page might be '/', 'x' and '\0'.
> We call load_unaligned_zeropad() on page + PAGE_SIZE - 2.  And get
> a fetch that spans the end of page.
> 
> We don't care what's in the next page, if there is one mapped there
> to start with.  If there's nothing mapped, we want zeroes read from
> it, but all we really care about is having the bytes within *our*
> page read correctly - and no oops happening, obviously.
> 
> That fault is an extremely cold case on a fairly hot path.  We don't
> want to mess with disabling pagefaults, etc. - not for the sake
> of that.
> 

Can you fix it with a flag on the exception table entry that means
'don't try to fault in a page'?

I think the logic would be the same as 'disabling pagefaults', just
checking a different flag.
After all the fault itself happens in both cases.

	David


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 20:02             ` Al Viro
  2025-11-26 22:25               ` david laight
@ 2025-11-26 23:31               ` Russell King (Oracle)
  2025-11-27  3:03                 ` Xie Yuanbin
  1 sibling, 1 reply; 59+ messages in thread
From: Russell King (Oracle) @ 2025-11-26 23:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Xie Yuanbin, brauner, jack, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 08:02:21PM +0000, Al Viro wrote:
> On Wed, Nov 26, 2025 at 07:51:54PM +0000, Russell King (Oracle) wrote:
> 
> > I don't understand how that helps. Wasn't the report that the filename
> > crosses a page boundary in userspace, but the following page is
> > inaccessible which causes a fault to be taken (as it always would do).
> > Thus, wouldn't "addr" be a userspace address (that the kernel is
> > accessing) and thus be below TASK_SIZE ?
> > 
> > I'm also confused - if we can't take a fault and handle it while
> > reading the filename from userspace, how are pages that have been
> > swapped out or evicted from the page cache read back in from storage
> > which invariably results in sleeping - which we can't do here because
> > of the RCU context (not that I've ever understood RCU, which is why
> > I've always referred those bugs to Paul.)
> 
> No, the filename is already copied in kernel space *and* it's long enough
> to end right next to the end of page.  There's NUL before the end of page,
> at that, with '/' a couple of bytes prior.  We attempt to save on memory
> accesses, doing word-by-word fetches, starting from the beginning of
> component.  We *will* detect NUL and ignore all subsequent bytes; the
> problem is that the last 3 bytes of page might be '/', 'x' and '\0'.
> We call load_unaligned_zeropad() on page + PAGE_SIZE - 2.  And get
> a fetch that spans the end of page.
> 
> We don't care what's in the next page, if there is one mapped there
> to start with.  If there's nothing mapped, we want zeroes read from
> it, but all we really care about is having the bytes within *our*
> page read correctly - and no oops happening, obviously.
> 
> That fault is an extremely cold case on a fairly hot path.  We don't
> want to mess with disabling pagefaults, etc. - not for the sake
> of that.

I think, looking at the x86 handling, 32-bit ARM has missed a heck of
a lot of changes to the fault handling code, going all the way back to
pre-git history.

I seem to remember that I had updated it to match i386's implementation
at one point in the distant past, which is essentially what we have
today with a few tweaks. As code ages, it gets more difficult to
justify wholesale rewrites to bring it back up.

Relevant to this, looking at i386, that at some point added:

+       /*
+        * We fault-in kernel-space virtual memory on-demand. The
+        * 'reference' page table is init_mm.pgd.
+        *
+        * NOTE! We MUST NOT take any locks for this case. We may
+        * be in an interrupt or a critical region, and should
+        * only copy the information from the master page table,
+        * nothing more.
+        *
+        * This verifies that the fault happens in kernel space
+        * (error_code & 4) == 0, and that the fault was not a
+        * protection error (error_code & 1) == 0.
+        */
+       if (unlikely(address >= TASK_SIZE)) {
+               if (!(error_code & 5))
+                       goto vmalloc_fault;
+               /*
+                * Don't take the mm semaphore here. If we fixup a prefetch
+                * fault we could otherwise deadlock.
+                */
+               goto bad_area_nosemaphore;
+       }

which is after notify_die() and the test to see whether we need a
local_irq_enable(). This means we go straight to the fixing up etc
for these addresses.

In today's kernel, this has morphed into:

        /* Was the fault on kernel-controlled part of the address space? */
        if (unlikely(fault_in_kernel_space(address))) {
                do_kern_addr_fault(regs, error_code, address);
        } else {
                do_user_addr_fault(regs, error_code, address);

meaning any page fault for a kernel space address is handled entirely
separately from the normal page fault handling, and it looks like
this is entirely sensible.

Interestingly, however, I notice that x86 appears to no longer call
notify_die(DIE_PAGE_FAULT) in its page fault handling path, and I
wonder whether that's a regression on x86.

Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
suggestion is probably the easiest solution. However, whether it has
side effects, I couldn't say - the 32-bit ARM fault code has been
modified by quite a few people in ways I don't yet understand, so I
can't be certain at the moment whether it would cause problems.

I think the only thing to do is to try the solution and see what
breaks. I'm not in a position to be able to do that as, having not
had reason to touch 32-bit ARM for years, I don't have a hackable
platform nearby. Maybe Xie Yuanbin can test it?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 22:25               ` david laight
@ 2025-11-26 23:51                 ` Al Viro
  0 siblings, 0 replies; 59+ messages in thread
From: Al Viro @ 2025-11-26 23:51 UTC (permalink / raw)
  To: david laight
  Cc: Russell King (Oracle),
	Xie Yuanbin, brauner, jack, will, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1

On Wed, Nov 26, 2025 at 10:25:05PM +0000, david laight wrote:

> Can you fix it with a flag on the exception table entry that means
> 'don't try to fault in a page'?
> 
> I think the logic would be the same as 'disabling pagefaults', just
> checking a different flag.
> After all the fault itself happens in both cases.

The problem is getting to the point where you search the exception table
without blocking.

x86 #PF had been done that way from well before the point when
load_unaligned_zeropad() had been introduced, so everything worked
there from the very beginning.

arm and arm64, OTOH, were different - there had been logics for
"if trylock fails, check if we are in kernel space and have no
matching exception table entry; bugger off if so, otherwise we
are safe to grab mmap_sem - it's something like get_user() and
we *want* mmap_sem there", but it did exactly the wrong thing for
this case.

The only thing that prevented serious breakage from the very beginning
was that these faults are very rare - and hard to arrange without
KFENCE.  So it didn't blow up.  In 2017 arm64 side of problem had
been spotted and (hopefully) fixed.  arm counterpart stayed unnoticed
(perhaps for the lack of good reproducer) until now.

Most of the faults are from userland code, obviously, so we don't
want to search through the exception table on the common path.
So hanging that on a flag in exception table entry is not a good
idea - we need a cheaper predicate checked first.

x86 starts with separating the fault on kernel address from that on
userland; we are not going anywhere near mmap_sem (and VMAs in general)
in the former case and that's where load_unaligned_zeropad() faults
end up.  arm64 fix consisted of using do_translation_fault() instead
of do_page_fault(), with the former falling back to the latter for
userland addresses and using do_bad_area() for kernel ones.
Assuming that the way it's hooked up covers everything, we should
be fine there.

One potential problem _might_ be with the next PTE present, but
write-only.  Note that it has to cope with symlink bodies as well
and those might come from page cache rather than kmem_cache_alloc().
I'm nowhere near being uptodate on arm64 virtual memory setup, though,
so take that with a cartload of salt...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-26 18:55 ` Al Viro
@ 2025-11-27  2:24   ` Zizhi Wo
  2025-11-29  3:37     ` Al Viro
  0 siblings, 1 reply; 59+ messages in thread
From: Zizhi Wo @ 2025-11-27  2:24 UTC (permalink / raw)
  To: Al Viro, Zizhi Wo, torvalds
  Cc: jack, brauner, hch, akpm, linux, linux-fsdevel, linux-kernel,
	linux-mm, linux-arm-kernel, yangerkun, wangkefeng.wang,
	pangliyuan1, xieyuanbin1

在 2025/11/27 2:55, Al Viro 写道:
> On Wed, Nov 26, 2025 at 05:05:05PM +0800, Zizhi Wo wrote:
> 
>> under an RCU read-side critical section. In linux-mainline, arm/arm64
>> do_page_fault() still has this problem:
>>
>> lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.
> 
> 	arm64 shouldn't hit do_page_fault() in the first place, and
> do_translation_fault() there will see that address is beyond TASK_SIZE
> and go straight to do_bad_area() -> __do_kernel_fault() -> fixup_exception(),
> with no messing with mmap_lock.
> 
> 	Can anybody confirm that problem exists on arm64 (ideally - with
> reproducer)?
> 

Thank you all for the replies.

We did reproduce the issue on arm, and I mistakenly assumed the same
problem existed on arm64 after looking at the do_page_fault() code.
However, I just confirmed using the test program that, as everyone
pointed out, it goes through do_translation_fault() and reaches
do_bad_area() -> __do_kernel_fault(). So indeed, the issue does not
exist on arm64 — that was my oversight...

That said, I’d like to ask a follow-up question:

Why does x86 have special handling in do_kern_addr_fault(), including
logic for vmalloc faults? For example, on CONFIG_X86_32, it still takes
the vmalloc_fault path. As noted in the x86 comments, "We can fault-in 
kernel-space virtual memory on-demand"...

But on arm64, I don’t see similar logic — is there a specific reason
for this difference? Maybe x86's vmalloc area is mapped lazily, while
ARM maps it fully during early boot?

Thanks,
Zizhi Wo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 23:31               ` Russell King (Oracle)
@ 2025-11-27  3:03                 ` Xie Yuanbin
  2025-11-27  7:20                   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 59+ messages in thread
From: Xie Yuanbin @ 2025-11-27  3:03 UTC (permalink / raw)
  To: viro, linux, will, david.laight, rmk+kernel
  Cc: brauner, jack, nico, akpm, hch, jack, wozizhi, linux-fsdevel,
	linux-kernel, linux-arm-kernel, linux-mm, catalin.marinas, rppt,
	vbabka, pfalcato, lorenzo.stoakes, kuninori.morimoto.gx, tony,
	arnd, bigeasy, punitagrawal, rjw, marc.zyngier, lilinjie8,
	liaohua4, wangkefeng.wang, pangliyuan1, Xie Yuanbin

On, Wed, 26 Nov 2025 19:26:40 +0000, Al Viro wrote:
> For quick and dirty variant (on current tree), how about
> adding
> 	if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
> 		goto no_context;
>
> right after
>
> 	if (!ttbr0_usermode_access_allowed(regs))
> 		goto no_context;
>
> in do_page_fault() there?

On, Wed, 26 Nov 2025 23:31:00 +0000, Russell King (Oracle) wrote:
> Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
> suggestion is probably the easiest solution. However, whether it has
> side effects, I couldn't say - the 32-bit ARM fault code has been
> modified by quite a few people in ways I don't yet understand, so I
> can't be certain at the moment whether it would cause problems.

I think I've already submitted a very similar patch, to fix another bug:
On Thu, 16 Oct 2025 20:16:21 +0800, Xie Yuanbin wrote:
> +#ifdef CONFIG_HARDEN_BRANCH_PREDICTOR
> +	if (unlikely(addr > TASK_SIZE) && user_mode(regs)) {
> +		fault = 0;
> +		code = SEGV_MAPERR;
> +		goto bad_area;
> +	}
> +#endif
Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com

However, the patch seems to have received no response for a very long
time.

On Wed, 26 Nov 2025 23:31:00 +0000, Russell King wrote:
> I think the only thing to do is to try the solution and see what
> breaks. I'm not in a position to be able to do that as, having not
> had reason to touch 32-bit ARM for years, I don't have a hackable
> platform nearby. Maybe Xie Yuanbin can test it?

With pleasure.
By the way, for the config and test case shown in this patch:
vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
Link: https://lore.kernel.org/20251126101952.174467-1-xieyuanbin1@huawei.com
the warning can be reproduced directly on QEMU.

Xie Yuanbin


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-27  3:03                 ` Xie Yuanbin
@ 2025-11-27  7:20                   ` Sebastian Andrzej Siewior
  2025-11-27 11:20                     ` Xie Yuanbin
  0 siblings, 1 reply; 59+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-27  7:20 UTC (permalink / raw)
  To: Xie Yuanbin
  Cc: viro, linux, will, david.laight, rmk+kernel, brauner, jack, nico,
	akpm, hch, jack, wozizhi, linux-fsdevel, linux-kernel,
	linux-arm-kernel, linux-mm, catalin.marinas, rppt, vbabka,
	pfalcato, lorenzo.stoakes, kuninori.morimoto.gx, tony, arnd,
	punitagrawal, rjw, marc.zyngier, lilinjie8, liaohua4,
	wangkefeng.wang, pangliyuan1

On 2025-11-27 11:03:16 [+0800], Xie Yuanbin wrote:
> On, Wed, 26 Nov 2025 19:26:40 +0000, Al Viro wrote:
> > For quick and dirty variant (on current tree), how about
> > adding
> > 	if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
> > 		goto no_context;
> >
> > right after
> >
> > 	if (!ttbr0_usermode_access_allowed(regs))
> > 		goto no_context;
> >
> > in do_page_fault() there?
> 
> On, Wed, 26 Nov 2025 23:31:00 +0000, Russell King (Oracle) wrote:
> > Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
> > suggestion is probably the easiest solution. However, whether it has
> > side effects, I couldn't say - the 32-bit ARM fault code has been
> > modified by quite a few people in ways I don't yet understand, so I
> > can't be certain at the moment whether it would cause problems.
> 
> I think I've already submitted a very similar patch, to fix another bug:
> On Thu, 16 Oct 2025 20:16:21 +0800, Xie Yuanbin wrote:
> > +#ifdef CONFIG_HARDEN_BRANCH_PREDICTOR
> > +	if (unlikely(addr > TASK_SIZE) && user_mode(regs)) {
> > +		fault = 0;
> > +		code = SEGV_MAPERR;
> > +		goto bad_area;
> > +	}
> > +#endif
> Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com
> 
> However, the patch seems to have received no response for a very long
> time.

This all should be covered by the series here
	https://lore.kernel.org/all/20251110145555.2555055-1-bigeasy@linutronix.de/

or do I miss something.

Sebastian


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-26 21:12   ` Linus Torvalds
@ 2025-11-27 10:27     ` Will Deacon
  2025-11-27 10:57     ` Russell King (Oracle)
  1 sibling, 0 replies; 59+ messages in thread
From: Will Deacon @ 2025-11-27 10:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zizhi Wo, Russell King, Catalin Marinas, jack, brauner, hch,
	akpm, linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

On Wed, Nov 26, 2025 at 01:12:38PM -0800, Linus Torvalds wrote:
> On Wed, 26 Nov 2025 at 02:27, Zizhi Wo <wozizhi@huaweicloud.com> wrote:
> >
> > 在 2025/11/26 17:05, Zizhi Wo 写道:
> > > We're running into the following issue on an ARM32 platform with the linux
> > > 5.10 kernel:
> > >
> > > During the execution of hash_name()->load_unaligned_zeropad(), a potential
> > > memory access beyond the PAGE boundary may occur.
> 
> That is correct.
> 
> However:
> 
> > >                This triggers a page fault,
> > > which leads to a call to do_page_fault()->mmap_read_trylock().
> 
> That should *not* happen.  For kernel addresses, mmap_read_trylock()
> should never trigger, much less the full mmap_read_lock().
> 
> See for example the x86 fault handling in  handle_page_fault():
> 
>         if (unlikely(fault_in_kernel_space(address))) {
>                 do_kern_addr_fault(regs, error_code, address);
> 
> and the kernel address case never triggers the mmap lock, because
> while faults on kernel addresses can happen for various reasons, they
> are never memory mappings.
> 
> I'm seeing similar logic in the arm tree, although the check is
> different. do_translation_fault() checks for TASK_SIZE.
> 
>         if (addr < TASK_SIZE)
>                 return do_page_fault(addr, fsr, regs);
> 
> but it appears that there are paths to do_page_fault() that do not
> have this check, ie that do_DataAbort() function does
> 
>         if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
>                 return;
> 
> 
> and It's not immediately obvious, but that can call do_page_fault()
> too though the fsr_info[] and ifsr_info[] arrays in
> arch/arm/mm/fsr-2level.c.
> 
> The arm64 case looks like it might have similar issues, but while I'm
> more familiar with arm than I _used_ to be, I do not know the
> low-level exception handling code at all, so I'm just adding Russell,
> Catalin and Will to the participants.
> 
> Catalin, Will - the arm64 case uses
> 
>         if (is_ttbr0_addr(addr))
>                 return do_page_fault(far, esr, regs);
> 
> instead, but like the 32-bit code that is only triggered for
> do_translation_fault().  That may all be ok, because the other cases
> seem to be "there is a TLB entry, but we lack privileges", so maybe
> will never trigger for a kernel access to a kernel area because they
> either do not exist, or we have permissions?

Right, I think the access flag / permission fault case will end up
trying to resolve the VMA for a kernel address but I can't think why
we'd ever run into one of those faults for load_unaligned_zeropad().

Valid kernel mappings are always young (AF set) and, although we can
muck around with permissions, valid mappings are always readable.

Will


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-26 21:12   ` Linus Torvalds
  2025-11-27 10:27     ` Will Deacon
@ 2025-11-27 10:57     ` Russell King (Oracle)
  2025-11-28 17:06       ` Linus Torvalds
  1 sibling, 1 reply; 59+ messages in thread
From: Russell King (Oracle) @ 2025-11-27 10:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zizhi Wo, Catalin Marinas, Will Deacon, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

On Wed, Nov 26, 2025 at 01:12:38PM -0800, Linus Torvalds wrote:
> On Wed, 26 Nov 2025 at 02:27, Zizhi Wo <wozizhi@huaweicloud.com> wrote:
> >
> > 在 2025/11/26 17:05, Zizhi Wo 写道:
> > > We're running into the following issue on an ARM32 platform with the linux
> > > 5.10 kernel:
> > >
> > > During the execution of hash_name()->load_unaligned_zeropad(), a potential
> > > memory access beyond the PAGE boundary may occur.
> 
> That is correct.
> 
> However:
> 
> > >                This triggers a page fault,
> > > which leads to a call to do_page_fault()->mmap_read_trylock().
> 
> That should *not* happen.  For kernel addresses, mmap_read_trylock()
> should never trigger, much less the full mmap_read_lock().
> 
> See for example the x86 fault handling in  handle_page_fault():
> 
>         if (unlikely(fault_in_kernel_space(address))) {
>                 do_kern_addr_fault(regs, error_code, address);
> 
> and the kernel address case never triggers the mmap lock, because
> while faults on kernel addresses can happen for various reasons, they
> are never memory mappings.
> 
> I'm seeing similar logic in the arm tree, although the check is
> different. do_translation_fault() checks for TASK_SIZE.
> 
>         if (addr < TASK_SIZE)
>                 return do_page_fault(addr, fsr, regs);
> 
> but it appears that there are paths to do_page_fault() that do not
> have this check, ie that do_DataAbort() function does
> 
>         if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
>                 return;
> 
> 
> and It's not immediately obvious, but that can call do_page_fault()
> too though the fsr_info[] and ifsr_info[] arrays in
> arch/arm/mm/fsr-2level.c.
> 
> The arm64 case looks like it might have similar issues, but while I'm
> more familiar with arm than I _used_ to be, I do not know the
> low-level exception handling code at all, so I'm just adding Russell,
> Catalin and Will to the participants.
> 
> Catalin, Will - the arm64 case uses
> 
>         if (is_ttbr0_addr(addr))
>                 return do_page_fault(far, esr, regs);
> 
> instead, but like the 32-bit code that is only triggered for
> do_translation_fault().  That may all be ok, because the other cases
> seem to be "there is a TLB entry, but we lack privileges", so maybe
> will never trigger for a kernel access to a kernel area because they
> either do not exist, or we have permissions?
> 
> Anyway, possibly a few of those 'do_page_fault' entries should be
> 'do_translation_fault'? It certainly seems that way at least on 32-bit
> arm.
> 
> Over to more competent people. Russell?

Ha!

As said elsewhere, it looks like 32-bit ARM has been missing updates to
the fault handler since pre-git history - this was modelled in the dim
and distant i386 handling, and it just hasn't kept up.

I'm debating whether an entire rewrite would be appropriate, but I'm in
no position to do that at the moment for several reasons:

1. I've now very little knowledge of the Linux MM, there's been many
   changes over the last decade that I'm not aware of, and my knowledge
   of modern things like RCU, kfence, etc is practically zero.

2. I don't have a 32-bit ARM platform to hand to test on.

3. I've not touched these parts of 32-bit ARM for a very long time, so
   my knowledge there has severely bitrotted - E.g. I need to review the
   FSR codes and what they mean, because that knowledge has now
   evaporated as I've not had to use it for getting on for two decades.

4. The arm32 code has been modified by others in ways I don't yet
   understand.

I'm wondering whether Al's solution would be a reasonable stop-gap, but
I can't say whether it would have any side effects, so I've asked for it
to be tested so we get some idea whether it's a possible solution.

Basically, I'm afraid it's going to be a steep learning curve, and thus
won't be a quick exercise - expect it to take a month or more as there
is Christmas, and then I likely have medical stuff at the beginning of
next year.

The reason I'm suggesting a rewrite to something closer to x86 is that
we then have a familiar code pattern that's much more likely to be
correct going forward for the Linux MM requirements, which should also
make it easier for MM folk to understand.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-27  7:20                   ` Sebastian Andrzej Siewior
@ 2025-11-27 11:20                     ` Xie Yuanbin
  0 siblings, 0 replies; 59+ messages in thread
From: Xie Yuanbin @ 2025-11-27 11:20 UTC (permalink / raw)
  To: bigeasy, viro, linux, will, david.laight, rmk+kernel
  Cc: brauner, jack, nico, akpm, hch, jack, wozizhi, linux-fsdevel,
	linux-kernel, linux-arm-kernel, linux-mm, catalin.marinas, rppt,
	vbabka, pfalcato, lorenzo.stoakes, kuninori.morimoto.gx, tony,
	arnd, punitagrawal, rjw, marc.zyngier, lilinjie8, liaohua4,
	wangkefeng.wang, pangliyuan1, Xie Yuanbin

On, Thu, 27 Nov 2025 08:20:57 +0100, Sebastian Andrzej Siewior wrote:
> This all should be covered by the series here
> 	https://lore.kernel.org/all/20251110145555.2555055-1-bigeasy@linutronix.de/

Yes, I know it.

> or do I miss something.

We had some discussions about this bug:
Link: https://lore.kernel.org/lkml/20251126090505.3057219-1-wozizhi@huaweicloud.com/

The discussions:
Link: https://lore.kernel.org/CAHk-=wh1Wfwt9OFB4AfBbjyeu4JVZuSWQ4A8OoT3W6x9btddfw@mail.gmail.com
Link: https://lore.kernel.org/20251126192640.GD3538@ZenIV
Link: https://lore.kernel.org/aSeNtFxD1WRjFaiR@shell.armlinux.org.uk

According to the discussion, in do_page_fault(), when addr >= TASK_SIZE,
we should not try to acquire the mm read lock or find vma. Instead, we
should directly call __do_kernel_fault() or __do_user_fault(). Your
submission just moved harden_branch_predictor() forward. I think we can
have more discussions about the patches to fix the missing spectre.

I am trying to write a new patch, I hope it will better handle these two
bugs and be compatible with PREEMPT_RT scenarios.

> Sebastian

Thanks!

Xie Yuanbin


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-26  9:05 [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
                   ` (2 preceding siblings ...)
  2025-11-26 18:55 ` Al Viro
@ 2025-11-27 12:59 ` Will Deacon
  2025-11-28  1:17   ` Zizhi Wo
  3 siblings, 1 reply; 59+ messages in thread
From: Will Deacon @ 2025-11-27 12:59 UTC (permalink / raw)
  To: Zizhi Wo
  Cc: jack, brauner, hch, akpm, linux, linux-fsdevel, linux-kernel,
	linux-mm, linux-arm-kernel, wozizhi, yangerkun, wangkefeng.wang,
	pangliyuan1, xieyuanbin1

On Wed, Nov 26, 2025 at 05:05:05PM +0800, Zizhi Wo wrote:
> We're running into the following issue on an ARM32 platform with the linux
> 5.10 kernel:
> 
> [<c0300b78>] (__dabt_svc) from [<c0529cb8>] (link_path_walk.part.7+0x108/0x45c)
> [<c0529cb8>] (link_path_walk.part.7) from [<c052a948>] (path_openat+0xc4/0x10ec)
> [<c052a948>] (path_openat) from [<c052cf90>] (do_filp_open+0x9c/0x114)
> [<c052cf90>] (do_filp_open) from [<c0511e4c>] (do_sys_openat2+0x418/0x528)
> [<c0511e4c>] (do_sys_openat2) from [<c0513d98>] (do_sys_open+0x88/0xe4)
> [<c0513d98>] (do_sys_open) from [<c03000c0>] (ret_fast_syscall+0x0/0x58)
> ...
> [<c0315e34>] (unwind_backtrace) from [<c030f2b0>] (show_stack+0x20/0x24)
> [<c030f2b0>] (show_stack) from [<c14239f4>] (dump_stack+0xd8/0xf8)
> [<c14239f4>] (dump_stack) from [<c038d188>] (___might_sleep+0x19c/0x1e4)
> [<c038d188>] (___might_sleep) from [<c031b6fc>] (do_page_fault+0x2f8/0x51c)
> [<c031b6fc>] (do_page_fault) from [<c031bb44>] (do_DataAbort+0x90/0x118)
> [<c031bb44>] (do_DataAbort) from [<c0300b78>] (__dabt_svc+0x58/0x80)
> ...
> 
> During the execution of hash_name()->load_unaligned_zeropad(), a potential
> memory access beyond the PAGE boundary may occur. For example, when the
> filename length is near the PAGE_SIZE boundary. This triggers a page fault,
> which leads to a call to do_page_fault()->mmap_read_trylock(). If we can't
> acquire the lock, we have to fall back to the mmap_read_lock() path, which
> calls might_sleep(). This breaks RCU semantics because path lookup occurs
> under an RCU read-side critical section. In linux-mainline, arm/arm64
> do_page_fault() still has this problem:
> 
> lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.
> 
> And before commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for
> dcache name comparison and hashing"), hash_name accessed the name byte by
> byte.
> 
> To prevent load_unaligned_zeropad() from accessing beyond the valid memory
> region, we would need to intercept such cases beforehand? But doing so
> would require replicating the internal logic of load_unaligned_zeropad(),
> including handling endianness and constructing the correct value manually.
> Given that load_unaligned_zeropad() is used in many places across the
> kernel, we currently haven't found a good solution to address this cleanly.
> 
> What would be the recommended way to handle this situation? Would
> appreciate any feedback and guidance from the community. Thanks!

Does it help if you bodge the translation fault handler along the lines
of the untested diff below?

Will

--->8

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index bf1577216ffa..b3c81e448798 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -407,7 +407,7 @@ do_translation_fault(unsigned long addr, unsigned int fsr,
        if (addr < TASK_SIZE)
                return do_page_fault(addr, fsr, regs);
 
-       if (user_mode(regs))
+       if (user_mode(regs) || fsr_fs(fsr) == FSR_FS_INVALID_PAGE)
                goto bad_area;
 
        index = pgd_index(addr);
diff --git a/arch/arm/mm/fault.h b/arch/arm/mm/fault.h
index 9ecc2097a87a..8fb26f85e361 100644
--- a/arch/arm/mm/fault.h
+++ b/arch/arm/mm/fault.h
@@ -12,6 +12,8 @@
 #define FSR_FS3_0              (15)
 #define FSR_FS5_0              (0x3f)
 
+#define FSR_FS_INVALID_PAGE    7
+
 #ifdef CONFIG_ARM_LPAE
 #define FSR_FS_AEA             17
 
diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
index f2be95197265..c7060da345df 100644
--- a/arch/arm/mm/fsr-2level.c
+++ b/arch/arm/mm/fsr-2level.c
@@ -11,7 +11,7 @@ static struct fsr_info fsr_info[] = {
        { do_bad,               SIGBUS,  0,             "external abort on linefetch"      },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "section translation fault"        },
        { do_bad,               SIGBUS,  0,             "external abort on linefetch"      },
-       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "page translation fault"           },
+       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "page translation fault"           },
        { do_bad,               SIGBUS,  0,             "external abort on non-linefetch"  },
        { do_bad,               SIGSEGV, SEGV_ACCERR,   "section domain fault"             },
        { do_bad,               SIGBUS,  0,             "external abort on non-linefetch"  },
diff --git a/arch/arm/mm/fsr-3level.c b/arch/arm/mm/fsr-3level.c
index d0ae2963656a..19df4af828bd 100644
--- a/arch/arm/mm/fsr-3level.c
+++ b/arch/arm/mm/fsr-3level.c
@@ -7,7 +7,7 @@ static struct fsr_info fsr_info[] = {
        { do_bad,               SIGBUS,  0,             "reserved translation fault"    },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 1 translation fault"     },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 2 translation fault"     },
-       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "level 3 translation fault"     },
+       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 3 translation fault"     },
        { do_bad,               SIGBUS,  0,             "reserved access flag fault"    },
        { do_bad,               SIGSEGV, SEGV_ACCERR,   "level 1 access flag fault"     },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 2 access flag fault"     },



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-27 12:59 ` Will Deacon
@ 2025-11-28  1:17   ` Zizhi Wo
  2025-11-28  1:18     ` Zizhi Wo
  0 siblings, 1 reply; 59+ messages in thread
From: Zizhi Wo @ 2025-11-28  1:17 UTC (permalink / raw)
  To: Will Deacon, Zizhi Wo
  Cc: jack, brauner, hch, akpm, linux, linux-fsdevel, linux-kernel,
	linux-mm, linux-arm-kernel, yangerkun, wangkefeng.wang,
	pangliyuan1, xieyuanbin1



在 2025/11/27 20:59, Will Deacon 写道:
> On Wed, Nov 26, 2025 at 05:05:05PM +0800, Zizhi Wo wrote:
>> We're running into the following issue on an ARM32 platform with the linux
>> 5.10 kernel:
>>
>> [<c0300b78>] (__dabt_svc) from [<c0529cb8>] (link_path_walk.part.7+0x108/0x45c)
>> [<c0529cb8>] (link_path_walk.part.7) from [<c052a948>] (path_openat+0xc4/0x10ec)
>> [<c052a948>] (path_openat) from [<c052cf90>] (do_filp_open+0x9c/0x114)
>> [<c052cf90>] (do_filp_open) from [<c0511e4c>] (do_sys_openat2+0x418/0x528)
>> [<c0511e4c>] (do_sys_openat2) from [<c0513d98>] (do_sys_open+0x88/0xe4)
>> [<c0513d98>] (do_sys_open) from [<c03000c0>] (ret_fast_syscall+0x0/0x58)
>> ...
>> [<c0315e34>] (unwind_backtrace) from [<c030f2b0>] (show_stack+0x20/0x24)
>> [<c030f2b0>] (show_stack) from [<c14239f4>] (dump_stack+0xd8/0xf8)
>> [<c14239f4>] (dump_stack) from [<c038d188>] (___might_sleep+0x19c/0x1e4)
>> [<c038d188>] (___might_sleep) from [<c031b6fc>] (do_page_fault+0x2f8/0x51c)
>> [<c031b6fc>] (do_page_fault) from [<c031bb44>] (do_DataAbort+0x90/0x118)
>> [<c031bb44>] (do_DataAbort) from [<c0300b78>] (__dabt_svc+0x58/0x80)
>> ...
>>
>> During the execution of hash_name()->load_unaligned_zeropad(), a potential
>> memory access beyond the PAGE boundary may occur. For example, when the
>> filename length is near the PAGE_SIZE boundary. This triggers a page fault,
>> which leads to a call to do_page_fault()->mmap_read_trylock(). If we can't
>> acquire the lock, we have to fall back to the mmap_read_lock() path, which
>> calls might_sleep(). This breaks RCU semantics because path lookup occurs
>> under an RCU read-side critical section. In linux-mainline, arm/arm64
>> do_page_fault() still has this problem:
>>
>> lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.
>>
>> And before commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for
>> dcache name comparison and hashing"), hash_name accessed the name byte by
>> byte.
>>
>> To prevent load_unaligned_zeropad() from accessing beyond the valid memory
>> region, we would need to intercept such cases beforehand? But doing so
>> would require replicating the internal logic of load_unaligned_zeropad(),
>> including handling endianness and constructing the correct value manually.
>> Given that load_unaligned_zeropad() is used in many places across the
>> kernel, we currently haven't found a good solution to address this cleanly.
>>
>> What would be the recommended way to handle this situation? Would
>> appreciate any feedback and guidance from the community. Thanks!
> 
> Does it help if you bodge the translation fault handler along the lines
> of the untested diff below?

Thank you for the solution you provided. However, I seem to have
encountered a bit of a problem.

> 
> Will
> 
> --->8
> 
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index bf1577216ffa..b3c81e448798 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -407,7 +407,7 @@ do_translation_fault(unsigned long addr, unsigned int fsr,
>          if (addr < TASK_SIZE)
>                  return do_page_fault(addr, fsr, regs);
>   
> -       if (user_mode(regs))
> +       if (user_mode(regs) || fsr_fs(fsr) == FSR_FS_INVALID_PAGE)
>                  goto bad_area;



I'm getting an "FSR_FS_INVALID_PAGE undeclared" error during
compilation...

In which kernel or FSR version was this macro or constant defined?

>   
>          index = pgd_index(addr);
> diff --git a/arch/arm/mm/fault.h b/arch/arm/mm/fault.h
> index 9ecc2097a87a..8fb26f85e361 100644
> --- a/arch/arm/mm/fault.h
> +++ b/arch/arm/mm/fault.h
> @@ -12,6 +12,8 @@
>   #define FSR_FS3_0              (15)
>   #define FSR_FS5_0              (0x3f)
>   
> +#define FSR_FS_INVALID_PAGE    7
> +
>   #ifdef CONFIG_ARM_LPAE
>   #define FSR_FS_AEA             17
>   
> diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
> index f2be95197265..c7060da345df 100644
> --- a/arch/arm/mm/fsr-2level.c
> +++ b/arch/arm/mm/fsr-2level.c
> @@ -11,7 +11,7 @@ static struct fsr_info fsr_info[] = {
>          { do_bad,               SIGBUS,  0,             "external abort on linefetch"      },
>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "section translation fault"        },
>          { do_bad,               SIGBUS,  0,             "external abort on linefetch"      },
> -       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "page translation fault"           },
> +       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "page translation fault"           },
>          { do_bad,               SIGBUS,  0,             "external abort on non-linefetch"  },
>          { do_bad,               SIGSEGV, SEGV_ACCERR,   "section domain fault"             },
>          { do_bad,               SIGBUS,  0,             "external abort on non-linefetch"  },
> diff --git a/arch/arm/mm/fsr-3level.c b/arch/arm/mm/fsr-3level.c
> index d0ae2963656a..19df4af828bd 100644
> --- a/arch/arm/mm/fsr-3level.c
> +++ b/arch/arm/mm/fsr-3level.c
> @@ -7,7 +7,7 @@ static struct fsr_info fsr_info[] = {
>          { do_bad,               SIGBUS,  0,             "reserved translation fault"    },
>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 1 translation fault"     },
>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 2 translation fault"     },
> -       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "level 3 translation fault"     },
> +       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 3 translation fault"     },
>          { do_bad,               SIGBUS,  0,             "reserved access flag fault"    },
>          { do_bad,               SIGSEGV, SEGV_ACCERR,   "level 1 access flag fault"     },
>          { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 2 access flag fault"     },
> 
> 

By the way, I tried Al's solution, and this problem didn't reproduce.

Thanks,
Zizhi Wo



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-28  1:17   ` Zizhi Wo
@ 2025-11-28  1:18     ` Zizhi Wo
  2025-11-28  1:39       ` Zizhi Wo
  0 siblings, 1 reply; 59+ messages in thread
From: Zizhi Wo @ 2025-11-28  1:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: jack, brauner, hch, akpm, linux, linux-fsdevel, linux-kernel,
	linux-mm, linux-arm-kernel, yangerkun, wangkefeng.wang,
	pangliyuan1, xieyuanbin1



在 2025/11/28 9:17, Zizhi Wo 写道:
> 
> 
> 在 2025/11/27 20:59, Will Deacon 写道:
>> On Wed, Nov 26, 2025 at 05:05:05PM +0800, Zizhi Wo wrote:
>>> We're running into the following issue on an ARM32 platform with the 
>>> linux
>>> 5.10 kernel:
>>>
>>> [<c0300b78>] (__dabt_svc) from [<c0529cb8>] 
>>> (link_path_walk.part.7+0x108/0x45c)
>>> [<c0529cb8>] (link_path_walk.part.7) from [<c052a948>] 
>>> (path_openat+0xc4/0x10ec)
>>> [<c052a948>] (path_openat) from [<c052cf90>] (do_filp_open+0x9c/0x114)
>>> [<c052cf90>] (do_filp_open) from [<c0511e4c>] 
>>> (do_sys_openat2+0x418/0x528)
>>> [<c0511e4c>] (do_sys_openat2) from [<c0513d98>] (do_sys_open+0x88/0xe4)
>>> [<c0513d98>] (do_sys_open) from [<c03000c0>] (ret_fast_syscall+0x0/0x58)
>>> ...
>>> [<c0315e34>] (unwind_backtrace) from [<c030f2b0>] (show_stack+0x20/0x24)
>>> [<c030f2b0>] (show_stack) from [<c14239f4>] (dump_stack+0xd8/0xf8)
>>> [<c14239f4>] (dump_stack) from [<c038d188>] (___might_sleep+0x19c/0x1e4)
>>> [<c038d188>] (___might_sleep) from [<c031b6fc>] 
>>> (do_page_fault+0x2f8/0x51c)
>>> [<c031b6fc>] (do_page_fault) from [<c031bb44>] (do_DataAbort+0x90/0x118)
>>> [<c031bb44>] (do_DataAbort) from [<c0300b78>] (__dabt_svc+0x58/0x80)
>>> ...
>>>
>>> During the execution of hash_name()->load_unaligned_zeropad(), a 
>>> potential
>>> memory access beyond the PAGE boundary may occur. For example, when the
>>> filename length is near the PAGE_SIZE boundary. This triggers a page 
>>> fault,
>>> which leads to a call to do_page_fault()->mmap_read_trylock(). If we 
>>> can't
>>> acquire the lock, we have to fall back to the mmap_read_lock() path, 
>>> which
>>> calls might_sleep(). This breaks RCU semantics because path lookup 
>>> occurs
>>> under an RCU read-side critical section. In linux-mainline, arm/arm64
>>> do_page_fault() still has this problem:
>>>
>>> lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.
>>>
>>> And before commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for
>>> dcache name comparison and hashing"), hash_name accessed the name 
>>> byte by
>>> byte.
>>>
>>> To prevent load_unaligned_zeropad() from accessing beyond the valid 
>>> memory
>>> region, we would need to intercept such cases beforehand? But doing so
>>> would require replicating the internal logic of 
>>> load_unaligned_zeropad(),
>>> including handling endianness and constructing the correct value 
>>> manually.
>>> Given that load_unaligned_zeropad() is used in many places across the
>>> kernel, we currently haven't found a good solution to address this 
>>> cleanly.
>>>
>>> What would be the recommended way to handle this situation? Would
>>> appreciate any feedback and guidance from the community. Thanks!
>>
>> Does it help if you bodge the translation fault handler along the lines
>> of the untested diff below?
> 
> Thank you for the solution you provided. However, I seem to have
> encountered a bit of a problem.
> 
>>
>> Will
>>
>> --->8
>>
>> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
>> index bf1577216ffa..b3c81e448798 100644
>> --- a/arch/arm/mm/fault.c
>> +++ b/arch/arm/mm/fault.c
>> @@ -407,7 +407,7 @@ do_translation_fault(unsigned long addr, unsigned 
>> int fsr,
>>          if (addr < TASK_SIZE)
>>                  return do_page_fault(addr, fsr, regs);
>> -       if (user_mode(regs))
>> +       if (user_mode(regs) || fsr_fs(fsr) == FSR_FS_INVALID_PAGE)
>>                  goto bad_area;
> 
> 
> 
> I'm getting an "FSR_FS_INVALID_PAGE undeclared" error during
> compilation...
> 
> In which kernel or FSR version was this macro or constant defined

Sorry, I didn't see this "#define FSR_FS_INVALID_PAGE". I'll try again
right away.

Please ignore my previous reply.

> 
>>          index = pgd_index(addr);
>> diff --git a/arch/arm/mm/fault.h b/arch/arm/mm/fault.h
>> index 9ecc2097a87a..8fb26f85e361 100644
>> --- a/arch/arm/mm/fault.h
>> +++ b/arch/arm/mm/fault.h
>> @@ -12,6 +12,8 @@
>>   #define FSR_FS3_0              (15)
>>   #define FSR_FS5_0              (0x3f)
>> +#define FSR_FS_INVALID_PAGE    7
>> +
>>   #ifdef CONFIG_ARM_LPAE
>>   #define FSR_FS_AEA             17
>> diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
>> index f2be95197265..c7060da345df 100644
>> --- a/arch/arm/mm/fsr-2level.c
>> +++ b/arch/arm/mm/fsr-2level.c
>> @@ -11,7 +11,7 @@ static struct fsr_info fsr_info[] = {
>>          { do_bad,               SIGBUS,  0,             "external 
>> abort on linefetch"      },
>>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "section 
>> translation fault"        },
>>          { do_bad,               SIGBUS,  0,             "external 
>> abort on linefetch"      },
>> -       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "page 
>> translation fault"           },
>> +       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "page 
>> translation fault"           },
>>          { do_bad,               SIGBUS,  0,             "external 
>> abort on non-linefetch"  },
>>          { do_bad,               SIGSEGV, SEGV_ACCERR,   "section 
>> domain fault"             },
>>          { do_bad,               SIGBUS,  0,             "external 
>> abort on non-linefetch"  },
>> diff --git a/arch/arm/mm/fsr-3level.c b/arch/arm/mm/fsr-3level.c
>> index d0ae2963656a..19df4af828bd 100644
>> --- a/arch/arm/mm/fsr-3level.c
>> +++ b/arch/arm/mm/fsr-3level.c
>> @@ -7,7 +7,7 @@ static struct fsr_info fsr_info[] = {
>>          { do_bad,               SIGBUS,  0,             "reserved 
>> translation fault"    },
>>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 1 
>> translation fault"     },
>>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 2 
>> translation fault"     },
>> -       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "level 3 
>> translation fault"     },
>> +       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 3 
>> translation fault"     },
>>          { do_bad,               SIGBUS,  0,             "reserved 
>> access flag fault"    },
>>          { do_bad,               SIGSEGV, SEGV_ACCERR,   "level 1 
>> access flag fault"     },
>>          { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 2 
>> access flag fault"     },
>>
>>
> 
> By the way, I tried Al's solution, and this problem didn't reproduce.
> 
> Thanks,
> Zizhi Wo



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
  2025-11-26 19:26         ` Al Viro
  2025-11-26 19:51           ` Russell King (Oracle)
@ 2025-11-28  1:39           ` Xie Yuanbin
  1 sibling, 0 replies; 59+ messages in thread
From: Xie Yuanbin @ 2025-11-28  1:39 UTC (permalink / raw)
  To: viro, linux, will, david.laight, rmk+kernel
  Cc: bigeasy, brauner, jack, nico, akpm, hch, jack, wozizhi,
	linux-fsdevel, linux-kernel, linux-arm-kernel, linux-mm,
	catalin.marinas, rppt, vbabka, pfalcato, lorenzo.stoakes,
	kuninori.morimoto.gx, tony, arnd, punitagrawal, rjw,
	marc.zyngier, lilinjie8, liaohua4, wangkefeng.wang, pangliyuan1,
	Xie Yuanbin

On Wed, 26 Nov 2025 19:26:40 +0000, Al Viro wrote:
> For quick and dirty variant (on current tree), how about
> adding
> 	if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
> 		goto no_context;
>
> right after
>
> 	if (!ttbr0_usermode_access_allowed(regs))
> 		goto no_context;
>
> in do_page_fault() there?
>
> NOTE: that might or might not break vdso; I don't think it would, but...

On Wed, 26 Nov 2025 23:31:00 +0000, Russell King wrote:
> Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
> suggestion is probably the easiest solution. However, whether it has
> side effects, I couldn't say - the 32-bit ARM fault code has been
> modified by quite a few people in ways I don't yet understand, so I
> can't be certain at the moment whether it would cause problems.
>
> I think the only thing to do is to try the solution and see what
> breaks. I'm not in a position to be able to do that as, having not
> had reason to touch 32-bit ARM for years, I don't have a hackable
> platform nearby. Maybe Xie Yuanbin can test it?

Hi, Al Viro and Russell King!

I moved the judgment forward to before local_irq_enable() and submitted
a new patch:
Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com

This is because there's another bug I reported before that also requires
a similar judgment, but before the interrupt is enabled.
Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com

I hope this can fix both of these bugs.

It is closer to the x86's implementation and works well in current tests.
Could you please take a look? Thanks you very much!

Xie Yuanbin


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-28  1:18     ` Zizhi Wo
@ 2025-11-28  1:39       ` Zizhi Wo
  2025-11-28 12:25         ` Will Deacon
  0 siblings, 1 reply; 59+ messages in thread
From: Zizhi Wo @ 2025-11-28  1:39 UTC (permalink / raw)
  To: Will Deacon
  Cc: jack, brauner, hch, akpm, linux, linux-fsdevel, linux-kernel,
	linux-mm, linux-arm-kernel, yangerkun, wangkefeng.wang,
	pangliyuan1, xieyuanbin1



在 2025/11/28 9:18, Zizhi Wo 写道:
> 
> 
> 在 2025/11/28 9:17, Zizhi Wo 写道:
>>
>>
>> 在 2025/11/27 20:59, Will Deacon 写道:
>>> On Wed, Nov 26, 2025 at 05:05:05PM +0800, Zizhi Wo wrote:
>>>> We're running into the following issue on an ARM32 platform with the 
>>>> linux
>>>> 5.10 kernel:
>>>>
>>>> [<c0300b78>] (__dabt_svc) from [<c0529cb8>] 
>>>> (link_path_walk.part.7+0x108/0x45c)
>>>> [<c0529cb8>] (link_path_walk.part.7) from [<c052a948>] 
>>>> (path_openat+0xc4/0x10ec)
>>>> [<c052a948>] (path_openat) from [<c052cf90>] (do_filp_open+0x9c/0x114)
>>>> [<c052cf90>] (do_filp_open) from [<c0511e4c>] 
>>>> (do_sys_openat2+0x418/0x528)
>>>> [<c0511e4c>] (do_sys_openat2) from [<c0513d98>] (do_sys_open+0x88/0xe4)
>>>> [<c0513d98>] (do_sys_open) from [<c03000c0>] 
>>>> (ret_fast_syscall+0x0/0x58)
>>>> ...
>>>> [<c0315e34>] (unwind_backtrace) from [<c030f2b0>] 
>>>> (show_stack+0x20/0x24)
>>>> [<c030f2b0>] (show_stack) from [<c14239f4>] (dump_stack+0xd8/0xf8)
>>>> [<c14239f4>] (dump_stack) from [<c038d188>] 
>>>> (___might_sleep+0x19c/0x1e4)
>>>> [<c038d188>] (___might_sleep) from [<c031b6fc>] 
>>>> (do_page_fault+0x2f8/0x51c)
>>>> [<c031b6fc>] (do_page_fault) from [<c031bb44>] 
>>>> (do_DataAbort+0x90/0x118)
>>>> [<c031bb44>] (do_DataAbort) from [<c0300b78>] (__dabt_svc+0x58/0x80)
>>>> ...
>>>>
>>>> During the execution of hash_name()->load_unaligned_zeropad(), a 
>>>> potential
>>>> memory access beyond the PAGE boundary may occur. For example, when the
>>>> filename length is near the PAGE_SIZE boundary. This triggers a page 
>>>> fault,
>>>> which leads to a call to do_page_fault()->mmap_read_trylock(). If we 
>>>> can't
>>>> acquire the lock, we have to fall back to the mmap_read_lock() path, 
>>>> which
>>>> calls might_sleep(). This breaks RCU semantics because path lookup 
>>>> occurs
>>>> under an RCU read-side critical section. In linux-mainline, arm/arm64
>>>> do_page_fault() still has this problem:
>>>>
>>>> lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.
>>>>
>>>> And before commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for
>>>> dcache name comparison and hashing"), hash_name accessed the name 
>>>> byte by
>>>> byte.
>>>>
>>>> To prevent load_unaligned_zeropad() from accessing beyond the valid 
>>>> memory
>>>> region, we would need to intercept such cases beforehand? But doing so
>>>> would require replicating the internal logic of 
>>>> load_unaligned_zeropad(),
>>>> including handling endianness and constructing the correct value 
>>>> manually.
>>>> Given that load_unaligned_zeropad() is used in many places across the
>>>> kernel, we currently haven't found a good solution to address this 
>>>> cleanly.
>>>>
>>>> What would be the recommended way to handle this situation? Would
>>>> appreciate any feedback and guidance from the community. Thanks!
>>>
>>> Does it help if you bodge the translation fault handler along the lines
>>> of the untested diff below?

I tried it out and it works — thank you for the solution you provided.

At the same time, since I’m a beginner in this area, I’d like to ask a
question.

The comment above do_translation_fault() says:
“We enter here because the first level page table doesn't contain a
valid entry for the address.”

However, after modifying the code, it seems that when encountering
FSR_FS_INVALID_PAGE, the kernel no longer creates a page table entry,
but instead directly jumps to bad_area.

I'd like to ask — could this change potentially cause any other side
effects?

Thanks,
Zizhi Wo


>>
>> Thank you for the solution you provided. However, I seem to have
>> encountered a bit of a problem.
>>
>>>
>>> Will
>>>
>>> --->8
>>>
>>> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
>>> index bf1577216ffa..b3c81e448798 100644
>>> --- a/arch/arm/mm/fault.c
>>> +++ b/arch/arm/mm/fault.c
>>> @@ -407,7 +407,7 @@ do_translation_fault(unsigned long addr, unsigned 
>>> int fsr,
>>>          if (addr < TASK_SIZE)
>>>                  return do_page_fault(addr, fsr, regs);
>>> -       if (user_mode(regs))
>>> +       if (user_mode(regs) || fsr_fs(fsr) == FSR_FS_INVALID_PAGE)
>>>                  goto bad_area;
>>
>>
>>
>> I'm getting an "FSR_FS_INVALID_PAGE undeclared" error during
>> compilation...
>>
>> In which kernel or FSR version was this macro or constant defined
> 
> Sorry, I didn't see this "#define FSR_FS_INVALID_PAGE". I'll try again
> right away.
> 
> Please ignore my previous reply.
> 
>>
>>>          index = pgd_index(addr);
>>> diff --git a/arch/arm/mm/fault.h b/arch/arm/mm/fault.h
>>> index 9ecc2097a87a..8fb26f85e361 100644
>>> --- a/arch/arm/mm/fault.h
>>> +++ b/arch/arm/mm/fault.h
>>> @@ -12,6 +12,8 @@
>>>   #define FSR_FS3_0              (15)
>>>   #define FSR_FS5_0              (0x3f)
>>> +#define FSR_FS_INVALID_PAGE    7
>>> +
>>>   #ifdef CONFIG_ARM_LPAE
>>>   #define FSR_FS_AEA             17
>>> diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
>>> index f2be95197265..c7060da345df 100644
>>> --- a/arch/arm/mm/fsr-2level.c
>>> +++ b/arch/arm/mm/fsr-2level.c
>>> @@ -11,7 +11,7 @@ static struct fsr_info fsr_info[] = {
>>>          { do_bad,               SIGBUS,  0,             "external 
>>> abort on linefetch"      },
>>>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "section 
>>> translation fault"        },
>>>          { do_bad,               SIGBUS,  0,             "external 
>>> abort on linefetch"      },
>>> -       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "page 
>>> translation fault"           },
>>> +       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "page 
>>> translation fault"           },
>>>          { do_bad,               SIGBUS,  0,             "external 
>>> abort on non-linefetch"  },
>>>          { do_bad,               SIGSEGV, SEGV_ACCERR,   "section 
>>> domain fault"             },
>>>          { do_bad,               SIGBUS,  0,             "external 
>>> abort on non-linefetch"  },
>>> diff --git a/arch/arm/mm/fsr-3level.c b/arch/arm/mm/fsr-3level.c
>>> index d0ae2963656a..19df4af828bd 100644
>>> --- a/arch/arm/mm/fsr-3level.c
>>> +++ b/arch/arm/mm/fsr-3level.c
>>> @@ -7,7 +7,7 @@ static struct fsr_info fsr_info[] = {
>>>          { do_bad,               SIGBUS,  0,             "reserved 
>>> translation fault"    },
>>>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 1 
>>> translation fault"     },
>>>          { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 2 
>>> translation fault"     },
>>> -       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "level 3 
>>> translation fault"     },
>>> +       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 3 
>>> translation fault"     },
>>>          { do_bad,               SIGBUS,  0,             "reserved 
>>> access flag fault"    },
>>>          { do_bad,               SIGSEGV, SEGV_ACCERR,   "level 1 
>>> access flag fault"     },
>>>          { do_page_fault,        SIGSEGV, SEGV_ACCERR,   "level 2 
>>> access flag fault"     },
>>>
>>>
>>
>> By the way, I tried Al's solution, and this problem didn't reproduce.
>>
>> Thanks,
>> Zizhi Wo
> 
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-28  1:39       ` Zizhi Wo
@ 2025-11-28 12:25         ` Will Deacon
  2025-11-29  1:02           ` Zizhi Wo
  0 siblings, 1 reply; 59+ messages in thread
From: Will Deacon @ 2025-11-28 12:25 UTC (permalink / raw)
  To: Zizhi Wo
  Cc: jack, brauner, hch, akpm, linux, linux-fsdevel, linux-kernel,
	linux-mm, linux-arm-kernel, yangerkun, wangkefeng.wang,
	pangliyuan1, xieyuanbin1

On Fri, Nov 28, 2025 at 09:39:45AM +0800, Zizhi Wo wrote:
> 在 2025/11/28 9:18, Zizhi Wo 写道:
> > 在 2025/11/28 9:17, Zizhi Wo 写道:
> > > 在 2025/11/27 20:59, Will Deacon 写道:
> > > > On Wed, Nov 26, 2025 at 05:05:05PM +0800, Zizhi Wo wrote:
> > > > > We're running into the following issue on an ARM32 platform
> > > > > with the linux
> > > > > 5.10 kernel:
> > > > > 
> > > > > [<c0300b78>] (__dabt_svc) from [<c0529cb8>]
> > > > > (link_path_walk.part.7+0x108/0x45c)
> > > > > [<c0529cb8>] (link_path_walk.part.7) from [<c052a948>]
> > > > > (path_openat+0xc4/0x10ec)
> > > > > [<c052a948>] (path_openat) from [<c052cf90>] (do_filp_open+0x9c/0x114)
> > > > > [<c052cf90>] (do_filp_open) from [<c0511e4c>]
> > > > > (do_sys_openat2+0x418/0x528)
> > > > > [<c0511e4c>] (do_sys_openat2) from [<c0513d98>] (do_sys_open+0x88/0xe4)
> > > > > [<c0513d98>] (do_sys_open) from [<c03000c0>]
> > > > > (ret_fast_syscall+0x0/0x58)
> > > > > ...
> > > > > [<c0315e34>] (unwind_backtrace) from [<c030f2b0>]
> > > > > (show_stack+0x20/0x24)
> > > > > [<c030f2b0>] (show_stack) from [<c14239f4>] (dump_stack+0xd8/0xf8)
> > > > > [<c14239f4>] (dump_stack) from [<c038d188>]
> > > > > (___might_sleep+0x19c/0x1e4)
> > > > > [<c038d188>] (___might_sleep) from [<c031b6fc>]
> > > > > (do_page_fault+0x2f8/0x51c)
> > > > > [<c031b6fc>] (do_page_fault) from [<c031bb44>]
> > > > > (do_DataAbort+0x90/0x118)
> > > > > [<c031bb44>] (do_DataAbort) from [<c0300b78>] (__dabt_svc+0x58/0x80)
> > > > > ...
> > > > > 
> > > > > During the execution of
> > > > > hash_name()->load_unaligned_zeropad(), a potential
> > > > > memory access beyond the PAGE boundary may occur. For example, when the
> > > > > filename length is near the PAGE_SIZE boundary. This
> > > > > triggers a page fault,
> > > > > which leads to a call to
> > > > > do_page_fault()->mmap_read_trylock(). If we can't
> > > > > acquire the lock, we have to fall back to the
> > > > > mmap_read_lock() path, which
> > > > > calls might_sleep(). This breaks RCU semantics because path
> > > > > lookup occurs
> > > > > under an RCU read-side critical section. In linux-mainline, arm/arm64
> > > > > do_page_fault() still has this problem:
> > > > > 
> > > > > lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.
> > > > > 
> > > > > And before commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for
> > > > > dcache name comparison and hashing"), hash_name accessed the
> > > > > name byte by
> > > > > byte.
> > > > > 
> > > > > To prevent load_unaligned_zeropad() from accessing beyond
> > > > > the valid memory
> > > > > region, we would need to intercept such cases beforehand? But doing so
> > > > > would require replicating the internal logic of
> > > > > load_unaligned_zeropad(),
> > > > > including handling endianness and constructing the correct
> > > > > value manually.
> > > > > Given that load_unaligned_zeropad() is used in many places across the
> > > > > kernel, we currently haven't found a good solution to
> > > > > address this cleanly.
> > > > > 
> > > > > What would be the recommended way to handle this situation? Would
> > > > > appreciate any feedback and guidance from the community. Thanks!
> > > > 
> > > > Does it help if you bodge the translation fault handler along the lines
> > > > of the untested diff below?
> 
> I tried it out and it works — thank you for the solution you provided.

Thanks for giving it a spin.

> At the same time, since I’m a beginner in this area, I’d like to ask a
> question.
> 
> The comment above do_translation_fault() says:
> “We enter here because the first level page table doesn't contain a
> valid entry for the address.”
> 
> However, after modifying the code, it seems that when encountering
> FSR_FS_INVALID_PAGE, the kernel no longer creates a page table entry,
> but instead directly jumps to bad_area.

FSR_FS_INVALID_PAGE indicates a last level translation fault (that's the
"page" part) so it's only applicable in the case where the other levels
of page-table have been populated already.

I wondered about checking !is_vmalloc_addr() too, but I couldn't
convince myself that load_unaligned_zeropad() is only ever used with the
linear map.

> I'd like to ask — could this change potentially cause any other side
> effects?

There's always the possibility but I personally think it's more
self-contained than the other patches doing the rounds. For example, I
don't make any changes to the permission fault handling path.

Will


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-27 10:57     ` Russell King (Oracle)
@ 2025-11-28 17:06       ` Linus Torvalds
  2025-11-29  1:01         ` Zizhi Wo
                           ` (3 more replies)
  0 siblings, 4 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-11-28 17:06 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Zizhi Wo, Catalin Marinas, Will Deacon, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

On Thu, 27 Nov 2025 at 02:58, Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> Ha!
>
> As said elsewhere, it looks like 32-bit ARM has been missing updates to
> the fault handler since pre-git history - this was modelled in the dim
> and distant i386 handling, and it just hasn't kept up.

I actually have this dim memory of having seen something along these
lines before, and I just had never realized how it could happen,
because that call to do_page_fault() in do_translation_fault()
visually *looks* like the only call-site, and so that

        if (addr < TASK_SIZE)
                return do_page_fault(addr, fsr, regs);

looks like it does everything correctly. That "do_page_fault()"
function is static to the arch/arm/mm/fault.c file, and that's the
only place that appears to call it.

The operative word being "appears".

Becuse I had never before realized that that fault.c then also does that

  #include "fsr-2level.c"

and then that do_page_fault() function is exposed through those
fsr_info[] operation arrays.

Anyway, I don't think that the ARM fault handling is all *that* bad.
Sure, it might be worth double-checking, but it *has* been converted
to the generic accounting helpers a few years ago and to the stack
growing fixes.

I think the fix here may be as simple as this trivial patch:

  diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
  index 2bc828a1940c..27024ec2d46d 100644
  --- a/arch/arm/mm/fault.c
  +++ b/arch/arm/mm/fault.c
  @@ -277,6 +277,10 @@ do_page_fault(unsigned long addr, ...
        if (interrupts_enabled(regs))
                local_irq_enable();

  +     /* non-user address faults never have context */
  +     if (addr >= TASK_SIZE)
  +             goto no_context;
  +
        /*
         * If we're in an interrupt or have no user
         * context, we must not take the fault..

but I really haven't thought much about it.

> I'm debating whether an entire rewrite would be appropriate

I don't think it's necessarily all that big of a deal. Yeah, this is
old code, and yeah, it could probably be cleaned up a bit, but at the
same time, "old and crusty" also means "fairly well tested". This
whole fault on a kernel address is a fairly unusual case, and as
mentioned, I *think* the above fix is sufficient.

Zizhi Wo - can you confirm that that patch (whitespace-damaged, but
simple enough to just do manually) fixes things for your test-case?

           Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-28 17:06       ` Linus Torvalds
@ 2025-11-29  1:01         ` Zizhi Wo
  2025-11-29  1:35           ` Linus Torvalds
  2025-11-29  2:18         ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 59+ messages in thread
From: Zizhi Wo @ 2025-11-29  1:01 UTC (permalink / raw)
  To: Linus Torvalds, Russell King (Oracle)
  Cc: Zizhi Wo, Catalin Marinas, Will Deacon, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1, Al Viro



在 2025/11/29 1:06, Linus Torvalds 写道:
> On Thu, 27 Nov 2025 at 02:58, Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
>>
>> Ha!
>>
>> As said elsewhere, it looks like 32-bit ARM has been missing updates to
>> the fault handler since pre-git history - this was modelled in the dim
>> and distant i386 handling, and it just hasn't kept up.
> 
> I actually have this dim memory of having seen something along these
> lines before, and I just had never realized how it could happen,
> because that call to do_page_fault() in do_translation_fault()
> visually *looks* like the only call-site, and so that
> 
>          if (addr < TASK_SIZE)
>                  return do_page_fault(addr, fsr, regs);
> 
> looks like it does everything correctly. That "do_page_fault()"
> function is static to the arch/arm/mm/fault.c file, and that's the
> only place that appears to call it.
> 
> The operative word being "appears".
> 
> Becuse I had never before realized that that fault.c then also does that
> 
>    #include "fsr-2level.c"
> 
> and then that do_page_fault() function is exposed through those
> fsr_info[] operation arrays.

Yes, it enters through fsr_info.

> 
> Anyway, I don't think that the ARM fault handling is all *that* bad.
> Sure, it might be worth double-checking, but it *has* been converted
> to the generic accounting helpers a few years ago and to the stack
> growing fixes.
> 
> I think the fix here may be as simple as this trivial patch:
> 
>    diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
>    index 2bc828a1940c..27024ec2d46d 100644
>    --- a/arch/arm/mm/fault.c
>    +++ b/arch/arm/mm/fault.c
>    @@ -277,6 +277,10 @@ do_page_fault(unsigned long addr, ...
>          if (interrupts_enabled(regs))
>                  local_irq_enable();
> 
>    +     /* non-user address faults never have context */
>    +     if (addr >= TASK_SIZE)
>    +             goto no_context;
>    +
>          /*
>           * If we're in an interrupt or have no user
>           * context, we must not take the fault..
> 
> but I really haven't thought much about it.
> 
>> I'm debating whether an entire rewrite would be appropriate

Thank you for your answer. In fact, this solution is similar to the one
provided by Al. It has an additional check to determine reg:

```
if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
	goto no_context;
```

I'd like to ask if this "regs" examination also needs to be brought
along?

I'm even thinking if we directly have the corresponding processing
replaced by do_translation_fault(), is that also correct?

```
-       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "page 
translation fault"           },
+       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "page 
translation fault"           },
```

> 
> I don't think it's necessarily all that big of a deal. Yeah, this is
> old code, and yeah, it could probably be cleaned up a bit, but at the
> same time, "old and crusty" also means "fairly well tested". This
> whole fault on a kernel address is a fairly unusual case, and as
> mentioned, I *think* the above fix is sufficient.
> 
> Zizhi Wo - can you confirm that that patch (whitespace-damaged, but
> simple enough to just do manually) fixes things for your test-case?
> 
>             Linus
> 

I tried it out and it works — thanks!

Thanks,
Zizhi Wo



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-28 12:25         ` Will Deacon
@ 2025-11-29  1:02           ` Zizhi Wo
  2025-11-29  3:55             ` Al Viro
  0 siblings, 1 reply; 59+ messages in thread
From: Zizhi Wo @ 2025-11-29  1:02 UTC (permalink / raw)
  To: Will Deacon, Zizhi Wo, Linus Torvalds
  Cc: jack, brauner, hch, akpm, linux, linux-fsdevel, linux-kernel,
	linux-mm, linux-arm-kernel, yangerkun, wangkefeng.wang,
	pangliyuan1, xieyuanbin1, Al Viro



在 2025/11/28 20:25, Will Deacon 写道:
> On Fri, Nov 28, 2025 at 09:39:45AM +0800, Zizhi Wo wrote:
>> 在 2025/11/28 9:18, Zizhi Wo 写道:
>>> 在 2025/11/28 9:17, Zizhi Wo 写道:
>>>> 在 2025/11/27 20:59, Will Deacon 写道:
>>>>> On Wed, Nov 26, 2025 at 05:05:05PM +0800, Zizhi Wo wrote:
>>>>>> We're running into the following issue on an ARM32 platform
>>>>>> with the linux
>>>>>> 5.10 kernel:
>>>>>>
>>>>>> [<c0300b78>] (__dabt_svc) from [<c0529cb8>]
>>>>>> (link_path_walk.part.7+0x108/0x45c)
>>>>>> [<c0529cb8>] (link_path_walk.part.7) from [<c052a948>]
>>>>>> (path_openat+0xc4/0x10ec)
>>>>>> [<c052a948>] (path_openat) from [<c052cf90>] (do_filp_open+0x9c/0x114)
>>>>>> [<c052cf90>] (do_filp_open) from [<c0511e4c>]
>>>>>> (do_sys_openat2+0x418/0x528)
>>>>>> [<c0511e4c>] (do_sys_openat2) from [<c0513d98>] (do_sys_open+0x88/0xe4)
>>>>>> [<c0513d98>] (do_sys_open) from [<c03000c0>]
>>>>>> (ret_fast_syscall+0x0/0x58)
>>>>>> ...
>>>>>> [<c0315e34>] (unwind_backtrace) from [<c030f2b0>]
>>>>>> (show_stack+0x20/0x24)
>>>>>> [<c030f2b0>] (show_stack) from [<c14239f4>] (dump_stack+0xd8/0xf8)
>>>>>> [<c14239f4>] (dump_stack) from [<c038d188>]
>>>>>> (___might_sleep+0x19c/0x1e4)
>>>>>> [<c038d188>] (___might_sleep) from [<c031b6fc>]
>>>>>> (do_page_fault+0x2f8/0x51c)
>>>>>> [<c031b6fc>] (do_page_fault) from [<c031bb44>]
>>>>>> (do_DataAbort+0x90/0x118)
>>>>>> [<c031bb44>] (do_DataAbort) from [<c0300b78>] (__dabt_svc+0x58/0x80)
>>>>>> ...
>>>>>>
>>>>>> During the execution of
>>>>>> hash_name()->load_unaligned_zeropad(), a potential
>>>>>> memory access beyond the PAGE boundary may occur. For example, when the
>>>>>> filename length is near the PAGE_SIZE boundary. This
>>>>>> triggers a page fault,
>>>>>> which leads to a call to
>>>>>> do_page_fault()->mmap_read_trylock(). If we can't
>>>>>> acquire the lock, we have to fall back to the
>>>>>> mmap_read_lock() path, which
>>>>>> calls might_sleep(). This breaks RCU semantics because path
>>>>>> lookup occurs
>>>>>> under an RCU read-side critical section. In linux-mainline, arm/arm64
>>>>>> do_page_fault() still has this problem:
>>>>>>
>>>>>> lock_mm_and_find_vma->get_mmap_lock_carefully->mmap_read_lock_killable.
>>>>>>
>>>>>> And before commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for
>>>>>> dcache name comparison and hashing"), hash_name accessed the
>>>>>> name byte by
>>>>>> byte.
>>>>>>
>>>>>> To prevent load_unaligned_zeropad() from accessing beyond
>>>>>> the valid memory
>>>>>> region, we would need to intercept such cases beforehand? But doing so
>>>>>> would require replicating the internal logic of
>>>>>> load_unaligned_zeropad(),
>>>>>> including handling endianness and constructing the correct
>>>>>> value manually.
>>>>>> Given that load_unaligned_zeropad() is used in many places across the
>>>>>> kernel, we currently haven't found a good solution to
>>>>>> address this cleanly.
>>>>>>
>>>>>> What would be the recommended way to handle this situation? Would
>>>>>> appreciate any feedback and guidance from the community. Thanks!
>>>>>
>>>>> Does it help if you bodge the translation fault handler along the lines
>>>>> of the untested diff below?
>>
>> I tried it out and it works — thank you for the solution you provided.
> 
> Thanks for giving it a spin.
> 
>> At the same time, since I’m a beginner in this area, I’d like to ask a
>> question.
>>
>> The comment above do_translation_fault() says:
>> “We enter here because the first level page table doesn't contain a
>> valid entry for the address.”
>>
>> However, after modifying the code, it seems that when encountering
>> FSR_FS_INVALID_PAGE, the kernel no longer creates a page table entry,
>> but instead directly jumps to bad_area.
> 
> FSR_FS_INVALID_PAGE indicates a last level translation fault (that's the
> "page" part) so it's only applicable in the case where the other levels
> of page-table have been populated already.
> 
> I wondered about checking !is_vmalloc_addr() too, but I couldn't
> convince myself that load_unaligned_zeropad() is only ever used with the
> linear map.
> 

Thank you very much for the answer. For the vmalloc area, I checked the
call points on the vfs side, such as dentry_string_cmp() or hash_name().
Their "names addr" are all assigned by kmalloc(), so there should be no
corresponding issues. But I'm not familiar with the other calling
points...


>> I'd like to ask — could this change potentially cause any other side
>> effects?
> 
> There's always the possibility but I personally think it's more
> self-contained than the other patches doing the rounds. For example, I
> don't make any changes to the permission fault handling path.
> 
> Will
> 

Ok. Thank you for your explanation.

Thanks,
Zizhi Wo



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-29  1:01         ` Zizhi Wo
@ 2025-11-29  1:35           ` Linus Torvalds
  2025-11-29  4:08             ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
                               ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-11-29  1:35 UTC (permalink / raw)
  To: Zizhi Wo
  Cc: Russell King (Oracle),
	Catalin Marinas, Will Deacon, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1, Al Viro

On Fri, 28 Nov 2025 at 17:01, Zizhi Wo <wozizhi@huaweicloud.com> wrote:
>
> Thank you for your answer. In fact, this solution is similar to the one
> provided by Al.

Hmm. I'm not seeing the replies from Al for some reason. Maybe he didn't cc me.

> It has an additional check to determine reg:
>
> if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
>         goto no_context;
>
> I'd like to ask if this "regs" examination also needs to be brought
> along?

That seems unnecessary.

Yes, in this case the original problem you reported with sleeping in
an RCU region was triggered by a kernel access, and a user-space
access would never have caused any such issues.

So checking for !user_mode(regs) isn't exactly *wrong*.

But while it isn't wrong, I think it's also kind of pointless.

Because regardless of whether it's a kernel or user space access, an
access outside TASK_SIZE shouldn't be associated with a valid user
space context, so the code might as well just go to the "no_context"
label directly.

That said, somebody should  definitely double-check me - because I
think arm also did the vdso trick at high addresses that i386 used to
do, so there is the fake VDSO thing up there.

But if you get a page fault on that, it's not going to be fixed up, so
even if user space can access it, there's no point in looking that
fake vm area up for page faults.

I think.

> I'm even thinking if we directly have the corresponding processing
> replaced by do_translation_fault(), is that also correct?
>
> ```
> -       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "page
> translation fault"           },
> +       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "page
> translation fault"           },

I think that might break kprobes.

Looking around, I think my patch might also be a bit broken: I think
it might be better to move it further down to below the check for
FSR_LNX_PF.

But somebody who knows the exact arm page fault handling better than
me should verify both that and my VDSO gate page thinking.

           Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger
  2025-11-28 17:06       ` Linus Torvalds
  2025-11-29  1:01         ` Zizhi Wo
@ 2025-11-29  2:18         ` Xie Yuanbin
  2025-12-01 13:28         ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Will Deacon
  2025-12-02 12:43         ` Russell King (Oracle)
  3 siblings, 0 replies; 59+ messages in thread
From: Xie Yuanbin @ 2025-11-29  2:18 UTC (permalink / raw)
  To: torvalds, will, linux, viro, bigeasy, rmk+kernel
  Cc: akpm, brauner, catalin.marinas, hch, jack, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, pangliyuan1,
	wangkefeng.wang, wozizhi, xieyuanbin1, yangerkun, lilinjie8,
	liaohua4

Hi, Linus Torvalds and Will Deacon!

We have some discussion and solutions on other threads, and it seems
that there are somthing missing on this discussion thread. Therefore,
I think it is necessary to synchronize some information here.

1. There is a test case that can consistently reproduce the bug, which
might be helpful for us to do the test. The test case is located after
the '---' maker line in the following patch:
Link: https://lore.kernel.org/20251126101952.174467-1-xieyuanbin1@huawei.com

2. Al Viro give a suggest on 2025-11-26 19:26:
Link: https://lore.kernel.org/20251126192640.GD3538@ZenIV

This patch is similar to one I submitted long time ago, which was
intended fix another bug: missing branch predictor mitigation:
Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com

My patch was not accepted, Sebastian's patch:
Link: https://lore.kernel.org/20251110145555.2555055-2-bigeasy@linutronix.de
fixed this bug, but Sebastian's patch has not yet been merged into the
linux-next branch, so this bug still exists in the current linux-next
branch.

I hope there is a simple solution to fix both bugs, so I submitted this
patch on 2025-11-27 14:49:
Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com
This patch is based on the linux-next branch, therefore it does not
contain Sebastian's patch.

3. On 2025-11-28 17:06, Linus Torvalds provided a solution similar to
Al Viro's suggestion and my patch:
Link: https://lore.kernel.org/CAHk-=wh+cFLLi2x6u61pvL07phSyHPVBTo9Lac2uuqK4eRG_=w@mail.gmail.com

Currently, all solutions have been tested that can fix this one bug.
I still hold the view that perhaps there is a simpler way to fix another
bug at the same time, because the solutions of these two bugs are very
similar.

Thanks very much!

Xie Yuanbin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-27  2:24   ` Zizhi Wo
@ 2025-11-29  3:37     ` Al Viro
  2025-11-30  3:01       ` [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context) Al Viro
  2025-12-01  2:03       ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
  0 siblings, 2 replies; 59+ messages in thread
From: Al Viro @ 2025-11-29  3:37 UTC (permalink / raw)
  To: Zizhi Wo
  Cc: torvalds, jack, brauner, hch, akpm, linux, linux-fsdevel,
	linux-kernel, linux-mm, linux-arm-kernel, yangerkun,
	wangkefeng.wang, pangliyuan1, xieyuanbin1

On Thu, Nov 27, 2025 at 10:24:19AM +0800, Zizhi Wo wrote:

> Why does x86 have special handling in do_kern_addr_fault(), including
> logic for vmalloc faults? For example, on CONFIG_X86_32, it still takes
> the vmalloc_fault path. As noted in the x86 comments, "We can fault-in
> kernel-space virtual memory on-demand"...
> 
> But on arm64, I don’t see similar logic — is there a specific reason
> for this difference? Maybe x86's vmalloc area is mapped lazily, while
> ARM maps it fully during early boot?

x86 MMU uses the same register for kernel and userland top-level page
tables; arm64 MMU has separate page tables for those - TTBR0 and TTBR1
point to the table to be used for translation, depending upon the bit
55 of virtual address.

vmalloc works with page table of init_mm (see pgd_offset_k() uses in
there).  On arm64 that's it - TTBR1 is set to that and it stays that way,
so access to vmalloc'ed area will do the right thing.

On 32bit x86 you need to propagate the change into top-level page tables
of every thread.  That's what arch_sync_kernel_mappings() is for; look for
the calls in mm/vmalloc.c and see the discussion of race in the comment in
front of x86 vmalloc_fault().  Nothing of that sort is needed of arm64,
since all threads are using the same page table for kernel part of the
address space.

The reason why 64bit x86 doesn't need to bother is different - there we
fill all relevant top-level page table slots in preallocate_vmalloc_pages()
before any additional threads could be created.  The pointers in those
slots are not going to change and they will be propagated to all subsequent
threads by pgd_alloc(), so the page tables actually modified by vmalloc()
are shared by all threads.

AFAICS, 32bit arm is similar to 32bit x86 in that respect; propagation
is lazier, though - there arch_sync_kernel_mappings() bumps a counter
in init_mm and context switches use that to check if propagation needs
to be done.  No idea how well does that work on vfree() side of things -
hadn't looked into that rabbit hole...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-29  1:02           ` Zizhi Wo
@ 2025-11-29  3:55             ` Al Viro
  2025-12-01  2:38               ` Zizhi Wo
  0 siblings, 1 reply; 59+ messages in thread
From: Al Viro @ 2025-11-29  3:55 UTC (permalink / raw)
  To: Zizhi Wo
  Cc: Will Deacon, Linus Torvalds, jack, brauner, hch, akpm, linux,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

On Sat, Nov 29, 2025 at 09:02:27AM +0800, Zizhi Wo wrote:

> Thank you very much for the answer. For the vmalloc area, I checked the
> call points on the vfs side, such as dentry_string_cmp() or hash_name().
> Their "names addr" are all assigned by kmalloc(), so there should be no
> corresponding issues. But I'm not familiar with the other calling
> points...

Pathname might be a symlink body, sitting in page cache or whatever
->get_link() has returned...


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger
  2025-11-29  1:35           ` Linus Torvalds
@ 2025-11-29  4:08             ` Xie Yuanbin
  2025-11-29  9:08               ` Al Viro
  2025-11-29  8:54             ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Al Viro
  2025-12-01  2:08             ` Zizhi Wo
  2 siblings, 1 reply; 59+ messages in thread
From: Xie Yuanbin @ 2025-11-29  4:08 UTC (permalink / raw)
  To: torvalds, will, linux, viro, bigeasy, rmk+kernel
  Cc: akpm, brauner, catalin.marinas, hch, jack, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, pangliyuan1,
	wangkefeng.wang, wozizhi, xieyuanbin1, yangerkun, lilinjie8,
	liaohua4

On Fri, 28 Nov 2025 17:35:37 -0800, Linus Torvalds wrote:
> On Fri, 28 Nov 2025 at 17:01, Zizhi Wo <wozizhi@huaweicloud.com> wrote:
>> It has an additional check to determine reg:
>>
>> if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
>>         goto no_context;
>>
>> I'd like to ask if this "regs" examination also needs to be brought
>> along?
>
> That seems unnecessary.
>
> Yes, in this case the original problem you reported with sleeping in
> an RCU region was triggered by a kernel access, and a user-space
> access would never have caused any such issues.
>
> So checking for !user_mode(regs) isn't exactly *wrong*.
>
> But while it isn't wrong, I think it's also kind of pointless.
>
> Because regardless of whether it's a kernel or user space access, an
> access outside TASK_SIZE shouldn't be associated with a valid user
> space context, so the code might as well just go to the "no_context"
> label directly.
>
> That said, somebody should  definitely double-check me - because I
> think arm also did the vdso trick at high addresses that i386 used to
> do, so there is the fake VDSO thing up there.
>
> But if you get a page fault on that, it's not going to be fixed up, so
> even if user space can access it, there's no point in looking that
> fake vm area up for page faults.

I think the `user_mode(regs)` check is necessary because the label
no_context actually jumps to __do_kernel_fault(), whereas page fault
from user mode should jump to `__do_user_fault()`.

Alternatively, we would need to change `goto no_context` to
`goto bad_area`. Or perhaps I misunderstood something, please point it out.

Thanks very much!

Xie Yuanbin


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-29  1:35           ` Linus Torvalds
  2025-11-29  4:08             ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
@ 2025-11-29  8:54             ` Al Viro
  2025-12-01  2:08             ` Zizhi Wo
  2 siblings, 0 replies; 59+ messages in thread
From: Al Viro @ 2025-11-29  8:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zizhi Wo, Russell King (Oracle),
	Catalin Marinas, Will Deacon, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

On Fri, Nov 28, 2025 at 05:35:37PM -0800, Linus Torvalds wrote:
> On Fri, 28 Nov 2025 at 17:01, Zizhi Wo <wozizhi@huaweicloud.com> wrote:
> >
> > Thank you for your answer. In fact, this solution is similar to the one
> > provided by Al.
> 
> Hmm. I'm not seeing the replies from Al for some reason. Maybe he didn't cc me.

<checks>
Sorry, I thought you've been somewhere in that Cc, should've verified that.

> That said, somebody should  definitely double-check me - because I
> think arm also did the vdso trick at high addresses that i386 used to
> do, so there is the fake VDSO thing up there.
> 
> But if you get a page fault on that, it's not going to be fixed up, so
> even if user space can access it, there's no point in looking that
> fake vm area up for page faults.

gate_vma is not inserted anywhere - it's special-cased in coredump_next_vma(),
proc_get_vma() and get_gate_page(); none of that can lead to insertion.

AFAICS its uses in mlock_fixup(), __mmap_complete() and should_skip_vam()
are pure paranoia - "if we somehow ended up running into gate_vma, let's
make sure we don't screw it over" and had always been that way.

So VMA lookup would get NULL.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger
  2025-11-29  4:08             ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
@ 2025-11-29  9:08               ` Al Viro
  2025-11-29  9:25                 ` Xie Yuanbin
  2025-11-29 10:45                 ` david laight
  0 siblings, 2 replies; 59+ messages in thread
From: Al Viro @ 2025-11-29  9:08 UTC (permalink / raw)
  To: Xie Yuanbin
  Cc: torvalds, will, linux, bigeasy, rmk+kernel, akpm, brauner,
	catalin.marinas, hch, jack, linux-arm-kernel, linux-fsdevel,
	linux-kernel, linux-mm, pangliyuan1, wangkefeng.wang, wozizhi,
	yangerkun, lilinjie8, liaohua4

On Sat, Nov 29, 2025 at 12:08:17PM +0800, Xie Yuanbin wrote:

> I think the `user_mode(regs)` check is necessary because the label
> no_context actually jumps to __do_kernel_fault(), whereas page fault
> from user mode should jump to `__do_user_fault()`.
> 
> Alternatively, we would need to change `goto no_context` to
> `goto bad_area`. Or perhaps I misunderstood something, please point it out.

FWIW, goto bad_area has an obvious problem: uses of 'fault' value, which
contains garbage.

The cause of problem is the heuristics in get_mmap_lock_carefully():
	if (regs && !user_mode(regs)) {
		unsigned long ip = exception_ip(regs);
		if (!search_exception_tables(ip))
			return false;
	}
trylock has failed and we are trying to decide whether it's safe to block.
The assumption (inherited from old logics in assorted page fault handlers)
is "by that point we know that fault in kernel mode is either an oops
or #PF on uaccess; in the latter case we should be OK with locking mm,
in the former we should just get to oopsing without risking deadlocks".

load_unaligned_zeropad() is where that assumption breaks - there is
an exception handler and it's not an uaccess attempt; the address is
not going to match any VMA and we really don't want to do anything
blocking.

Note that VMA lookup will return NULL there anyway - there won't be a VMA
for that address.  What we get is exactly the same thing we'd get from
do_bad_area(), whether we get a kernel or userland insn faulting.

The minimal fix would be something like
	if (unlikely(addr >= TASK_SIZE) && !(flags & FAULT_FLAG_USER))
		goto no_context;

right before
	if (!(flags & FAULT_FLAG_USER))
		goto lock_mmap;

in do_page_fault().  Alternatively,
	if (unlikely(addr >= TASK_SIZE)) {
		do_bad_area(addr, fsr, regs);
		return 0;
	}
or
	if (unlikely(addr >= TASK_SIZE)) {
		fault = 0;
		code = SEGV_MAPERR;
		goto bad_area;
	}
at the same place.  Incidentally, making do_bad_area() return 0 would
seem to make all callers happier...


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger
  2025-11-29  9:08               ` Al Viro
@ 2025-11-29  9:25                 ` Xie Yuanbin
  2025-11-29  9:44                   ` Al Viro
  2025-11-29 10:45                 ` david laight
  1 sibling, 1 reply; 59+ messages in thread
From: Xie Yuanbin @ 2025-11-29  9:25 UTC (permalink / raw)
  To: viro, will, linux, bigeasy, rmk+kernel
  Cc: akpm, brauner, catalin.marinas, hch, jack, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, pangliyuan1,
	wangkefeng.wang, wozizhi, xieyuanbin1, yangerkun, lilinjie8,
	liaohua4

On Sat, 29 Nov 2025 09:08:13 +0000, Al Viro wrote:
> On Sat, Nov 29, 2025 at 12:08:17PM +0800, Xie Yuanbin wrote:
>
>> I think the `user_mode(regs)` check is necessary because the label
>> no_context actually jumps to __do_kernel_fault(), whereas page fault
>> from user mode should jump to `__do_user_fault()`.
>>
>> Alternatively, we would need to change `goto no_context` to
>> `goto bad_area`. Or perhaps I misunderstood something, please point it out.
>
> FWIW, goto bad_area has an obvious problem: uses of 'fault' value, which
> contains garbage.

Yes, I know it, I just omitted it. Thank you for pointing that out.

> or
> 	if (unlikely(addr >= TASK_SIZE)) {
> 		fault = 0;
> 		code = SEGV_MAPERR;
> 		goto bad_area;
> 	}

In fact, I have already submitted another patch, which is exactly the way
as you described:
Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com

The only difference is that I will move the judgment to before
local_irq_enable(). The reason for doing this is to fix another bug,
you can find more details about it here:
Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com
Link: https://lore.kernel.org/20251129021815.9679-1-xieyuanbin1@huawei.com

To keep the email concise, I will not repeat the description here.

Xie Yuanbin


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger
  2025-11-29  9:25                 ` Xie Yuanbin
@ 2025-11-29  9:44                   ` Al Viro
  2025-11-29 10:05                     ` Xie Yuanbin
  0 siblings, 1 reply; 59+ messages in thread
From: Al Viro @ 2025-11-29  9:44 UTC (permalink / raw)
  To: Xie Yuanbin
  Cc: will, linux, bigeasy, rmk+kernel, akpm, brauner, catalin.marinas,
	hch, jack, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, pangliyuan1, wangkefeng.wang, wozizhi, yangerkun,
	lilinjie8, liaohua4

On Sat, Nov 29, 2025 at 05:25:45PM +0800, Xie Yuanbin wrote:
> On Sat, 29 Nov 2025 09:08:13 +0000, Al Viro wrote:
> > On Sat, Nov 29, 2025 at 12:08:17PM +0800, Xie Yuanbin wrote:
> >
> >> I think the `user_mode(regs)` check is necessary because the label
> >> no_context actually jumps to __do_kernel_fault(), whereas page fault
> >> from user mode should jump to `__do_user_fault()`.
> >>
> >> Alternatively, we would need to change `goto no_context` to
> >> `goto bad_area`. Or perhaps I misunderstood something, please point it out.
> >
> > FWIW, goto bad_area has an obvious problem: uses of 'fault' value, which
> > contains garbage.
> 
> Yes, I know it, I just omitted it. Thank you for pointing that out.
> 
> > or
> > 	if (unlikely(addr >= TASK_SIZE)) {
> > 		fault = 0;
> > 		code = SEGV_MAPERR;
> > 		goto bad_area;
> > 	}
> 
> In fact, I have already submitted another patch, which is exactly the way
> as you described:
> Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com
> 
> The only difference is that I will move the judgment to before
> local_irq_enable(). The reason for doing this is to fix another bug,
> you can find more details about it here:
> Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com
> Link: https://lore.kernel.org/20251129021815.9679-1-xieyuanbin1@huawei.com

AFAICS, your patch does nothing to the case when we hit kernel address from
kernel mode, which is what triggers that "block in RCU mode for no good reason"
fun...


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger
  2025-11-29  9:44                   ` Al Viro
@ 2025-11-29 10:05                     ` Xie Yuanbin
  0 siblings, 0 replies; 59+ messages in thread
From: Xie Yuanbin @ 2025-11-29 10:05 UTC (permalink / raw)
  To: viro
  Cc: torvalds, akpm, bigeasy, brauner, catalin.marinas, hch, jack,
	liaohua4, lilinjie8, linux-arm-kernel, linux-fsdevel,
	linux-kernel, linux-mm, linux, pangliyuan1, rmk+kernel,
	wangkefeng.wang, will, wozizhi, xieyuanbin1, yangerkun

On Sat, 29 Nov 2025 09:44:48 +0000, Al Viro wrote:
> On Sat, Nov 29, 2025 at 05:25:45PM +0800, Xie Yuanbin wrote:
>> In fact, I have already submitted another patch, which is exactly the way
>> as you described:
>> Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com
>>
>> The only difference is that I will move the judgment to before
>> local_irq_enable(). The reason for doing this is to fix another bug,
>> you can find more details about it here:
>> Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com
>> Link: https://lore.kernel.org/20251129021815.9679-1-xieyuanbin1@huawei.com
>
>AFAICS, your patch does nothing to the case when we hit kernel address from
>kernel mode, which is what triggers that "block in RCU mode for no good reason"
>fun...

I'm a little confused. Which patch are you referring to?

BTW, I'm trying my best to fix both of these two bugs (might_sleep() in
RCU Read Critical Section and missing harden_branch_predictor()
mitigation):
Link: https://lore.kernel.org/20251126090505.3057219-1-wozizhi@huaweicloud.com
Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com
at the same time, because I feel that the solutions of these two bugs are
very similar in some way. And there is a preliminary solution in place:
```patch
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2bc828a1940c..5c58072d8235 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -270,10 +270,15 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (kprobe_page_fault(regs, fsr))
 		return 0;
 
+	if (unlikely(addr >= TASK_SIZE)) {
+		fault = 0;
+		code = SEGV_MAPERR;
+		goto bad_area;
+	}
 
 	/* Enable interrupts if they were enabled in the parent context. */
 	if (interrupts_enabled(regs))
 		local_irq_enable();
```
Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com

I'm not sure if I'm doing the right thing. Do you have any suggestions for
this?

Thanks very much!

Xie Yuanbin


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger
  2025-11-29  9:08               ` Al Viro
  2025-11-29  9:25                 ` Xie Yuanbin
@ 2025-11-29 10:45                 ` david laight
  1 sibling, 0 replies; 59+ messages in thread
From: david laight @ 2025-11-29 10:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Xie Yuanbin, torvalds, will, linux, bigeasy, rmk+kernel, akpm,
	brauner, catalin.marinas, hch, jack, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, pangliyuan1,
	wangkefeng.wang, wozizhi, yangerkun, lilinjie8, liaohua4

On Sat, 29 Nov 2025 09:08:13 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Sat, Nov 29, 2025 at 12:08:17PM +0800, Xie Yuanbin wrote:
> 
> > I think the `user_mode(regs)` check is necessary because the label
> > no_context actually jumps to __do_kernel_fault(), whereas page fault
> > from user mode should jump to `__do_user_fault()`.
> > 
> > Alternatively, we would need to change `goto no_context` to
> > `goto bad_area`. Or perhaps I misunderstood something, please point it out.  
> 
> FWIW, goto bad_area has an obvious problem: uses of 'fault' value, which
> contains garbage.
> 
> The cause of problem is the heuristics in get_mmap_lock_carefully():
> 	if (regs && !user_mode(regs)) {
> 		unsigned long ip = exception_ip(regs);
> 		if (!search_exception_tables(ip))
> 			return false;
> 	}
> trylock has failed and we are trying to decide whether it's safe to block.
> The assumption (inherited from old logics in assorted page fault handlers)
> is "by that point we know that fault in kernel mode is either an oops
> or #PF on uaccess; in the latter case we should be OK with locking mm,
> in the former we should just get to oopsing without risking deadlocks".
> 
> load_unaligned_zeropad() is where that assumption breaks - there is
> an exception handler and it's not an uaccess attempt; the address is
> not going to match any VMA and we really don't want to do anything
> blocking.

Doesn't that also affect code that (ab)uses get_user() for kernel addresss?
For x86 even __get_kernel_nofault() does that.
In that case it hits a normal 'user fault' exception table entry rather
a 'special' one that could be marked as such.

> 
> Note that VMA lookup will return NULL there anyway - there won't be a VMA
> for that address.  What we get is exactly the same thing we'd get from
> do_bad_area(), whether we get a kernel or userland insn faulting.
> 
> The minimal fix would be something like
> 	if (unlikely(addr >= TASK_SIZE) && !(flags & FAULT_FLAG_USER))
> 		goto no_context;

Is there an issue with TASK_SIZE being process dependant?
Don't you want 'the bottom of kernel addresses' not 'the top of the current process'.

	David

> 
> right before
> 	if (!(flags & FAULT_FLAG_USER))
> 		goto lock_mmap;
> 
> in do_page_fault().  Alternatively,
> 	if (unlikely(addr >= TASK_SIZE)) {
> 		do_bad_area(addr, fsr, regs);
> 		return 0;
> 	}
> or
> 	if (unlikely(addr >= TASK_SIZE)) {
> 		fault = 0;
> 		code = SEGV_MAPERR;
> 		goto bad_area;
> 	}
> at the same place.  Incidentally, making do_bad_area() return 0 would
> seem to make all callers happier...
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-29  3:37     ` Al Viro
@ 2025-11-30  3:01       ` Al Viro
  2025-11-30 11:32         ` david laight
  2025-11-30 22:16         ` Linus Torvalds
  2025-12-01  2:03       ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
  1 sibling, 2 replies; 59+ messages in thread
From: Al Viro @ 2025-11-30  3:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: akpm, linux-kernel, linux-mm, linux-alpha

On Sat, Nov 29, 2025 at 03:37:28AM +0000, Al Viro wrote:

> AFAICS, 32bit arm is similar to 32bit x86 in that respect; propagation
> is lazier, though - there arch_sync_kernel_mappings() bumps a counter
> in init_mm and context switches use that to check if propagation needs
> to be done.  No idea how well does that work on vfree() side of things -
> hadn't looked into that rabbit hole...

BTW, speaking of vmalloc space - does anybody object against sorting
CONFIG_ALPHA_LARGE_VMALLOC out, so that we wouldn't need to mess
with that in alpha page fault handler?

Basically, do what amd64 does - something along the lines of (untested)
patch below.  Comments?

[PATCH] alpha: take the LARGE_VMALLOC kludge out

Support of vmalloc space that won't fit into the single L1 slot had
been a headache for quite a while.

The only things we use seg1 for are virtual mapping of page tables
(at the last 8G) and vmalloc space (below that).  Normal setup has
vmalloc space from -16G to -8G, occupying the penultimate L1 slot.
It is set up (with table sitting just after the kernel image) very
early, by callback_init().  pgd_alloc() copies that entry when
setting a new L1 table up, and it's never changed afterwards.
All page table changes done by vmalloc() are done to tables that
are shared between all threads, avoiding the need to synchronize
them.

It would be trivial to extend that - preallocate L2 tables to
cover the entire vmalloc space (8Kb for each 8Gb of that) and
set all the L1 slots involved before anything gets forked,
then copy these slots on pgd_alloc().  Unfortunately, that
had been done in a different way - only one L2 table is
preallocated, the rest gets created on demand, which means
that we need to propagate changes to threads' L1 tables.
It's kinda-sorta handled in do_page_fault(), but it's racy and
fixing that up would be a major headache.

Bugger that for the game of soldiers - do what e.g. amd64 does
and preallocate these in mem_init().  And replace the bool
ALPHA_LARGE_VMALLOC with int ALPHA_VMALLOC_SIZE (in gigabytes),
dependent upon EXPERT and defaulting to 8.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 80367f2cf821..36cbba4e21d9 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -410,20 +410,21 @@ config ALPHA_WTINT
 
 	  If unsure, say N.
 
-# LARGE_VMALLOC is racy, if you *really* need it then fix it first
-config ALPHA_LARGE_VMALLOC
-	bool
+config ALPHA_VMALLOC_SIZE
+	int "vmalloc space (in gigabytes)" if EXPERT
+	default "8"
+	range 8 2040
 	help
-	  Process creation and other aspects of virtual memory management can
-	  be streamlined if we restrict the kernel to one PGD for all vmalloc
-	  allocations.  This equates to about 8GB.
+	  We preallocate the second-level page tables to cover the entire
+	  vmalloc area; one 8Kb page for each 8Gb.
 
-	  Under normal circumstances, this is so far and above what is needed
-	  as to be laughable.  However, there are certain applications (such
-	  as benchmark-grade in-kernel web serving) that can make use of as
-	  much vmalloc space as is available.
+	  Default is 8Gb total and under normal circumstances, this is so
+	  far and above what is needed as to be laughable.  However, there are
+	  certain applications (such as benchmark-grade in-kernel web serving)
+	  that can make use of as much vmalloc space as is available.
 
-	  Say N unless you know you need gobs and gobs of vmalloc space.
+	  Leave it at 8 unless you know you need gobs and gobs of
+	  vmalloc space.
 
 config VERBOSE_MCHECK
 	bool "Verbose Machine Checks"
diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index 90e7a9539102..0f554d01fe54 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -49,11 +49,8 @@ struct vm_area_struct;
 /* Number of pointers that fit on a page:  this will go away. */
 #define PTRS_PER_PAGE	(1UL << (PAGE_SHIFT-3))
 
-#ifdef CONFIG_ALPHA_LARGE_VMALLOC
-#define VMALLOC_START		0xfffffe0000000000
-#else
-#define VMALLOC_START		(-2*PGDIR_SIZE)
-#endif
+#define VMALLOC_SLOTS		DIV_ROUND_UP(CONFIG_ALPHA_VMALLOC_SIZE, 8)
+#define VMALLOC_START		(-(VMALLOC_SLOTS + 1)*PGDIR_SIZE)
 #define VMALLOC_END		(-PGDIR_SIZE)
 
 /*
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index a9816bbc9f34..0bc5fc4d510e 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -111,10 +111,6 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	if (!mm || faulthandler_disabled())
 		goto no_context;
 
-#ifdef CONFIG_ALPHA_LARGE_VMALLOC
-	if (address >= TASK_SIZE)
-		goto vmalloc_fault;
-#endif
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
@@ -225,24 +221,4 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
  do_sigsegv:
 	force_sig_fault(SIGSEGV, si_code, (void __user *) address);
 	return;
-
-#ifdef CONFIG_ALPHA_LARGE_VMALLOC
- vmalloc_fault:
-	if (user_mode(regs))
-		goto do_sigsegv;
-	else {
-		/* Synchronize this task's top level page-table
-		   with the "reference" page table from init.  */
-		long index = pgd_index(address);
-		pgd_t *pgd, *pgd_k;
-
-		pgd = current->active_mm->pgd + index;
-		pgd_k = swapper_pg_dir + index;
-		if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
-			pgd_val(*pgd) = pgd_val(*pgd_k);
-			return;
-		}
-		goto no_context;
-	}
-#endif
 }
diff --git a/arch/alpha/mm/init.c b/arch/alpha/mm/init.c
index 4c5ab9cd8a0a..e5eea8b05c7f 100644
--- a/arch/alpha/mm/init.c
+++ b/arch/alpha/mm/init.c
@@ -45,13 +45,10 @@ pgd_alloc(struct mm_struct *mm)
 	ret = __pgd_alloc(mm, 0);
 	init = pgd_offset(&init_mm, 0UL);
 	if (ret) {
-#ifdef CONFIG_ALPHA_LARGE_VMALLOC
-		memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
-			(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
-#else
-		pgd_val(ret[PTRS_PER_PGD-2]) = pgd_val(init[PTRS_PER_PGD-2]);
-#endif
-
+		for (int i = 0; i < VMALLOC_SLOTS; i++) {
+			pgd_val(ret[PTRS_PER_PGD - VMALLOC_SLOTS - 1 + i]) =
+			pgd_val(init[PTRS_PER_PGD - VMALLOC_SLOTS - 1 + i]);
+		}
 		/* The last PGD entry is the VPTB self-map.  */
 		pgd_val(ret[PTRS_PER_PGD-1])
 		  = pte_val(mk_pte(virt_to_page(ret), PAGE_KERNEL));
@@ -148,9 +145,10 @@ callback_init(void * kernel_end)
 	   On systems with larger consoles, additional pages will be
 	   allocated as needed during the mapping process.
 
-	   In the case of not SRM, but not CONFIG_ALPHA_LARGE_VMALLOC,
-	   we need to allocate the PGD we use for vmalloc before we start
-	   forking other tasks.  */
+	   In any case we need to allocate a PGD we use for vmalloc
+	   before we start forking other tasks.  If vmalloc wants more
+	   than one PGD slot, allocate the rest later (at mem_init() -
+	   it's still early enough).  */
 
 	two_pages = (void *)
 	  (((unsigned long)kernel_end + ~PAGE_MASK) & PAGE_MASK);
@@ -246,6 +244,22 @@ srm_paging_stop (void)
 }
 #endif
 
+void __init mem_init(void)
+{
+	// first slot already filled by callback_init()
+	unsigned long addr = VMALLOC_START + PGDIR_SIZE;
+
+	while (addr < VMALLOC_END) {
+		pgd_t *pgd = pgd_offset_k(addr);
+		p4d_t *p4d = p4d_offset(pgd, addr);
+		pud_t *pud = pud_offset(p4d, addr);
+		pmd_t *pmd = pmd_alloc(&init_mm, pud, addr);
+		if (!pmd)
+			panic("can't preallocate tables for vmalloc");
+		addr += PGDIR_SIZE;
+	}
+}
+
 static const pgprot_t protection_map[16] = {
 	[VM_NONE]					= _PAGE_P(_PAGE_FOE | _PAGE_FOW |
 								  _PAGE_FOR),


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-30  3:01       ` [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context) Al Viro
@ 2025-11-30 11:32         ` david laight
  2025-11-30 16:43           ` Al Viro
  2025-11-30 22:16         ` Linus Torvalds
  1 sibling, 1 reply; 59+ messages in thread
From: david laight @ 2025-11-30 11:32 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, akpm, linux-kernel, linux-mm, linux-alpha

On Sun, 30 Nov 2025 03:01:46 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Sat, Nov 29, 2025 at 03:37:28AM +0000, Al Viro wrote:
> 
> > AFAICS, 32bit arm is similar to 32bit x86 in that respect; propagation
> > is lazier, though - there arch_sync_kernel_mappings() bumps a counter
> > in init_mm and context switches use that to check if propagation needs
> > to be done.  No idea how well does that work on vfree() side of things -
> > hadn't looked into that rabbit hole...  
> 
> BTW, speaking of vmalloc space - does anybody object against sorting
> CONFIG_ALPHA_LARGE_VMALLOC out, so that we wouldn't need to mess
> with that in alpha page fault handler?
> 
> Basically, do what amd64 does - something along the lines of (untested)
> patch below.  Comments?

How difficult would it be to allocate the pte for the next 8GB on demand
inside vmalloc(), and then propagate it to the per-task page tables.
That is a path than can sleep, so being slow if it needs to synchronise
with other cpu shouldn't matter - especially since it won't happen often.

That should be moderately generic code and would let the vmalloc limit
be 'soft'; perhaps based on physical memory size, and even be raisable
from a sysctl.

Likely more use for very large x86-64 and arm-64 systems than alpha.

	David


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-30 11:32         ` david laight
@ 2025-11-30 16:43           ` Al Viro
  2025-11-30 18:14             ` Magnus Lindholm
  2025-11-30 19:03             ` david laight
  0 siblings, 2 replies; 59+ messages in thread
From: Al Viro @ 2025-11-30 16:43 UTC (permalink / raw)
  To: david laight; +Cc: Linus Torvalds, akpm, linux-kernel, linux-mm, linux-alpha

On Sun, Nov 30, 2025 at 11:32:13AM +0000, david laight wrote:

> How difficult would it be to allocate the pte for the next 8GB on demand
> inside vmalloc(), and then propagate it to the per-task page tables.
> That is a path than can sleep, so being slow if it needs to synchronise
> with other cpu shouldn't matter - especially since it won't happen often.
> 
> That should be moderately generic code and would let the vmalloc limit
> be 'soft'; perhaps based on physical memory size, and even be raisable
> from a sysctl.

Considerable headache and pretty pointless, at that.  Note that >8G vmalloc
space on alpha had been racy all along (and known to be that); it was
basically "could we squeeze more out of khttpd" kind of fun.

Do we have realistic vmalloc-crazy loads with high fragmentation of vmalloc
space and total footprint worth bothering with that?


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-30 16:43           ` Al Viro
@ 2025-11-30 18:14             ` Magnus Lindholm
  2025-11-30 19:03             ` david laight
  1 sibling, 0 replies; 59+ messages in thread
From: Magnus Lindholm @ 2025-11-30 18:14 UTC (permalink / raw)
  To: Al Viro
  Cc: david laight, Linus Torvalds, akpm, linux-kernel, linux-mm, linux-alpha

On Sun, Nov 30, 2025 at 5:43 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Nov 30, 2025 at 11:32:13AM +0000, david laight wrote:
>
> > How difficult would it be to allocate the pte for the next 8GB on demand
> > inside vmalloc(), and then propagate it to the per-task page tables.
> > That is a path than can sleep, so being slow if it needs to synchronise
> > with other cpu shouldn't matter - especially since it won't happen often.
> >
> > That should be moderately generic code and would let the vmalloc limit
> > be 'soft'; perhaps based on physical memory size, and even be raisable
> > from a sysctl.
>
> Considerable headache and pretty pointless, at that.  Note that >8G vmalloc
> space on alpha had been racy all along (and known to be that); it was
> basically "could we squeeze more out of khttpd" kind of fun.
>
> Do we have realistic vmalloc-crazy loads with high fragmentation of vmalloc
> space and total footprint worth bothering with that?

Hi everyone,

In my opinion, for Alpha I’d prefer the static preallocation model, as
it fixes the LARGE_VMALLOC race cleanly and keeps the fault path
straightforward. I don’t see many realistic Alpha workloads that would
benefit from a more complex or dynamic vmalloc setup, and compile-time
adjustment seems sufficient. Al’s version solves the issues without
adding new moving parts, which feels like the right tradeoff. Removing
code that has never worked properly should not cause any harm.

FWIW, I applied Al's patch and I'm running it now on my XP1000.
Seems to work as-is.

Regards

Magnus


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-30 16:43           ` Al Viro
  2025-11-30 18:14             ` Magnus Lindholm
@ 2025-11-30 19:03             ` david laight
  2025-11-30 20:31               ` Al Viro
  1 sibling, 1 reply; 59+ messages in thread
From: david laight @ 2025-11-30 19:03 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, akpm, linux-kernel, linux-mm, linux-alpha

On Sun, 30 Nov 2025 16:43:48 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Sun, Nov 30, 2025 at 11:32:13AM +0000, david laight wrote:
> 
> > How difficult would it be to allocate the pte for the next 8GB on demand
> > inside vmalloc(), and then propagate it to the per-task page tables.
> > That is a path than can sleep, so being slow if it needs to synchronise
> > with other cpu shouldn't matter - especially since it won't happen often.
> > 
> > That should be moderately generic code and would let the vmalloc limit
> > be 'soft'; perhaps based on physical memory size, and even be raisable
> > from a sysctl.  
> 
> Considerable headache and pretty pointless, at that.  Note that >8G vmalloc
> space on alpha had been racy all along (and known to be that); it was
> basically "could we squeeze more out of khttpd" kind of fun.
> 
> Do we have realistic vmalloc-crazy loads with high fragmentation of vmalloc
> space and total footprint worth bothering with that?
> 

I doubt it matters for alpha - I suspect you could just nuke ALPHA_LARGE_VMALLOC.
At a guess it was written way back when the biggest/fastest systems you could
get were alpha.

I was more thinking about about modern 64 bits systems where you might want
to run a distro kernel on systems with relatively small amounts of RAM and
others with 100s of cpu and multi TB of RAM.
I can well image workloads for the latter that might run out of vmalloc space.
In some situations even getting a command line parameter in can be hard,
so you might want it to be a systcl - even if changing that is what does the
update.
(Doing the updates in the page fault handler definitely sounds like a recipe
for disaster.)

Note that I've not looked at where amd64 gets the limit for mem_init().
Maybe it tries to 'guess' the correct value for the system.
But it is likely to be workload related - so allocating 8K for every 8G
of physical memory (one option) may be wasteful.

	David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-30 19:03             ` david laight
@ 2025-11-30 20:31               ` Al Viro
  2025-11-30 20:32                 ` Al Viro
  0 siblings, 1 reply; 59+ messages in thread
From: Al Viro @ 2025-11-30 20:31 UTC (permalink / raw)
  To: david laight; +Cc: Linus Torvalds, akpm, linux-kernel, linux-mm, linux-alpha

On Sun, Nov 30, 2025 at 07:03:31PM +0000, david laight wrote:

> (Doing the updates in the page fault handler definitely sounds like a recipe
> for disaster.)

See the comments in arch/x86/mm/fault.c regarding the need to do it in
#PF on i386.  Basically, you have

CPU1: does vmalloc(), needs to grab a new slot in top-level table.
Updates the page table tree for init_mm (rooted at swapper_pg_dir).
Starts propagating that change to other threads.

CPU2: does vmalloc(), which grabs another address in the range
covered by the same slot.  It works with the same page table tree,
so it sees that slot already occupied, inserts a new page reference
into the page table hanging off it.  No top-level changes to
propagate, so it returns the address it has grabbed to the caller.

CPU2: caller of vmalloc() dereferences the returned object.
If propagation from CPU1 has not reached the top-level page table
CPU2 is using, the top-level slot is empty and MMU of CPU2 raises
#PF.

BTW, it might be possible for two parallel allocations to grab two
areas, each requiring grabbing its own top-level slot ;-/
It certainly can happen on x86 - two vmalloc(SZ_4M) in parallel
will do just that, but with NUMA you can get it 64bit boxen
as well.

So simple generation count on root page table won't solve it either...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-30 20:31               ` Al Viro
@ 2025-11-30 20:32                 ` Al Viro
  0 siblings, 0 replies; 59+ messages in thread
From: Al Viro @ 2025-11-30 20:32 UTC (permalink / raw)
  To: david laight; +Cc: Linus Torvalds, akpm, linux-kernel, linux-mm, linux-alpha

On Sun, Nov 30, 2025 at 08:31:18PM +0000, Al Viro wrote:
> It certainly can happen on x86 - two vmalloc(SZ_4M) in parallel

32bit x86, that is.

> will do just that, but with NUMA you can get it 64bit boxen
> as well.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-30  3:01       ` [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context) Al Viro
  2025-11-30 11:32         ` david laight
@ 2025-11-30 22:16         ` Linus Torvalds
  2025-11-30 23:37           ` Al Viro
  1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-11-30 22:16 UTC (permalink / raw)
  To: Al Viro; +Cc: akpm, linux-kernel, linux-mm, linux-alpha

On Sat, 29 Nov 2025 at 19:01, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> +         Default is 8Gb total and under normal circumstances, this is so
> +         far and above what is needed as to be laughable.  However, there are
> +         certain applications (such as benchmark-grade in-kernel web serving)
> +         that can make use of as much vmalloc space as is available.

I wonder if we even need the config variable?

Because this reads like the whole feature exists due to the old 'tux'
web server thing (from the early 2000's - long long gone, never merged
upstream).

So I'm not sure there are any actual real use-cases for tons of
vmalloc space on alpha.

Anyway, I see no real objections to the patch, only a "maybe it could
be cut down even more".

                Linus


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)
  2025-11-30 22:16         ` Linus Torvalds
@ 2025-11-30 23:37           ` Al Viro
  0 siblings, 0 replies; 59+ messages in thread
From: Al Viro @ 2025-11-30 23:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: akpm, linux-kernel, linux-mm, x86

On Sun, Nov 30, 2025 at 02:16:13PM -0800, Linus Torvalds wrote:
> On Sat, 29 Nov 2025 at 19:01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > +         Default is 8Gb total and under normal circumstances, this is so
> > +         far and above what is needed as to be laughable.  However, there are
> > +         certain applications (such as benchmark-grade in-kernel web serving)
> > +         that can make use of as much vmalloc space as is available.
> 
> I wonder if we even need the config variable?
> 
> Because this reads like the whole feature exists due to the old 'tux'
> web server thing (from the early 2000's - long long gone, never merged
> upstream).
> 
> So I'm not sure there are any actual real use-cases for tons of
> vmalloc space on alpha.
> 
> Anyway, I see no real objections to the patch, only a "maybe it could
> be cut down even more".

FWIW, I'm trying to figure out what's going on with amd64 in that area;
we used to do allocate-on-demand until 2020, when Joerg went for "let's
preallocate them" and killed arch_sync_kernel_mappings(), which got
reverted soon after, only to be brought back when Joerg had fixed the
bug in preallocation.  It stayed that way until this August, when
commit 6659d027998083fbb6d42a165b0c90dc2e8ba989
Author: Harry Yoo <harry.yoo@oracle.com>
Date:   Mon Aug 18 11:02:06 2025 +0900

     x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()

happened, with reference to this
commit 8d400913c231bd1da74067255816453f96cd35b0
Author: Oscar Salvador <osalvador@suse.de>
Date:   Thu Apr 29 22:57:19 2021 -0700

     x86/vmemmap: handle unpopulated sub-pmd ranges

What I don't understand is how does that manage to avoid the same race -
on #PF amd64 does not bother with vmalloc_fault logics.  Exact same
scenario with two vmalloc() on different CPUs would seem to apply here
as well...

Which callers of arch_sync_kernel_mappings() are involved?  If it's
anything in mm/vmalloc.c, I really don't see how that could be correct;
if it's about apply_to_page_range() and calls never hit vmalloc space,
we might be OK, but it would be nice to have described somewhere...

Am I missing something obvious here?  

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-29  3:37     ` Al Viro
  2025-11-30  3:01       ` [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context) Al Viro
@ 2025-12-01  2:03       ` Zizhi Wo
  1 sibling, 0 replies; 59+ messages in thread
From: Zizhi Wo @ 2025-12-01  2:03 UTC (permalink / raw)
  To: Al Viro, Zizhi Wo
  Cc: torvalds, jack, brauner, hch, akpm, linux, linux-fsdevel,
	linux-kernel, linux-mm, linux-arm-kernel, yangerkun,
	wangkefeng.wang, pangliyuan1, xieyuanbin1



在 2025/11/29 11:37, Al Viro 写道:
> On Thu, Nov 27, 2025 at 10:24:19AM +0800, Zizhi Wo wrote:
> 
>> Why does x86 have special handling in do_kern_addr_fault(), including
>> logic for vmalloc faults? For example, on CONFIG_X86_32, it still takes
>> the vmalloc_fault path. As noted in the x86 comments, "We can fault-in
>> kernel-space virtual memory on-demand"...
>>
>> But on arm64, I don’t see similar logic — is there a specific reason
>> for this difference? Maybe x86's vmalloc area is mapped lazily, while
>> ARM maps it fully during early boot?
> 
> x86 MMU uses the same register for kernel and userland top-level page
> tables; arm64 MMU has separate page tables for those - TTBR0 and TTBR1
> point to the table to be used for translation, depending upon the bit
> 55 of virtual address.
> 
> vmalloc works with page table of init_mm (see pgd_offset_k() uses in
> there).  On arm64 that's it - TTBR1 is set to that and it stays that way,
> so access to vmalloc'ed area will do the right thing.
> 
> On 32bit x86 you need to propagate the change into top-level page tables
> of every thread.  That's what arch_sync_kernel_mappings() is for; look for
> the calls in mm/vmalloc.c and see the discussion of race in the comment in
> front of x86 vmalloc_fault().  Nothing of that sort is needed of arm64,
> since all threads are using the same page table for kernel part of the
> address space.
> 
> The reason why 64bit x86 doesn't need to bother is different - there we
> fill all relevant top-level page table slots in preallocate_vmalloc_pages()
> before any additional threads could be created.  The pointers in those
> slots are not going to change and they will be propagated to all subsequent
> threads by pgd_alloc(), so the page tables actually modified by vmalloc()
> are shared by all threads.

Thank you very much for your answer. This well explains my confusion!

Thanks,
Zizhi Wo

> 
> AFAICS, 32bit arm is similar to 32bit x86 in that respect; propagation
> is lazier, though - there arch_sync_kernel_mappings() bumps a counter
> in init_mm and context switches use that to check if propagation needs
> to be done.  No idea how well does that work on vfree() side of things -
> hadn't looked into that rabbit hole...
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-29  1:35           ` Linus Torvalds
  2025-11-29  4:08             ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
  2025-11-29  8:54             ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Al Viro
@ 2025-12-01  2:08             ` Zizhi Wo
  2 siblings, 0 replies; 59+ messages in thread
From: Zizhi Wo @ 2025-12-01  2:08 UTC (permalink / raw)
  To: Linus Torvalds, Zizhi Wo
  Cc: Russell King (Oracle),
	Catalin Marinas, Will Deacon, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1, Al Viro



在 2025/11/29 9:35, Linus Torvalds 写道:
> On Fri, 28 Nov 2025 at 17:01, Zizhi Wo <wozizhi@huaweicloud.com> wrote:
>>
>> Thank you for your answer. In fact, this solution is similar to the one
>> provided by Al.
> 
> Hmm. I'm not seeing the replies from Al for some reason. Maybe he didn't cc me.
> 
>> It has an additional check to determine reg:
>>
>> if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
>>          goto no_context;
>>
>> I'd like to ask if this "regs" examination also needs to be brought
>> along?
> 
> That seems unnecessary.
> 
> Yes, in this case the original problem you reported with sleeping in
> an RCU region was triggered by a kernel access, and a user-space
> access would never have caused any such issues.
> 
> So checking for !user_mode(regs) isn't exactly *wrong*.
> 
> But while it isn't wrong, I think it's also kind of pointless.
> 
> Because regardless of whether it's a kernel or user space access, an
> access outside TASK_SIZE shouldn't be associated with a valid user
> space context, so the code might as well just go to the "no_context"
> label directly.
> 
> That said, somebody should  definitely double-check me - because I
> think arm also did the vdso trick at high addresses that i386 used to
> do, so there is the fake VDSO thing up there.
> 
> But if you get a page fault on that, it's not going to be fixed up, so
> even if user space can access it, there's no point in looking that
> fake vm area up for page faults.
> 
> I think.
> 
>> I'm even thinking if we directly have the corresponding processing
>> replaced by do_translation_fault(), is that also correct?
>>
>> ```
>> -       { do_page_fault,        SIGSEGV, SEGV_MAPERR,   "page
>> translation fault"           },
>> +       { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "page
>> translation fault"           },
> 
> I think that might break kprobes.
> 
> Looking around, I think my patch might also be a bit broken: I think
> it might be better to move it further down to below the check for
> FSR_LNX_PF.
> 
> But somebody who knows the exact arm page fault handling better than
> me should verify both that and my VDSO gate page thinking.
> 
>             Linus
> 

Thank you for your reply! Regarding the existing discussions in the
community, I will re-examine the logic in this regard and digest it.

Thanks,
Zizhi Wo



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-29  3:55             ` Al Viro
@ 2025-12-01  2:38               ` Zizhi Wo
  0 siblings, 0 replies; 59+ messages in thread
From: Zizhi Wo @ 2025-12-01  2:38 UTC (permalink / raw)
  To: Al Viro, Zizhi Wo
  Cc: Will Deacon, Linus Torvalds, jack, brauner, hch, akpm, linux,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1



在 2025/11/29 11:55, Al Viro 写道:
> On Sat, Nov 29, 2025 at 09:02:27AM +0800, Zizhi Wo wrote:
> 
>> Thank you very much for the answer. For the vmalloc area, I checked the
>> call points on the vfs side, such as dentry_string_cmp() or hash_name().
>> Their "names addr" are all assigned by kmalloc(), so there should be no
>> corresponding issues. But I'm not familiar with the other calling
>> points...
> 
> Pathname might be a symlink body, sitting in page cache or whatever
> ->get_link() has returned...
> 


Thanks for the additional explanation — I indeed hadn't considered
symlinks. But if the data is in the page cache, as I understand it, its
address wouldn't be in the vmalloc area, right? However, for other
.get_link implementations, it's true that there's no guarantee.

Thanks,
Zizhi Wo



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-28 17:06       ` Linus Torvalds
  2025-11-29  1:01         ` Zizhi Wo
  2025-11-29  2:18         ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
@ 2025-12-01 13:28         ` Will Deacon
  2025-12-02 12:43         ` Russell King (Oracle)
  3 siblings, 0 replies; 59+ messages in thread
From: Will Deacon @ 2025-12-01 13:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Russell King (Oracle),
	Zizhi Wo, Catalin Marinas, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

On Fri, Nov 28, 2025 at 09:06:50AM -0800, Linus Torvalds wrote:
> On Thu, 27 Nov 2025 at 02:58, Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > Ha!
> >
> > As said elsewhere, it looks like 32-bit ARM has been missing updates to
> > the fault handler since pre-git history - this was modelled in the dim
> > and distant i386 handling, and it just hasn't kept up.
> 
> I actually have this dim memory of having seen something along these
> lines before, and I just had never realized how it could happen,
> because that call to do_page_fault() in do_translation_fault()
> visually *looks* like the only call-site, and so that
> 
>         if (addr < TASK_SIZE)
>                 return do_page_fault(addr, fsr, regs);
> 
> looks like it does everything correctly. That "do_page_fault()"
> function is static to the arch/arm/mm/fault.c file, and that's the
> only place that appears to call it.
> 
> The operative word being "appears".
> 
> Becuse I had never before realized that that fault.c then also does that
> 
>   #include "fsr-2level.c"
> 
> and then that do_page_fault() function is exposed through those
> fsr_info[] operation arrays.
> 
> Anyway, I don't think that the ARM fault handling is all *that* bad.
> Sure, it might be worth double-checking, but it *has* been converted
> to the generic accounting helpers a few years ago and to the stack
> growing fixes.
> 
> I think the fix here may be as simple as this trivial patch:
> 
>   diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
>   index 2bc828a1940c..27024ec2d46d 100644
>   --- a/arch/arm/mm/fault.c
>   +++ b/arch/arm/mm/fault.c
>   @@ -277,6 +277,10 @@ do_page_fault(unsigned long addr, ...
>         if (interrupts_enabled(regs))
>                 local_irq_enable();
> 
>   +     /* non-user address faults never have context */
>   +     if (addr >= TASK_SIZE)
>   +             goto no_context;
>   +
>         /*
>          * If we're in an interrupt or have no user
>          * context, we must not take the fault..
> 
> but I really haven't thought much about it.

In the hack I posted [1], I deliberately avoided modifying
do_page_fault() as it's used on the permission fault path. With your
change above, I'm worried that userspace could simply try to access a
kernel address and that would lead to a panic.

Will

[1] https://lore.kernel.org/all/aShLKpTBr9akSuUG@willie-the-truck/


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-11-28 17:06       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2025-12-01 13:28         ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Will Deacon
@ 2025-12-02 12:43         ` Russell King (Oracle)
  2025-12-02 13:02           ` Xie Yuanbin
  2025-12-02 22:07           ` Linus Torvalds
  3 siblings, 2 replies; 59+ messages in thread
From: Russell King (Oracle) @ 2025-12-02 12:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zizhi Wo, Catalin Marinas, Will Deacon, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

On Fri, Nov 28, 2025 at 09:06:50AM -0800, Linus Torvalds wrote:
> I don't think it's necessarily all that big of a deal. Yeah, this is
> old code, and yeah, it could probably be cleaned up a bit, but at the
> same time, "old and crusty" also means "fairly well tested". This
> whole fault on a kernel address is a fairly unusual case, and as
> mentioned, I *think* the above fix is sufficient.

We have another issue in the code - which has the branch predictor
hardening for spectre issues, which can be called with interrupts
enabled, causing a kernel warning - obviously not good.

There's another issue which PREEMPT_RT has picked up on - which is
that delivering signals via __do_user_fault() with interrupts disabled
causes spinlocks (which can sleep on PREEMPT_RT) to warn.

What I'm thinking is to address both of these by handling kernel space
page faults (which will be permission or PTE-not-present) separately
(not even build tested):

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2bc828a1940c..972bce697c6c 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -175,7 +175,8 @@ __do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
 
 /*
  * Something tried to access memory that isn't in our memory map..
- * User mode accesses just cause a SIGSEGV
+ * User mode accesses just cause a SIGSEGV. Ensure interrupts are enabled
+ * here, which is safe as the fault being handled is from userspace.
  */
 static void
 __do_user_fault(unsigned long addr, unsigned int fsr, unsigned int sig,
@@ -183,8 +184,7 @@ __do_user_fault(unsigned long addr, unsigned int fsr, unsigned int sig,
 {
 	struct task_struct *tsk = current;
 
-	if (addr > TASK_SIZE)
-		harden_branch_predictor();
+	local_irq_enable();
 
 #ifdef CONFIG_DEBUG_USER
 	if (((user_debug & UDBG_SEGV) && (sig == SIGSEGV)) ||
@@ -259,6 +259,38 @@ static inline bool ttbr0_usermode_access_allowed(struct pt_regs *regs)
 }
 #endif
 
+static int __kprobes
+do_kernel_address_page_fault(unsigned long addr, unsigned int fsr,
+			     struct pt_regs *regs)
+{
+	if (user_mode(regs)) {
+		/*
+		 * Fault from user mode for a kernel space address. User mode
+		 * should not be faulting in kernel space, which includes the
+		 * vector/khelper page. Handle the Spectre issues while
+		 * interrupts are still disabled, then send a SIGSEGV. Note
+		 * that __do_user_fault() will enable interrupts.
+		 */
+		harden_branch_predictor();
+		__do_user_fault(addr, fsr, SIGSEGV, SEGV_MAPERR, regs);
+	} else {
+		/*
+		 * Fault from kernel mode. Enable interrupts if they were
+		 * enabled in the parent context. Section (upper page table)
+		 * translation faults are handled via do_translation_fault(),
+		 * so we will only get here for a non-present kernel space
+		 * PTE or kernel space permission fault. Both of these should
+		 * not happen.
+		 */
+		if (interrupts_enabled(regs))
+			local_irq_enable();
+
+		__do_kernel_fault(mm, addr, fsr, regs);
+	}
+
+	return 0;
+}
+
 static int __kprobes
 do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 {
@@ -272,6 +304,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (kprobe_page_fault(regs, fsr))
 		return 0;
 
+	if (addr >= TASK_SIZE)
+		return do_kernel_address_page_fault(addr, fsr, regs);
 
 	/* Enable interrupts if they were enabled in the parent context. */
 	if (interrupts_enabled(regs))

... and I think there was a bug in the branch predictor handling -
addr == TASK_SIZE should have been included.

Does this look sensible?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-12-02 12:43         ` Russell King (Oracle)
@ 2025-12-02 13:02           ` Xie Yuanbin
  2025-12-02 22:07           ` Linus Torvalds
  1 sibling, 0 replies; 59+ messages in thread
From: Xie Yuanbin @ 2025-12-02 13:02 UTC (permalink / raw)
  To: linux
  Cc: akpm, brauner, catalin.marinas, hch, jack, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, pangliyuan1, torvalds,
	wangkefeng.wang, will, wozizhi, xieyuanbin1, yangerkun

On Tue, 2 Dec 2025 12:43:32 +0000, Russell King (Oracle) wrote:
> We have another issue in the code - which has the branch predictor
> hardening for spectre issues, which can be called with interrupts
> enabled, causing a kernel warning - obviously not good.
>
> There's another issue which PREEMPT_RT has picked up on - which is
> that delivering signals via __do_user_fault() with interrupts disabled
> causes spinlocks (which can sleep on PREEMPT_RT) to warn.
>
> What I'm thinking is to address both of these by handling kernel space
> page faults (which will be permission or PTE-not-present) separately
> (not even build tested):
>
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 2bc828a1940c..972bce697c6c 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -175,7 +175,8 @@ __do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
>
>  /*
>   * Something tried to access memory that isn't in our memory map..
> - * User mode accesses just cause a SIGSEGV
> + * User mode accesses just cause a SIGSEGV. Ensure interrupts are enabled
> + * here, which is safe as the fault being handled is from userspace.
>   */
>  static void
>  __do_user_fault(unsigned long addr, unsigned int fsr, unsigned int sig,
> @@ -183,8 +184,7 @@ __do_user_fault(unsigned long addr, unsigned int fsr, unsigned int sig,
>  {
>  	struct task_struct *tsk = current;
> 
> -	if (addr > TASK_SIZE)
> -		harden_branch_predictor();
> +	local_irq_enable();
>
>  #ifdef CONFIG_DEBUG_USER
>  	if (((user_debug & UDBG_SEGV) && (sig == SIGSEGV)) ||
> @@ -259,6 +259,38 @@ static inline bool ttbr0_usermode_access_allowed(struct pt_regs *regs)
>  }
>  #endif
>
> +static int __kprobes
> +do_kernel_address_page_fault(unsigned long addr, unsigned int fsr,
> +			     struct pt_regs *regs)
> +{
> +	if (user_mode(regs)) {
> +		/*
> +		 * Fault from user mode for a kernel space address. User mode
> +		 * should not be faulting in kernel space, which includes the
> +		 * vector/khelper page. Handle the Spectre issues while
> +		 * interrupts are still disabled, then send a SIGSEGV. Note
> +		 * that __do_user_fault() will enable interrupts.
> +		 */
> +		harden_branch_predictor();
> +		__do_user_fault(addr, fsr, SIGSEGV, SEGV_MAPERR, regs);
> +	} else {
> +		/*
> +		 * Fault from kernel mode. Enable interrupts if they were
> +		 * enabled in the parent context. Section (upper page table)
> +		 * translation faults are handled via do_translation_fault(),
> +		 * so we will only get here for a non-present kernel space
> +		 * PTE or kernel space permission fault. Both of these should
> +		 * not happen.
> +		 */
> +		if (interrupts_enabled(regs))
> +			local_irq_enable();
> +
> +		__do_kernel_fault(mm, addr, fsr, regs);
> +	}
> +
> +	return 0;
> +}
> +
>  static int __kprobes
>  do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  {
> @@ -272,6 +304,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  	if (kprobe_page_fault(regs, fsr))
>  		return 0;
>
> +	if (addr >= TASK_SIZE)
> +		return do_kernel_address_page_fault(addr, fsr, regs);
>
>  	/* Enable interrupts if they were enabled in the parent context. */
>  	if (interrupts_enabled(regs))
>
> ... and I think there was a bug in the branch predictor handling -
> addr == TASK_SIZE should have been included.
>
> Does this look sensible?

Hi, Russell King!

This patch removes
```c
	if (addr > TASK_SIZE)
		harden_branch_predictor();
```
from do_user_fault(), and adds it to do_page_fault()->
do_kernel_address_page_fault().

However, do_user_fault() is not only called by do_page_fault(). It is
also called by do_bad_area(), do_sect_fault() and do_translation_fault().
I am not sure that if this will lead to some missing
harden_branch_predictor() mitigation.

What about something like this:
```patch
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2bc828a1940c..5c58072d8235 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -270,10 +270,15 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	vm_flags_t vm_flags = VM_ACCESS_FLAGS;

 	if (kprobe_page_fault(regs, fsr))
 		return 0;

+	if (unlikely(addr >= TASK_SIZE)) {
+		fault = 0;
+		code = SEGV_MAPERR;
+		goto bad_area;
+	}

 	/* Enable interrupts if they were enabled in the parent context. */
 	if (interrupts_enabled(regs))
 		local_irq_enable();
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 5c58072d8235..f8ee1854c854 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -184,10 +184,13 @@ __do_user_fault(unsigned long addr, unsigned int fsr, unsigned int sig,
 	struct task_struct *tsk = current;
 
 	if (addr > TASK_SIZE)
 		harden_branch_predictor();
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
+
 #ifdef CONFIG_DEBUG_USER
 	if (((user_debug & UDBG_SEGV) && (sig == SIGSEGV)) ||
 	    ((user_debug & UDBG_BUS)  && (sig == SIGBUS))) {
 		pr_err("8<--- cut here ---\n");
 		pr_err("%s: unhandled page fault (%d) at 0x%08lx, code 0x%03x\n",
```

Thanks very much!

Xie Yuanbin


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-12-02 12:43         ` Russell King (Oracle)
  2025-12-02 13:02           ` Xie Yuanbin
@ 2025-12-02 22:07           ` Linus Torvalds
  2025-12-03  1:48             ` Xie Yuanbin
  1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-12-02 22:07 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Zizhi Wo, Catalin Marinas, Will Deacon, jack, brauner, hch, akpm,
	linux-fsdevel, linux-kernel, linux-mm, linux-arm-kernel,
	yangerkun, wangkefeng.wang, pangliyuan1, xieyuanbin1

On Tue, 2 Dec 2025 at 04:43, Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> What I'm thinking is to address both of these by handling kernel space
> page faults (which will be permission or PTE-not-present) separately
> (not even build tested):

That patch looks sane to me.

But I also didn't build test it, just scanned it visually ;)

            Linus


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-12-02 22:07           ` Linus Torvalds
@ 2025-12-03  1:48             ` Xie Yuanbin
  2025-12-05 12:08               ` Russell King (Oracle)
  0 siblings, 1 reply; 59+ messages in thread
From: Xie Yuanbin @ 2025-12-03  1:48 UTC (permalink / raw)
  To: torvalds, linux
  Cc: akpm, brauner, catalin.marinas, hch, jack, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, pangliyuan1,
	wangkefeng.wang, will, wozizhi, xieyuanbin1, yangerkun

On Tue, 2 Dec 2025 14:07:25 -0800, Linus Torvalds wrote:
> On Tue, 2 Dec 2025 at 04:43, Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
>>
>> What I'm thinking is to address both of these by handling kernel space
>> page faults (which will be permission or PTE-not-present) separately
>> (not even build tested):
>
> That patch looks sane to me.
>
> But I also didn't build test it, just scanned it visually ;)

That patch removes harden_branch_predictor() from __do_user_fault(), and
moves it to do_page_fault()->do_kernel_address_page_fault(). 
This resolves previously mentioned kernel warning issue. However,
__do_user_fault() is not only called by do_page_fault(), it is
alse called by do_bad_area(), do_sect_fault() and do_translation_fault().

So I think that some harden_branch_predictor() is missing on other paths.
According to my tests, when CONFIG_ARM_LPAE=n, harden_branch_predictor()
will never be called anymore, even if a user program trys to access the
kernel address.

Or perhaps I've misunderstood something, could you please point it out?
Thank you very much.

What about something like this (The patch has been tested):
```patch
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2bc828a1940c..af86198631c5 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -183,9 +183,11 @@ __do_user_fault(unsigned long addr, unsigned int fsr, unsigned int sig,
 {
 	struct task_struct *tsk = current;

-	if (addr > TASK_SIZE)
+	if (addr >= TASK_SIZE)
 		harden_branch_predictor();

+	local_irq_enable();
+
 #ifdef CONFIG_DEBUG_USER
 	if (((user_debug & UDBG_SEGV) && (sig == SIGSEGV)) ||
 	    ((user_debug & UDBG_BUS)  && (sig == SIGBUS))) {
@@ -272,6 +274,24 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (kprobe_page_fault(regs, fsr))
 		return 0;

+	if (unlikely(addr >= TASK_SIZE)) {
+		/*
+		 * Fault from user mode for a kernel space address. User mode
+		 * should not be faulting in kernel space, which includes the
+		 * vector/khelper page. Handle the Spectre issues while
+		 * interrupts are still disabled, then send a SIGSEGV. Note
+		 * that __do_user_fault() will enable interrupts.
+		 *
+		 * Fault from kernel mode. Jump to __do_kernel_fault()->
+		 * fixup_exception() directly, without getting mm lock and
+		 * finding vma. The interrupts are not enabled but it will be
+		 * good, just like what do_translation_fault() and
+		 * do_bad_area() does.
+		 */
+		fault = 0;
+		code = SEGV_MAPERR;
+		goto bad_area;
+	}

 	/* Enable interrupts if they were enabled in the parent context. */
 	if (interrupts_enabled(regs))

```

Thanks very much!


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context
  2025-12-03  1:48             ` Xie Yuanbin
@ 2025-12-05 12:08               ` Russell King (Oracle)
  0 siblings, 0 replies; 59+ messages in thread
From: Russell King (Oracle) @ 2025-12-05 12:08 UTC (permalink / raw)
  To: Xie Yuanbin
  Cc: torvalds, akpm, brauner, catalin.marinas, hch, jack,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	pangliyuan1, wangkefeng.wang, will, wozizhi, yangerkun

On Wed, Dec 03, 2025 at 09:48:00AM +0800, Xie Yuanbin wrote:
> On Tue, 2 Dec 2025 14:07:25 -0800, Linus Torvalds wrote:
> > On Tue, 2 Dec 2025 at 04:43, Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> >>
> >> What I'm thinking is to address both of these by handling kernel space
> >> page faults (which will be permission or PTE-not-present) separately
> >> (not even build tested):
> >
> > That patch looks sane to me.
> >
> > But I also didn't build test it, just scanned it visually ;)
> 
> That patch removes harden_branch_predictor() from __do_user_fault(), and
> moves it to do_page_fault()->do_kernel_address_page_fault(). 
> This resolves previously mentioned kernel warning issue. However,
> __do_user_fault() is not only called by do_page_fault(), it is
> alse called by do_bad_area(), do_sect_fault() and do_translation_fault().
> 
> So I think that some harden_branch_predictor() is missing on other paths.
> According to my tests, when CONFIG_ARM_LPAE=n, harden_branch_predictor()
> will never be called anymore, even if a user program trys to access the
> kernel address.
> 
> Or perhaps I've misunderstood something, could you please point it out?
> Thank you very much.

Right, let's split these issues into separate patches. Please test this
patch, which should address only the hash_name() fault issue, and
provides the basis for fixing the branch predictor issue.

Yes, at the moment, do_kernel_address_page_fault() looks very much like
do_bad_area(), but with the addition of the IRQ-enable if the parent
context was enabled, but the following patch to address the branch
predictor hardening will show why its different.

In my opinion, this approach makes the handling for kernel address
page faults (non-present pages and page permission faults) much easier
to understand.

Note that this will call __do_user_fault() with interrupts disabled.

Build tested, and remotely boot tested on Cortex-A5 hardware but
without kfence enabled. Also tested usermode access to kernel space
which fails with SEGV:
- read from 0xc0000000 (section permission fault, do_sect_fault)
- read from 0xffff2000 (page translation fault, do_page_fault)
- read from 0xffff0000 (vectors page - read possible as expected)
- write to 0xffff0000 (page permission fault, do_page_fault)

8<===
From: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Subject: [PATCH] ARM: fix hash_name() fault

Zizhi Wo reports:

"During the execution of hash_name()->load_unaligned_zeropad(), a
 potential memory access beyond the PAGE boundary may occur. For
 example, when the filename length is near the PAGE_SIZE boundary.
 This triggers a page fault, which leads to a call to
 do_page_fault()->mmap_read_trylock(). If we can't acquire the lock,
 we have to fall back to the mmap_read_lock() path, which calls
 might_sleep(). This breaks RCU semantics because path lookup occurs
 under an RCU read-side critical section."

This is seen with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_KFENCE=y.

Kernel addresses (with the exception of the vectors/kuser helper
page) do not have VMAs associated with them. If the vectors/kuser
helper page faults, then there are two possibilities:

1. if the fault happened while in kernel mode, then we're basically
   dead, because the CPU won't be able to vector through this page
   to handle the fault.
2. if the fault happened while in user mode, that means the page was
   protected from user access, and we want to fault anyway.

Thus, we can handle kernel addresses from any context entirely
separately without going anywhere near the mmap lock. This gives us
an entirely non-sleeping path for all kernel mode kernel address
faults.

Reported-by: Zizhi Wo <wozizhi@huaweicloud.com>
Reported-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Link: https://lore.kernel.org/r/20251126090505.3057219-1-wozizhi@huaweicloud.com
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/mm/fault.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 46169fe42c61..2bbec38ced97 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -260,6 +260,35 @@ static inline bool ttbr0_usermode_access_allowed(struct pt_regs *regs)
 }
 #endif
 
+static int __kprobes
+do_kernel_address_page_fault(struct mm_struct *mm, unsigned long addr,
+			     unsigned int fsr, struct pt_regs *regs)
+{
+	if (user_mode(regs)) {
+		/*
+		 * Fault from user mode for a kernel space address. User mode
+		 * should not be faulting in kernel space, which includes the
+		 * vector/khelper page. Send a SIGSEGV.
+		 */
+		__do_user_fault(addr, fsr, SIGSEGV, SEGV_MAPERR, regs);
+	} else {
+		/*
+		 * Fault from kernel mode. Enable interrupts if they were
+		 * enabled in the parent context. Section (upper page table)
+		 * translation faults are handled via do_translation_fault(),
+		 * so we will only get here for a non-present kernel space
+		 * PTE or PTE permission fault. This may happen in exceptional
+		 * circumstances and need the fixup tables to be walked.
+		 */
+		if (interrupts_enabled(regs))
+			local_irq_enable();
+
+		__do_kernel_fault(mm, addr, fsr, regs);
+	}
+
+	return 0;
+}
+
 static int __kprobes
 do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 {
@@ -273,6 +302,12 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (kprobe_page_fault(regs, fsr))
 		return 0;
 
+	/*
+	 * Handle kernel addresses faults separately, which avoids touching
+	 * the mmap lock from contexts that are not able to sleep.
+	 */
+	if (addr >= TASK_SIZE)
+		return do_kernel_address_page_fault(mm, addr, fsr, regs);
 
 	/* Enable interrupts if they were enabled in the parent context. */
 	if (interrupts_enabled(regs))
-- 
2.47.3

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2025-12-05 12:08 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-26  9:05 [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
2025-11-26 10:19 ` [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held Xie Yuanbin
2025-11-26 18:10   ` Al Viro
2025-11-26 18:48     ` Al Viro
2025-11-26 19:05       ` Russell King (Oracle)
2025-11-26 19:26         ` Al Viro
2025-11-26 19:51           ` Russell King (Oracle)
2025-11-26 20:02             ` Al Viro
2025-11-26 22:25               ` david laight
2025-11-26 23:51                 ` Al Viro
2025-11-26 23:31               ` Russell King (Oracle)
2025-11-27  3:03                 ` Xie Yuanbin
2025-11-27  7:20                   ` Sebastian Andrzej Siewior
2025-11-27 11:20                     ` Xie Yuanbin
2025-11-28  1:39           ` Xie Yuanbin
2025-11-26 20:42   ` Al Viro
2025-11-26 10:27 ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
2025-11-26 21:12   ` Linus Torvalds
2025-11-27 10:27     ` Will Deacon
2025-11-27 10:57     ` Russell King (Oracle)
2025-11-28 17:06       ` Linus Torvalds
2025-11-29  1:01         ` Zizhi Wo
2025-11-29  1:35           ` Linus Torvalds
2025-11-29  4:08             ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
2025-11-29  9:08               ` Al Viro
2025-11-29  9:25                 ` Xie Yuanbin
2025-11-29  9:44                   ` Al Viro
2025-11-29 10:05                     ` Xie Yuanbin
2025-11-29 10:45                 ` david laight
2025-11-29  8:54             ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Al Viro
2025-12-01  2:08             ` Zizhi Wo
2025-11-29  2:18         ` [Bug report] hash_name() may cross page boundary and trigger Xie Yuanbin
2025-12-01 13:28         ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Will Deacon
2025-12-02 12:43         ` Russell King (Oracle)
2025-12-02 13:02           ` Xie Yuanbin
2025-12-02 22:07           ` Linus Torvalds
2025-12-03  1:48             ` Xie Yuanbin
2025-12-05 12:08               ` Russell King (Oracle)
2025-11-26 18:55 ` Al Viro
2025-11-27  2:24   ` Zizhi Wo
2025-11-29  3:37     ` Al Viro
2025-11-30  3:01       ` [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context) Al Viro
2025-11-30 11:32         ` david laight
2025-11-30 16:43           ` Al Viro
2025-11-30 18:14             ` Magnus Lindholm
2025-11-30 19:03             ` david laight
2025-11-30 20:31               ` Al Viro
2025-11-30 20:32                 ` Al Viro
2025-11-30 22:16         ` Linus Torvalds
2025-11-30 23:37           ` Al Viro
2025-12-01  2:03       ` [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context Zizhi Wo
2025-11-27 12:59 ` Will Deacon
2025-11-28  1:17   ` Zizhi Wo
2025-11-28  1:18     ` Zizhi Wo
2025-11-28  1:39       ` Zizhi Wo
2025-11-28 12:25         ` Will Deacon
2025-11-29  1:02           ` Zizhi Wo
2025-11-29  3:55             ` Al Viro
2025-12-01  2:38               ` Zizhi Wo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox