linux-mm.kvack.org archive mirror
* [linus:master] [mm]  efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
@ 2024-05-31  8:24 kernel test robot
  2024-05-31 16:50 ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: kernel test robot @ 2024-05-31  8:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Yang Shi,
	Matthew Wilcox, Christopher Lameter, linux-mm, oliver.sang



Hello,

kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:

commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master      e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
[test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]

in testcase: trinity
version: trinity-x86_64-6a17c218-1_20240527
with following parameters:

	runtime: 300s
	group: group-00
	nr_groups: 5



compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


We noticed the issue does not always happen: it occurred in 34 out of 50 runs, as shown below.
The parent commit is clean.

1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :50          68%          34:50    dmesg.Kernel_panic-not_syncing:Fatal_exception
           :50          68%          34:50    dmesg.RIP:try_get_folio
           :50          68%          34:50    dmesg.invalid_opcode:#[##]
           :50          68%          34:50    dmesg.kernel_BUG_at_include/linux/page_ref.h



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202405311534.86cd4043-lkp@intel.com


[  275.267158][ T4335] ------------[ cut here ]------------
[  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
[  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
[  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
[  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) 
[ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
All code
========
   0:	c3                   	ret
   1:	cc                   	int3
   2:	cc                   	int3
   3:	cc                   	int3
   4:	cc                   	int3
   5:	44 89 e6             	mov    %r12d,%esi
   8:	48 89 df             	mov    %rbx,%rdi
   b:	e8 e4 54 11 00       	call   0x1154f4
  10:	eb ae                	jmp    0xffffffffffffffc0
  12:	90                   	nop
  13:	0f 0b                	ud2
  15:	90                   	nop
  16:	31 db                	xor    %ebx,%ebx
  18:	eb d5                	jmp    0xffffffffffffffef
  1a:	9c                   	pushf
  1b:	58                   	pop    %rax
  1c:	0f 1f 40 00          	nopl   0x0(%rax)
  20:	f6 c4 02             	test   $0x2,%ah
  23:	0f 84 46 ff ff ff    	je     0xffffffffffffff6f
  29:	90                   	nop
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	48 c7 c6 a0 54 d2 87 	mov    $0xffffffff87d254a0,%rsi
  33:	48 89 df             	mov    %rbx,%rdi
  36:	e8 a9 e9 ff ff       	call   0xffffffffffffe9e4
  3b:	90                   	nop
  3c:	0f 0b                	ud2
  3e:	be                   	.byte 0xbe
  3f:	04                   	.byte 0x4

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	48 c7 c6 a0 54 d2 87 	mov    $0xffffffff87d254a0,%rsi
   9:	48 89 df             	mov    %rbx,%rdi
   c:	e8 a9 e9 ff ff       	call   0xffffffffffffe9ba
  11:	90                   	nop
  12:	0f 0b                	ud2
  14:	be                   	.byte 0xbe
  15:	04                   	.byte 0x4
[  275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202
[  275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000
[  275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034
[  275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006
[  275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136
[  275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008
[  275.276790][ T4335] FS:  00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000
[  275.277570][ T4335] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0
[  275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  275.280201][ T4335] Call Trace:
[  275.280499][ T4335]  <TASK>
[ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447) 
[ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153) 
[ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) 
[ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) 
[ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174) 
[ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) 
[ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212) 
[ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) 
[ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264) 
[ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
[ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) 
[ 275.285278][ T4335] try_grab_folio (mm/gup.c:148) 
[ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1)) 
[ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188) 
[ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825) 
[ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1)) 
[ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209) 
[ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204) 
[ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722) 
[ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106) 
[ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350) 
[ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350) 
[ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1)) 
[ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573) 
[ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360 
[ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10 
[ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573) 
[ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0 
[ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228) 
[ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10 
[ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284) 
[ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259) 
[ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1)) 
[ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291) 
[ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112) 
[ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1)) 
[ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1)) 
[ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359) 
[ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) 
[ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359) 
[ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) 
[ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) 
[ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) 
[ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) 
[  275.300622][ T4335] RIP: 0033:0x7fa2f9c65719
[ 275.301011][ T4335] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
All code
========
   0:	08 89 e8 5b 5d c3    	or     %cl,-0x3ca2a418(%rcx)
   6:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
   d:	00 00 00 
  10:	90                   	nop
  11:	48 89 f8             	mov    %rdi,%rax
  14:	48 89 f7             	mov    %rsi,%rdi
  17:	48 89 d6             	mov    %rdx,%rsi
  1a:	48 89 ca             	mov    %rcx,%rdx
  1d:	4d 89 c2             	mov    %r8,%r10
  20:	4d 89 c8             	mov    %r9,%r8
  23:	4c 8b 4c 24 08       	mov    0x8(%rsp),%r9
  28:	0f 05                	syscall
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	ret
  33:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06f1
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	ret
   9:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06c7
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240531/202405311534.86cd4043-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31  8:24 [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h kernel test robot
@ 2024-05-31 16:50 ` Yang Shi
       [not found]   ` <890e5a79-8574-4a24-90ab-b9888968d5e5@redhat.com>
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-05-31 16:50 UTC (permalink / raw)
  To: kernel test robot, Peter Xu, David Hildenbrand, Jason Gunthorpe
  Cc: Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Fri, May 31, 2024 at 1:24 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
>
> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linus/master      e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
>
> in testcase: trinity
> version: trinity-x86_64-6a17c218-1_20240527
> with following parameters:
>
>         runtime: 300s
>         group: group-00
>         nr_groups: 5
>
>
>
> compiler: gcc-13
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> the parent is clean.
>
> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> ---------------- ---------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>            :50          68%          34:50    dmesg.Kernel_panic-not_syncing:Fatal_exception
>            :50          68%          34:50    dmesg.RIP:try_get_folio
>            :50          68%          34:50    dmesg.invalid_opcode:#[##]
>            :50          68%          34:50    dmesg.kernel_BUG_at_include/linux/page_ref.h
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202405311534.86cd4043-lkp@intel.com
>
>
> [  275.267158][ T4335] ------------[ cut here ]------------
> [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04

If I read this BUG correctly, it is:

VM_BUG_ON(!in_atomic() && !irqs_disabled());

try_grab_folio() actually assumes it is called in an atomic context
(IRQs disabled or preemption disabled) on this call path. That is
guaranteed either by disabling IRQs in GUP-fast, or by running inside
an RCU critical section in the page cache lookup path.
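
For illustration, the two contexts where the speculative grab is
legitimate look roughly like this (a simplified sketch of my own, not
the literal callers):

/* GUP-fast: the lockless walk runs with IRQs disabled */
static void gup_fast_style_context(void)
{
	unsigned long flags;

	local_irq_save(flags);
	/*
	 * Lockless page table walk + try_grab_folio() happens here;
	 * with IRQs off the pages cannot be freed under us, and the
	 * irqs_disabled() side of the assertion is satisfied.
	 */
	local_irq_restore(flags);
}

/* page cache lookup: the speculative get runs inside RCU */
static void page_cache_style_context(void)
{
	rcu_read_lock();
	/*
	 * xarray lookup + folio_try_get_rcu() happens here; RCU keeps
	 * the folio memory from being freed while the speculative
	 * reference is taken.
	 */
	rcu_read_unlock();
}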

And try_grab_folio() is used when the folio is a large folio. The
bisected commit makes the fuzz test hit PMD-aligned addresses and
large folios more often than before, and process_vm_readv/writev
doesn't handle the large folio case at all. A properly aligned
address, for example one allocated with posix_memalign(), should be
able to trigger this BUG even without the bisected commit.
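
For example, something along these lines should exercise that path
(an untested sketch of mine just to illustrate the scenario, not a
confirmed reproducer):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>

#define SZ (2UL << 20)		/* one PMD-sized (2MB) region */

int main(void)
{
	void *buf;
	char local[4096];
	struct iovec liov, riov;
	pid_t pid;

	if (posix_memalign(&buf, SZ, SZ))
		return 1;
	madvise(buf, SZ, MADV_HUGEPAGE);	/* ask for THP backing */
	memset(buf, 0xaa, SZ);			/* fault it in */

	pid = fork();
	if (pid == 0) {		/* child keeps the mapping alive */
		pause();
		_exit(0);
	}

	liov.iov_base = local;	liov.iov_len = sizeof(local);
	riov.iov_base = buf;	riov.iov_len = sizeof(local);

	/*
	 * Remote read of the (possibly) large-folio-backed buffer:
	 * process_vm_readv -> pin_user_pages_remote -> __get_user_pages
	 */
	printf("process_vm_readv: %zd\n",
	       process_vm_readv(pid, &liov, 1, &riov, 1, 0));

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return 0;
}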

We can't call pin_user_pages_remote() in an RCU critical section
since it may sleep, and I don't think we have a GUP-fast remote
variant either, if I remember correctly. It also doesn't make sense
to disallow large folios for process_vm_readv/writev.

Maybe a new GUP flag, or just use FOLL_LONGTERM, to let GUP call
try_grab_folio() in an RCU critical section? Added more GUP folks to
the loop.


> All code
> ========
>    0:   c3                      ret
>    1:   cc                      int3
>    2:   cc                      int3
>    3:   cc                      int3
>    4:   cc                      int3
>    5:   44 89 e6                mov    %r12d,%esi
>    8:   48 89 df                mov    %rbx,%rdi
>    b:   e8 e4 54 11 00          call   0x1154f4
>   10:   eb ae                   jmp    0xffffffffffffffc0
>   12:   90                      nop
>   13:   0f 0b                   ud2
>   15:   90                      nop
>   16:   31 db                   xor    %ebx,%ebx
>   18:   eb d5                   jmp    0xffffffffffffffef
>   1a:   9c                      pushf
>   1b:   58                      pop    %rax
>   1c:   0f 1f 40 00             nopl   0x0(%rax)
>   20:   f6 c4 02                test   $0x2,%ah
>   23:   0f 84 46 ff ff ff       je     0xffffffffffffff6f
>   29:   90                      nop
>   2a:*  0f 0b                   ud2             <-- trapping instruction
>   2c:   48 c7 c6 a0 54 d2 87    mov    $0xffffffff87d254a0,%rsi
>   33:   48 89 df                mov    %rbx,%rdi
>   36:   e8 a9 e9 ff ff          call   0xffffffffffffe9e4
>   3b:   90                      nop
>   3c:   0f 0b                   ud2
>   3e:   be                      .byte 0xbe
>   3f:   04                      .byte 0x4
>
> Code starting with the faulting instruction
> ===========================================
>    0:   0f 0b                   ud2
>    2:   48 c7 c6 a0 54 d2 87    mov    $0xffffffff87d254a0,%rsi
>    9:   48 89 df                mov    %rbx,%rdi
>    c:   e8 a9 e9 ff ff          call   0xffffffffffffe9ba
>   11:   90                      nop
>   12:   0f 0b                   ud2
>   14:   be                      .byte 0xbe
>   15:   04                      .byte 0x4
> [  275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202
> [  275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000
> [  275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034
> [  275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006
> [  275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136
> [  275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008
> [  275.276790][ T4335] FS:  00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000
> [  275.277570][ T4335] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0
> [  275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  275.280201][ T4335] Call Trace:
> [  275.280499][ T4335]  <TASK>
> [ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
> [ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153)
> [ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174)
> [ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212)
> [ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264)
> [ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
> [ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.285278][ T4335] try_grab_folio (mm/gup.c:148)
> [ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1))
> [ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188)
> [ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825)
> [ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1))
> [ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209)
> [ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204)
> [ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722)
> [ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106)
> [ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350)
> [ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350)
> [ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1))
> [ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573)
> [ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360
> [ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10
> [ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573)
> [ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0
> [ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228)
> [ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10
> [ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284)
> [ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259)
> [ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
> [ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291)
> [ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112)
> [ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1))
> [ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
> [ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
> [ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
> [ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
> [ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
> [ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
> [ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
> [ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
> [  275.300622][ T4335] RIP: 0033:0x7fa2f9c65719
> [ 275.301011][ T4335] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
> All code
> ========
>    0:   08 89 e8 5b 5d c3       or     %cl,-0x3ca2a418(%rcx)
>    6:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>    d:   00 00 00
>   10:   90                      nop
>   11:   48 89 f8                mov    %rdi,%rax
>   14:   48 89 f7                mov    %rsi,%rdi
>   17:   48 89 d6                mov    %rdx,%rsi
>   1a:   48 89 ca                mov    %rcx,%rdx
>   1d:   4d 89 c2                mov    %r8,%r10
>   20:   4d 89 c8                mov    %r9,%r8
>   23:   4c 8b 4c 24 08          mov    0x8(%rsp),%r9
>   28:   0f 05                   syscall
>   2a:*  48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax         <-- trapping instruction
>   30:   73 01                   jae    0x33
>   32:   c3                      ret
>   33:   48 8b 0d b7 06 0d 00    mov    0xd06b7(%rip),%rcx        # 0xd06f1
>   3a:   f7 d8                   neg    %eax
>   3c:   64 89 01                mov    %eax,%fs:(%rcx)
>   3f:   48                      rex.W
>
> Code starting with the faulting instruction
> ===========================================
>    0:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
>    6:   73 01                   jae    0x9
>    8:   c3                      ret
>    9:   48 8b 0d b7 06 0d 00    mov    0xd06b7(%rip),%rcx        # 0xd06c7
>   10:   f7 d8                   neg    %eax
>   12:   64 89 01                mov    %eax,%fs:(%rcx)
>   15:   48                      rex.W
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240531/202405311534.86cd4043-lkp@intel.com
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
       [not found]   ` <890e5a79-8574-4a24-90ab-b9888968d5e5@redhat.com>
@ 2024-05-31 18:07     ` Yang Shi
  2024-05-31 18:13       ` Yang Shi
  2024-05-31 23:24     ` Peter Xu
  1 sibling, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-05-31 18:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kernel test robot, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.05.24 18:50, Yang Shi wrote:
> > On Fri, May 31, 2024 at 1:24 AM kernel test robot <oliver.sang@intel.com> wrote:
> >>
> >>
> >>
> >> Hello,
> >>
> >> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
> >>
> >> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>
> >> [test failed on linus/master      e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> >> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
> >>
> >> in testcase: trinity
> >> version: trinity-x86_64-6a17c218-1_20240527
> >> with following parameters:
> >>
> >>          runtime: 300s
> >>          group: group-00
> >>          nr_groups: 5
> >>
> >>
> >>
> >> compiler: gcc-13
> >> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >>
> >> (please refer to attached dmesg/kmsg for entire log/backtrace)
> >>
> >>
> >> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> >> the parent is clean.
> >>
> >> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> >> ---------------- ---------------------------
> >>         fail:runs  %reproduction    fail:runs
> >>             |             |             |
> >>             :50          68%          34:50    dmesg.Kernel_panic-not_syncing:Fatal_exception
> >>             :50          68%          34:50    dmesg.RIP:try_get_folio
> >>             :50          68%          34:50    dmesg.invalid_opcode:#[##]
> >>             :50          68%          34:50    dmesg.kernel_BUG_at_include/linux/page_ref.h
> >>
> >>
> >>
> >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >> the same patch/commit), kindly add following tags
> >> | Reported-by: kernel test robot <oliver.sang@intel.com>
> >> | Closes: https://lore.kernel.org/oe-lkp/202405311534.86cd4043-lkp@intel.com
> >>
> >>
> >> [  275.267158][ T4335] ------------[ cut here ]------------
> >> [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> >> [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> >> [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> >> [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> >> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> >> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
> >
> > If I read this BUG correctly, it is:
> >
> > VM_BUG_ON(!in_atomic() && !irqs_disabled());
> >
>
> Yes, that seems to be the one.
>
> > try_grab_folio() actually assumes it is in an atomic context (irq
> > disabled or preempt disabled) for this call path. This is achieved by
> > disabling irq in gup fast or calling it in rcu critical section in
> > page cache lookup path
>
> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>
> Is called (mm-unstable) from:
>
> (1) gup_fast function, here IRQs are disable
> (2) gup_hugepte(), possibly problematic
> (3) memfd_pin_folios(), possibly problematic
> (4) __get_user_pages(), likely problematic
>
> (1) should be fine.
>
> (2) is possibly problematic on the !fast path. If so, due to commit
>      a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>
> (3) is possibly wrong. CCing Vivek.
>
> (4) is what we hit here
>
> >
> > And try_grab_folio() is used when the folio is a large folio. The
>
>
> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
>
> That code was added in
>
> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
> Author: Peter Xu <peterx@redhat.com>
> Date:   Wed Jun 28 17:53:07 2023 -0400
>
>      mm/gup: accelerate thp gup even for "pages != NULL"
>
>      The acceleration of THP was done with ctx.page_mask, however it'll be
>      ignored if **pages is non-NULL.
>
>
> Likely the try_grab_folio() in __get_user_pages() is wrong?
>
> As documented, we already hold a refcount. Likely we should better do a
> folio_ref_add() and sanity check the refcount.

Yes, a plain folio_ref_add() seems ok for these cases.

In addition, the comment of folio_try_get_rcu(), which is just a
wrapper of folio_ref_try_add_rcu(), says:

You can also use this function if you're holding a lock that prevents
pages being frozen & removed; eg the i_pages lock for the page cache
or the mmap_lock or page table lock for page tables.  In this case, it
will always succeed, and you could have used a plain folio_get(), but
it's sometimes more convenient to have a common function called from
both locked and RCU-protected contexts.

So IIUC we can use the plain folio_get() at least for
process_vm_readv/writev since mmap_lock is held in this path.
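
As a rough illustration of that idea (hypothetical, not an actual
patch), the extra per-subpage references that __get_user_pages() takes
on a large folio could become a plain reference add for the FOLL_GET
case:

	if (page_increm > 1) {
		struct folio *folio = page_folio(page);

		/*
		 * A reference was already taken under mmap_lock + page
		 * table lock, so the folio cannot be freed here and no
		 * speculative try_get_folio() is needed.
		 */
		folio_ref_add(folio, page_increm - 1);
	}

FOLL_PIN would still need the pin accounting that try_grab_folio()
does (GUP_PIN_COUNTING_BIAS / _pincount), just not the RCU-only
speculative get.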

>
>
> In essence, I think: try_grab_folio() should only be called from GUP-fast where
> IRQs are disabled.

Yes, I agree. Only the fast path should need to call try_grab_folio().

>
> (2), (3) and (4) are possible offenders of that.
>
>
> Or am I getting it all wrong? :)
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31 18:07     ` Yang Shi
@ 2024-05-31 18:13       ` Yang Shi
  2024-05-31 18:24         ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-05-31 18:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kernel test robot, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Fri, May 31, 2024 at 11:07 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 31.05.24 18:50, Yang Shi wrote:
> > > On Fri, May 31, 2024 at 1:24 AM kernel test robot <oliver.sang@intel.com> wrote:
> > >>
> > >>
> > >>
> > >> Hello,
> > >>
> > >> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
> > >>
> > >> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> > >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >>
> > >> [test failed on linus/master      e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> > >> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
> > >>
> > >> in testcase: trinity
> > >> version: trinity-x86_64-6a17c218-1_20240527
> > >> with following parameters:
> > >>
> > >>          runtime: 300s
> > >>          group: group-00
> > >>          nr_groups: 5
> > >>
> > >>
> > >>
> > >> compiler: gcc-13
> > >> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > >>
> > >> (please refer to attached dmesg/kmsg for entire log/backtrace)
> > >>
> > >>
> > >> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> > >> the parent is clean.
> > >>
> > >> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> > >> ---------------- ---------------------------
> > >>         fail:runs  %reproduction    fail:runs
> > >>             |             |             |
> > >>             :50          68%          34:50    dmesg.Kernel_panic-not_syncing:Fatal_exception
> > >>             :50          68%          34:50    dmesg.RIP:try_get_folio
> > >>             :50          68%          34:50    dmesg.invalid_opcode:#[##]
> > >>             :50          68%          34:50    dmesg.kernel_BUG_at_include/linux/page_ref.h
> > >>
> > >>
> > >>
> > >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > >> the same patch/commit), kindly add following tags
> > >> | Reported-by: kernel test robot <oliver.sang@intel.com>
> > >> | Closes: https://lore.kernel.org/oe-lkp/202405311534.86cd4043-lkp@intel.com
> > >>
> > >>
> > >> [  275.267158][ T4335] ------------[ cut here ]------------
> > >> [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> > >> [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> > >> [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> > >> [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > >> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> > >> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
> > >
> > > If I read this BUG correctly, it is:
> > >
> > > VM_BUG_ON(!in_atomic() && !irqs_disabled());
> > >
> >
> > Yes, that seems to be the one.
> >
> > > try_grab_folio() actually assumes it is in an atomic context (irq
> > > disabled or preempt disabled) for this call path. This is achieved by
> > > disabling irq in gup fast or calling it in rcu critical section in
> > > page cache lookup path
> >
> > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> >
> > Is called (mm-unstable) from:
> >
> > (1) gup_fast function, here IRQs are disable
> > (2) gup_hugepte(), possibly problematic
> > (3) memfd_pin_folios(), possibly problematic
> > (4) __get_user_pages(), likely problematic
> >
> > (1) should be fine.
> >
> > (2) is possibly problematic on the !fast path. If so, due to commit
> >      a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> >
> > (3) is possibly wrong. CCing Vivek.
> >
> > (4) is what we hit here
> >
> > >
> > > And try_grab_folio() is used when the folio is a large folio. The
> >
> >
> > We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
> >
> > That code was added in
> >
> > commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
> > Author: Peter Xu <peterx@redhat.com>
> > Date:   Wed Jun 28 17:53:07 2023 -0400
> >
> >      mm/gup: accelerate thp gup even for "pages != NULL"
> >
> >      The acceleration of THP was done with ctx.page_mask, however it'll be
> >      ignored if **pages is non-NULL.
> >
> >
> > Likely the try_grab_folio() in __get_user_pages() is wrong?
> >
> > As documented, we already hold a refcount. Likely we should better do a
> > folio_ref_add() and sanity check the refcount.
>
> Yes, a plain folio_ref_add() seems ok for these cases.
>
> In addition, the comment of folio_try_get_rcu() says, which is just a
> wrapper of folio_ref_try_add_rcu():
>
> You can also use this function if you're holding a lock that prevents
> pages being frozen & removed; eg the i_pages lock for the page cache
> or the mmap_lock or page table lock for page tables.  In this case, it
> will always succeed, and you could have used a plain folio_get(), but
> it's sometimes more convenient to have a common function called from
> both locked and RCU-protected contexts.
>
> So IIUC we can use the plain folio_get() at least for
> process_vm_readv/writev since mmap_lock is held in this path.
>
> >
> >
> > In essence, I think: try_grab_folio() should only be called from GUP-fast where
> > IRQs are disabled.
>
> Yes, I agree. Just the fast path should need to call try_grab_folio().

try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
keep calling it and add a flag to try_grab_folio(), something like:

if flag is true
    folio_ref_add()
else
    try_get_folio()

>
> >
> > (2), (3) and (4) are possible offenders of that.
> >
> >
> > Or am I getting it all wrong? :)
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31 18:13       ` Yang Shi
@ 2024-05-31 18:24         ` David Hildenbrand
  2024-05-31 18:30           ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-05-31 18:24 UTC (permalink / raw)
  To: Yang Shi
  Cc: kernel test robot, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On 31.05.24 20:13, Yang Shi wrote:
> On Fri, May 31, 2024 at 11:07 AM Yang Shi <shy828301@gmail.com> wrote:
>>
>> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 31.05.24 18:50, Yang Shi wrote:
>>>> On Fri, May 31, 2024 at 1:24 AM kernel test robot <oliver.sang@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
>>>>>
>>>>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>> [test failed on linus/master      e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
>>>>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
>>>>>
>>>>> in testcase: trinity
>>>>> version: trinity-x86_64-6a17c218-1_20240527
>>>>> with following parameters:
>>>>>
>>>>>           runtime: 300s
>>>>>           group: group-00
>>>>>           nr_groups: 5
>>>>>
>>>>>
>>>>>
>>>>> compiler: gcc-13
>>>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>>>>>
>>>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>>>>
>>>>>
>>>>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
>>>>> the parent is clean.
>>>>>
>>>>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
>>>>> ---------------- ---------------------------
>>>>>          fail:runs  %reproduction    fail:runs
>>>>>              |             |             |
>>>>>              :50          68%          34:50    dmesg.Kernel_panic-not_syncing:Fatal_exception
>>>>>              :50          68%          34:50    dmesg.RIP:try_get_folio
>>>>>              :50          68%          34:50    dmesg.invalid_opcode:#[##]
>>>>>              :50          68%          34:50    dmesg.kernel_BUG_at_include/linux/page_ref.h
>>>>>
>>>>>
>>>>>
>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>>> the same patch/commit), kindly add following tags
>>>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>> | Closes: https://lore.kernel.org/oe-lkp/202405311534.86cd4043-lkp@intel.com
>>>>>
>>>>>
>>>>> [  275.267158][ T4335] ------------[ cut here ]------------
>>>>> [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
>>>>> [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
>>>>> [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
>>>>> [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>>>>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
>>>>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
>>>>
>>>> If I read this BUG correctly, it is:
>>>>
>>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
>>>>
>>>
>>> Yes, that seems to be the one.
>>>
>>>> try_grab_folio() actually assumes it is in an atomic context (irq
>>>> disabled or preempt disabled) for this call path. This is achieved by
>>>> disabling irq in gup fast or calling it in rcu critical section in
>>>> page cache lookup path
>>>
>>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>>>
>>> Is called (mm-unstable) from:
>>>
>>> (1) gup_fast function, here IRQs are disable
>>> (2) gup_hugepte(), possibly problematic
>>> (3) memfd_pin_folios(), possibly problematic
>>> (4) __get_user_pages(), likely problematic
>>>
>>> (1) should be fine.
>>>
>>> (2) is possibly problematic on the !fast path. If so, due to commit
>>>       a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>>>
>>> (3) is possibly wrong. CCing Vivek.
>>>
>>> (4) is what we hit here
>>>
>>>>
>>>> And try_grab_folio() is used when the folio is a large folio. The
>>>
>>>
>>> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
>>>
>>> That code was added in
>>>
>>> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
>>> Author: Peter Xu <peterx@redhat.com>
>>> Date:   Wed Jun 28 17:53:07 2023 -0400
>>>
>>>       mm/gup: accelerate thp gup even for "pages != NULL"
>>>
>>>       The acceleration of THP was done with ctx.page_mask, however it'll be
>>>       ignored if **pages is non-NULL.
>>>
>>>
>>> Likely the try_grab_folio() in __get_user_pages() is wrong?
>>>
>>> As documented, we already hold a refcount. Likely we should better do a
>>> folio_ref_add() and sanity check the refcount.
>>
>> Yes, a plain folio_ref_add() seems ok for these cases.
>>
>> In addition, the comment of folio_try_get_rcu() says, which is just a
>> wrapper of folio_ref_try_add_rcu():
>>
>> You can also use this function if you're holding a lock that prevents
>> pages being frozen & removed; eg the i_pages lock for the page cache
>> or the mmap_lock or page table lock for page tables.  In this case, it
>> will always succeed, and you could have used a plain folio_get(), but
>> it's sometimes more convenient to have a common function called from
>> both locked and RCU-protected contexts.
>>
>> So IIUC we can use the plain folio_get() at least for
>> process_vm_readv/writev since mmap_lock is held in this path.
>>
>>>
>>>
>>> In essence, I think: try_grab_folio() should only be called from GUP-fast where
>>> IRQs are disabled.
>>
>> Yes, I agree. Just the fast path should need to call try_grab_folio().
> 
> try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
> keep calling it and add a flag to try_grab_folio, just like:
> 
> if flag is true
>      folio_ref_add()
> else
>      try_get_folio()


try_grab_page() is what we use on the GUP-slow path. We'd likely want a 
folio variant of that.

We might want to call that gup_try_grab_folio() and rename the other one 
to gup_fast_try_grab_folio().

Or something like that :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31 18:24         ` David Hildenbrand
@ 2024-05-31 18:30           ` Yang Shi
  2024-05-31 18:38             ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-05-31 18:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kernel test robot, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Fri, May 31, 2024 at 11:24 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.05.24 20:13, Yang Shi wrote:
> > On Fri, May 31, 2024 at 11:07 AM Yang Shi <shy828301@gmail.com> wrote:
> >>
> >> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 31.05.24 18:50, Yang Shi wrote:
> >>>> On Fri, May 31, 2024 at 1:24 AM kernel test robot <oliver.sang@intel.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
> >>>>>
> >>>>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> >>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>>>>
> >>>>> [test failed on linus/master      e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> >>>>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
> >>>>>
> >>>>> in testcase: trinity
> >>>>> version: trinity-x86_64-6a17c218-1_20240527
> >>>>> with following parameters:
> >>>>>
> >>>>>           runtime: 300s
> >>>>>           group: group-00
> >>>>>           nr_groups: 5
> >>>>>
> >>>>>
> >>>>>
> >>>>> compiler: gcc-13
> >>>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >>>>>
> >>>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
> >>>>>
> >>>>>
> >>>>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> >>>>> the parent is clean.
> >>>>>
> >>>>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> >>>>> ---------------- ---------------------------
> >>>>>          fail:runs  %reproduction    fail:runs
> >>>>>              |             |             |
> >>>>>              :50          68%          34:50    dmesg.Kernel_panic-not_syncing:Fatal_exception
> >>>>>              :50          68%          34:50    dmesg.RIP:try_get_folio
> >>>>>              :50          68%          34:50    dmesg.invalid_opcode:#[##]
> >>>>>              :50          68%          34:50    dmesg.kernel_BUG_at_include/linux/page_ref.h
> >>>>>
> >>>>>
> >>>>>
> >>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >>>>> the same patch/commit), kindly add following tags
> >>>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
> >>>>> | Closes: https://lore.kernel.org/oe-lkp/202405311534.86cd4043-lkp@intel.com
> >>>>>
> >>>>>
> >>>>> [  275.267158][ T4335] ------------[ cut here ]------------
> >>>>> [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> >>>>> [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> >>>>> [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> >>>>> [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> >>>>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> >>>>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
> >>>>
> >>>> If I read this BUG correctly, it is:
> >>>>
> >>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
> >>>>
> >>>
> >>> Yes, that seems to be the one.
> >>>
> >>>> try_grab_folio() actually assumes it is in an atomic context (irq
> >>>> disabled or preempt disabled) for this call path. This is achieved by
> >>>> disabling irq in gup fast or calling it in rcu critical section in
> >>>> page cache lookup path
> >>>
> >>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> >>>
> >>> Is called (mm-unstable) from:
> >>>
> >>> (1) gup_fast function, here IRQs are disable
> >>> (2) gup_hugepte(), possibly problematic
> >>> (3) memfd_pin_folios(), possibly problematic
> >>> (4) __get_user_pages(), likely problematic
> >>>
> >>> (1) should be fine.
> >>>
> >>> (2) is possibly problematic on the !fast path. If so, due to commit
> >>>       a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> >>>
> >>> (3) is possibly wrong. CCing Vivek.
> >>>
> >>> (4) is what we hit here
> >>>
> >>>>
> >>>> And try_grab_folio() is used when the folio is a large folio. The
> >>>
> >>>
> >>> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
> >>>
> >>> That code was added in
> >>>
> >>> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
> >>> Author: Peter Xu <peterx@redhat.com>
> >>> Date:   Wed Jun 28 17:53:07 2023 -0400
> >>>
> >>>       mm/gup: accelerate thp gup even for "pages != NULL"
> >>>
> >>>       The acceleration of THP was done with ctx.page_mask, however it'll be
> >>>       ignored if **pages is non-NULL.
> >>>
> >>>
> >>> Likely the try_grab_folio() in __get_user_pages() is wrong?
> >>>
> >>> As documented, we already hold a refcount. Likely we should better do a
> >>> folio_ref_add() and sanity check the refcount.
> >>
> >> Yes, a plain folio_ref_add() seems ok for these cases.
> >>
> >> In addition, the comment of folio_try_get_rcu() says, which is just a
> >> wrapper of folio_ref_try_add_rcu():
> >>
> >> You can also use this function if you're holding a lock that prevents
> >> pages being frozen & removed; eg the i_pages lock for the page cache
> >> or the mmap_lock or page table lock for page tables.  In this case, it
> >> will always succeed, and you could have used a plain folio_get(), but
> >> it's sometimes more convenient to have a common function called from
> >> both locked and RCU-protected contexts.
> >>
> >> So IIUC we can use the plain folio_get() at least for
> >> process_vm_readv/writev since mmap_lock is held in this path.
> >>
> >>>
> >>>
> >>> In essence, I think: try_grab_folio() should only be called from GUP-fast where
> >>> IRQs are disabled.
> >>
> >> Yes, I agree. Just the fast path should need to call try_grab_folio().
> >
> > try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
> > keep calling it and add a flag to try_grab_folio, just like:
> >
> > if flag is true
> >      folio_ref_add()
> > else
> >      try_get_folio()
>
>
> try_grab_page() is what we use on the GUP-slow path. We'd likely want a
> folio variant of that.
>
> We might want to call that gup_try_grab_folio() and rename the other one
> to gup_fast_try_grab_folio().

Won't we duplicate most of the code with two versions of try_grab_folio()?

I meant something like:

try_grab_folio(struct page *page, int refs, unsigned int flags, bool fast)
{
    if fast
        try_get_folio()
    else
        folio_ref_add()
}

We can keep the code duplication to a minimum this way.
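
To make that concrete, roughly (an untested sketch; the zero-page, PCI
P2PDMA and pin-stats handling are left out):

static struct folio *try_grab_folio(struct page *page, int refs,
				    unsigned int flags, bool fast)
{
	struct folio *folio;

	if (fast) {
		/*
		 * GUP-fast: the folio may be freed concurrently, so the
		 * reference must be taken speculatively.
		 */
		folio = try_get_folio(page, refs);
		if (!folio)
			return NULL;
	} else {
		/*
		 * GUP-slow: mmap_lock/PTL keep the folio alive and the
		 * caller already holds a reference, so a plain ref add
		 * is enough.
		 */
		folio = page_folio(page);
		if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
			return NULL;
		folio_ref_add(folio, refs);
	}

	if (flags & FOLL_PIN) {
		/* convert the references just taken into pins */
		if (folio_test_large(folio))
			atomic_add(refs, &folio->_pincount);
		else
			folio_ref_add(folio, refs * (GUP_PIN_COUNTING_BIAS - 1));
	}
	return folio;
}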

>
> Or something like that :)
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31 18:30           ` Yang Shi
@ 2024-05-31 18:38             ` David Hildenbrand
  2024-05-31 19:06               ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-05-31 18:38 UTC (permalink / raw)
  To: Yang Shi
  Cc: kernel test robot, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On 31.05.24 20:30, Yang Shi wrote:
> On Fri, May 31, 2024 at 11:24 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 31.05.24 20:13, Yang Shi wrote:
>>> On Fri, May 31, 2024 at 11:07 AM Yang Shi <shy828301@gmail.com> wrote:
>>>>
>>>> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 31.05.24 18:50, Yang Shi wrote:
>>>>>> On Fri, May 31, 2024 at 1:24 AM kernel test robot <oliver.sang@intel.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
>>>>>>>
>>>>>>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>>>
>>>>>>> [test failed on linus/master      e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
>>>>>>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
>>>>>>>
>>>>>>> in testcase: trinity
>>>>>>> version: trinity-x86_64-6a17c218-1_20240527
>>>>>>> with following parameters:
>>>>>>>
>>>>>>>            runtime: 300s
>>>>>>>            group: group-00
>>>>>>>            nr_groups: 5
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> compiler: gcc-13
>>>>>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>>>>>>>
>>>>>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>>>>>>
>>>>>>>
>>>>>>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
>>>>>>> the parent is clean.
>>>>>>>
>>>>>>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
>>>>>>> ---------------- ---------------------------
>>>>>>>           fail:runs  %reproduction    fail:runs
>>>>>>>               |             |             |
>>>>>>>               :50          68%          34:50    dmesg.Kernel_panic-not_syncing:Fatal_exception
>>>>>>>               :50          68%          34:50    dmesg.RIP:try_get_folio
>>>>>>>               :50          68%          34:50    dmesg.invalid_opcode:#[##]
>>>>>>>               :50          68%          34:50    dmesg.kernel_BUG_at_include/linux/page_ref.h
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>>>>> the same patch/commit), kindly add following tags
>>>>>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>>>> | Closes: https://lore.kernel.org/oe-lkp/202405311534.86cd4043-lkp@intel.com
>>>>>>>
>>>>>>>
>>>>>>> [  275.267158][ T4335] ------------[ cut here ]------------
>>>>>>> [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
>>>>>>> [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
>>>>>>> [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
>>>>>>> [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>>>>>>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
>>>>>>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
>>>>>>
>>>>>> If I read this BUG correctly, it is:
>>>>>>
>>>>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
>>>>>>
>>>>>
>>>>> Yes, that seems to be the one.
>>>>>
>>>>>> try_grab_folio() actually assumes it is in an atomic context (irq
>>>>>> disabled or preempt disabled) for this call path. This is achieved by
>>>>>> disabling irq in gup fast or calling it in rcu critical section in
>>>>>> page cache lookup path
>>>>>
>>>>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>>>>>
>>>>> Is called (mm-unstable) from:
>>>>>
>>>>> (1) gup_fast function, here IRQs are disable
>>>>> (2) gup_hugepte(), possibly problematic
>>>>> (3) memfd_pin_folios(), possibly problematic
>>>>> (4) __get_user_pages(), likely problematic
>>>>>
>>>>> (1) should be fine.
>>>>>
>>>>> (2) is possibly problematic on the !fast path. If so, due to commit
>>>>>        a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>>>>>
>>>>> (3) is possibly wrong. CCing Vivek.
>>>>>
>>>>> (4) is what we hit here
>>>>>
>>>>>>
>>>>>> And try_grab_folio() is used when the folio is a large folio. The
>>>>>
>>>>>
>>>>> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
>>>>>
>>>>> That code was added in
>>>>>
>>>>> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
>>>>> Author: Peter Xu <peterx@redhat.com>
>>>>> Date:   Wed Jun 28 17:53:07 2023 -0400
>>>>>
>>>>>        mm/gup: accelerate thp gup even for "pages != NULL"
>>>>>
>>>>>        The acceleration of THP was done with ctx.page_mask, however it'll be
>>>>>        ignored if **pages is non-NULL.
>>>>>
>>>>>
>>>>> Likely the try_grab_folio() in __get_user_pages() is wrong?
>>>>>
>>>>> As documented, we already hold a refcount. Likely we should better do a
>>>>> folio_ref_add() and sanity check the refcount.
>>>>
>>>> Yes, a plain folio_ref_add() seems ok for these cases.
>>>>
>>>> In addition, the comment of folio_try_get_rcu() says, which is just a
>>>> wrapper of folio_ref_try_add_rcu():
>>>>
>>>> You can also use this function if you're holding a lock that prevents
>>>> pages being frozen & removed; eg the i_pages lock for the page cache
>>>> or the mmap_lock or page table lock for page tables.  In this case, it
>>>> will always succeed, and you could have used a plain folio_get(), but
>>>> it's sometimes more convenient to have a common function called from
>>>> both locked and RCU-protected contexts.
>>>>
>>>> So IIUC we can use the plain folio_get() at least for
>>>> process_vm_readv/writev since mmap_lock is held in this path.
>>>>
>>>>>
>>>>>
>>>>> In essence, I think: try_grab_folio() should only be called from GUP-fast where
>>>>> IRQs are disabled.
>>>>
>>>> Yes, I agree. Just the fast path should need to call try_grab_folio().
>>>
>>> try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
>>> keep calling it and add a flag to try_grab_folio, just like:
>>>
>>> if flag is true
>>>       folio_ref_add()
>>> else
>>>       try_get_folio()
>>
>>
>> try_grab_page() is what we use on the GUP-slow path. We'd likely want a
>> folio variant of that.
>>
>> We might want to call that gup_try_grab_folio() and rename the other one
>> to gup_fast_try_grab_folio().
> 
> Won't we duplicate the most code with two versions try_grab_folio()?
> 
> I meant something like:
> 
> try_grab_folio(struct page *page, int refs, unsigned int flags, bool fast)
> {
>      if fast
>          try_get_folio()
>      else
>          folio_ref_add()
> }
> 

That's insufficient to handle FOLL_PIN. Likely we should do this:

diff --git a/mm/gup.c b/mm/gup.c
index 231711efa390d..fea93a64bf235 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -203,8 +203,8 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
  }
  
  /**
- * try_grab_page() - elevate a page's refcount by a flag-dependent amount
- * @page:    pointer to page to be grabbed
+ * gup_try_grab_folio() - elevate a folio's refcount by a flag-dependent amount
+ * @folio:   pointer to folio to be grabbed
   * @flags:   gup flags: these are the FOLL_* flag values.
   *
   * This might not do anything at all, depending on the flags argument.
@@ -216,16 +216,16 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
   * time. Cases: please see the try_grab_folio() documentation, with
   * "refs=1".
   *
+ * Must not be called from GUP-fast: the folio must not get freed concurrently.
+ *
   * Return: 0 for success, or if no action was required (if neither FOLL_PIN
   * nor FOLL_GET was set, nothing is done). A negative error code for failure:
   *
   *   -ENOMEM           FOLL_GET or FOLL_PIN was set, but the page could not
   *                     be grabbed.
   */
-int __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check gup_try_grab_folio(struct folio *folio, unsigned int flags)
  {
-       struct folio *folio = page_folio(page);
-
         if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
                 return -ENOMEM;
  
@@ -239,7 +239,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
                  * Don't take a pin on the zero page - it's not going anywhere
                  * and it is used in a *lot* of places.
                  */
-               if (is_zero_page(page))
+               if (is_zero_folio(folio))
                         return 0;
  
                 /*
@@ -260,6 +260,11 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
         return 0;
  }
  
+int __must_check try_grab_page(struct page *page, unsigned int flags)
+{
+       return gup_try_grab_folio(page_folio(page), flags);
+}
+
  /**
   * unpin_user_page() - release a dma-pinned page
   * @page:            pointer to page to be released


Then, fix the callers and rename the other one to gup_fast_*.
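
For example (a rough, untested sketch, and assuming the slow-path variant
ends up being called gup_try_grab_folio() as above), a caller like
follow_page_pte() would then simply do:

	/* page was looked up under the PTL, its refcount cannot be 0 */
	ret = gup_try_grab_folio(page_folio(page), flags);
	if (unlikely(ret)) {
		page = ERR_PTR(ret);
		goto out;
	}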


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31 18:38             ` David Hildenbrand
@ 2024-05-31 19:06               ` Yang Shi
  2024-05-31 20:57                 ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-05-31 19:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kernel test robot, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Fri, May 31, 2024 at 11:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.05.24 20:30, Yang Shi wrote:
> > On Fri, May 31, 2024 at 11:24 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 31.05.24 20:13, Yang Shi wrote:
> >>> On Fri, May 31, 2024 at 11:07 AM Yang Shi <shy828301@gmail.com> wrote:
> >>>>
> >>>> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 31.05.24 18:50, Yang Shi wrote:
> >>>>>> On Fri, May 31, 2024 at 1:24 AM kernel test robot <oliver.sang@intel.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
> >>>>>>>
> >>>>>>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> >>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>>>>>>
> >>>>>>> [test failed on linus/master      e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> >>>>>>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
> >>>>>>>
> >>>>>>> in testcase: trinity
> >>>>>>> version: trinity-x86_64-6a17c218-1_20240527
> >>>>>>> with following parameters:
> >>>>>>>
> >>>>>>>            runtime: 300s
> >>>>>>>            group: group-00
> >>>>>>>            nr_groups: 5
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> compiler: gcc-13
> >>>>>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >>>>>>>
> >>>>>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
> >>>>>>>
> >>>>>>>
> >>>>>>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> >>>>>>> the parent is clean.
> >>>>>>>
> >>>>>>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> >>>>>>> ---------------- ---------------------------
> >>>>>>>           fail:runs  %reproduction    fail:runs
> >>>>>>>               |             |             |
> >>>>>>>               :50          68%          34:50    dmesg.Kernel_panic-not_syncing:Fatal_exception
> >>>>>>>               :50          68%          34:50    dmesg.RIP:try_get_folio
> >>>>>>>               :50          68%          34:50    dmesg.invalid_opcode:#[##]
> >>>>>>>               :50          68%          34:50    dmesg.kernel_BUG_at_include/linux/page_ref.h
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >>>>>>> the same patch/commit), kindly add following tags
> >>>>>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
> >>>>>>> | Closes: https://lore.kernel.org/oe-lkp/202405311534.86cd4043-lkp@intel.com
> >>>>>>>
> >>>>>>>
> >>>>>>> [  275.267158][ T4335] ------------[ cut here ]------------
> >>>>>>> [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> >>>>>>> [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> >>>>>>> [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> >>>>>>> [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> >>>>>>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> >>>>>>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
> >>>>>>
> >>>>>> If I read this BUG correctly, it is:
> >>>>>>
> >>>>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
> >>>>>>
> >>>>>
> >>>>> Yes, that seems to be the one.
> >>>>>
> >>>>>> try_grab_folio() actually assumes it is in an atomic context (irq
> >>>>>> disabled or preempt disabled) for this call path. This is achieved by
> >>>>>> disabling irq in gup fast or calling it in rcu critical section in
> >>>>>> page cache lookup path
> >>>>>
> >>>>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> >>>>>
> >>>>> Is called (mm-unstable) from:
> >>>>>
> >>>>> (1) gup_fast function, here IRQs are disable
> >>>>> (2) gup_hugepte(), possibly problematic
> >>>>> (3) memfd_pin_folios(), possibly problematic
> >>>>> (4) __get_user_pages(), likely problematic
> >>>>>
> >>>>> (1) should be fine.
> >>>>>
> >>>>> (2) is possibly problematic on the !fast path. If so, due to commit
> >>>>>        a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> >>>>>
> >>>>> (3) is possibly wrong. CCing Vivek.
> >>>>>
> >>>>> (4) is what we hit here
> >>>>>
> >>>>>>
> >>>>>> And try_grab_folio() is used when the folio is a large folio. The
> >>>>>
> >>>>>
> >>>>> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
> >>>>>
> >>>>> That code was added in
> >>>>>
> >>>>> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
> >>>>> Author: Peter Xu <peterx@redhat.com>
> >>>>> Date:   Wed Jun 28 17:53:07 2023 -0400
> >>>>>
> >>>>>        mm/gup: accelerate thp gup even for "pages != NULL"
> >>>>>
> >>>>>        The acceleration of THP was done with ctx.page_mask, however it'll be
> >>>>>        ignored if **pages is non-NULL.
> >>>>>
> >>>>>
> >>>>> Likely the try_grab_folio() in __get_user_pages() is wrong?
> >>>>>
> >>>>> As documented, we already hold a refcount. Likely we should better do a
> >>>>> folio_ref_add() and sanity check the refcount.
> >>>>
> >>>> Yes, a plain folio_ref_add() seems ok for these cases.
> >>>>
> >>>> In addition, the comment of folio_try_get_rcu() says, which is just a
> >>>> wrapper of folio_ref_try_add_rcu():
> >>>>
> >>>> You can also use this function if you're holding a lock that prevents
> >>>> pages being frozen & removed; eg the i_pages lock for the page cache
> >>>> or the mmap_lock or page table lock for page tables.  In this case, it
> >>>> will always succeed, and you could have used a plain folio_get(), but
> >>>> it's sometimes more convenient to have a common function called from
> >>>> both locked and RCU-protected contexts.
> >>>>
> >>>> So IIUC we can use the plain folio_get() at least for
> >>>> process_vm_readv/writev since mmap_lock is held in this path.
> >>>>
> >>>>>
> >>>>>
> >>>>> In essence, I think: try_grab_folio() should only be called from GUP-fast where
> >>>>> IRQs are disabled.
> >>>>
> >>>> Yes, I agree. Just the fast path should need to call try_grab_folio().
> >>>
> >>> try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
> >>> keep calling it and add a flag to try_grab_folio, just like:
> >>>
> >>> if flag is true
> >>>       folio_ref_add()
> >>> else
> >>>       try_get_folio()
> >>
> >>
> >> try_grab_page() is what we use on the GUP-slow path. We'd likely want a
> >> folio variant of that.
> >>
> >> We might want to call that gup_try_grab_folio() and rename the other one
> >> to gup_fast_try_grab_folio().
> >
> > Won't we duplicate the most code with two versions try_grab_folio()?
> >
> > I meant something like:
> >
> > try_grab_folio(struct page *page, int refs, unsigned int flags, bool fast)
> > {
> >      if fast
> >          try_get_folio()
> >      else
> >          folio_ref_add()
> > }
> >
>
> That's insufficient to handle FOLL_PIN. Likely we should do this:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 231711efa390d..fea93a64bf235 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -203,8 +203,8 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>   }
>
>   /**
> - * try_grab_page() - elevate a page's refcount by a flag-dependent amount
> - * @page:    pointer to page to be grabbed
> + * gup_try_grab_folio() - elevate a folio's refcount by a flag-dependent amount
> + * @folio:   pointer to folio to be grabbed
>    * @flags:   gup flags: these are the FOLL_* flag values.
>    *
>    * This might not do anything at all, depending on the flags argument.
> @@ -216,16 +216,16 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>    * time. Cases: please see the try_grab_folio() documentation, with
>    * "refs=1".
>    *
> + * Must not be called from GUP-fast: the folio must not get freed concurrently.
> + *
>    * Return: 0 for success, or if no action was required (if neither FOLL_PIN
>    * nor FOLL_GET was set, nothing is done). A negative error code for failure:
>    *
>    *   -ENOMEM           FOLL_GET or FOLL_PIN was set, but the page could not
>    *                     be grabbed.
>    */
> -int __must_check try_grab_page(struct page *page, unsigned int flags)
> +int __must_check gup_try_grab_folio(struct folio *folio, unsigned int flags)
>   {
> -       struct folio *folio = page_folio(page);
> -
>          if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
>                  return -ENOMEM;
>
> @@ -239,7 +239,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>                   * Don't take a pin on the zero page - it's not going anywhere
>                   * and it is used in a *lot* of places.
>                   */
> -               if (is_zero_page(page))
> +               if (is_zero_folio(folio))
>                          return 0;
>
>                  /*
> @@ -260,6 +260,11 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>          return 0;
>   }
>
> +int __must_check try_grab_page(struct page *page, unsigned int flags)
> +{
> +       return gup_try_grab_folio(page_folio(page), flags);
> +}
> +
>   /**
>    * unpin_user_page() - release a dma-pinned page
>    * @page:            pointer to page to be released
>
>
> Then, fix the callers and rename the other one to gup_fast_*.

I see your point. Replacing try_grab_page() with try_grab_folio() for the
slow path: it returns 0 or an errno, but it should never fail on the slow
path since we already hold at least one reference, IIUC. The fast version
should behave just like the old try_grab_folio(), returning a pointer to the
folio or NULL.
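
Roughly, the split would look something like this (a sketch only, the exact
names and signatures are of course still up for discussion):

	/*
	 * Slow path: the caller already holds a reference (page table lock,
	 * mmap_lock, ...), so this only adds references and returns 0 or a
	 * negative errno.
	 */
	int __must_check try_grab_folio(struct folio *folio, int refs,
					unsigned int flags);

	/*
	 * GUP-fast path: the folio may get freed concurrently, so this may
	 * fail; returns the folio on success or NULL on failure.
	 */
	struct folio *try_grab_folio_fast(struct page *page, int refs,
					  unsigned int flags);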

>
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31 19:06               ` Yang Shi
@ 2024-05-31 20:57                 ` Yang Shi
  2024-06-03 14:02                   ` Oliver Sang
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-05-31 20:57 UTC (permalink / raw)
  To: David Hildenbrand, kernel test robot
  Cc: Peter Xu, Jason Gunthorpe, Vivek Kasireddy, Rik van Riel, oe-lkp,
	lkp, linux-kernel, Andrew Morton, Matthew Wilcox,
	Christopher Lameter, linux-mm

[-- Attachment #1: Type: text/plain, Size: 271 bytes --]

Hi Oliver,

I just came up with a quick patch (build-tested only) per the discussion,
attached. Can you please give it a try? Once it is verified, I will
refine the patch and submit it for review.

Thanks,
Yang

>
> >
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

[-- Attachment #2: try_grab_folio_fix.patch --]
[-- Type: application/octet-stream, Size: 13001 bytes --]

diff --git a/mm/gup.c b/mm/gup.c
index e17466fd62bb..96a84abe4ef2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -100,7 +100,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 }
 
 /**
- * try_grab_folio() - Attempt to get or pin a folio.
+ * try_grab_folio_fast() - Attempt to get or pin a folio in fast path.
  * @page:  pointer to page to be grabbed
  * @refs:  the value to (effectively) add to the folio's refcount
  * @flags: gup flags: these are the FOLL_* flag values.
@@ -125,7 +125,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
  * nor FOLL_PIN was set, that's considered failure, and furthermore,
  * a likely bug in the caller, so a warning is also emitted.
  */
-struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
+struct folio *try_grab_folio_fast(struct page *page, int refs, unsigned int flags)
 {
 	struct folio *folio;
 
@@ -205,28 +205,28 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 }
 
 /**
- * try_grab_page() - elevate a page's refcount by a flag-dependent amount
- * @page:    pointer to page to be grabbed
- * @flags:   gup flags: these are the FOLL_* flag values.
+ * try_grab_folio() - add a folio's refcount by a flag-dependent amount
+ * @folio:    pointer to folio to be grabbed
+ * @refs:     the value to (effectively) add to the folio's refcount
+ * @flags:    gup flags: these are the FOLL_* flag values.
  *
  * This might not do anything at all, depending on the flags argument.
  *
  * "grab" names in this file mean, "look at flags to decide whether to use
- * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount.
  *
  * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
- * time. Cases: please see the try_grab_folio() documentation, with
- * "refs=1".
+ * time.
  *
  * Return: 0 for success, or if no action was required (if neither FOLL_PIN
  * nor FOLL_GET was set, nothing is done). A negative error code for failure:
  *
- *   -ENOMEM		FOLL_GET or FOLL_PIN was set, but the page could not
+ *   -ENOMEM		FOLL_GET or FOLL_PIN was set, but the folio could not
  *			be grabbed.
  */
-int __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags)
 {
-	struct folio *folio = page_folio(page);
+	struct page *page = &folio->page;
 
 	if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
 		return -ENOMEM;
@@ -235,7 +235,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 		return -EREMOTEIO;
 
 	if (flags & FOLL_GET)
-		folio_ref_inc(folio);
+		folio_ref_add(folio, refs);
 	else if (flags & FOLL_PIN) {
 		/*
 		 * Don't take a pin on the zero page - it's not going anywhere
@@ -245,18 +245,18 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 			return 0;
 
 		/*
-		 * Similar to try_grab_folio(): be sure to *also*
-		 * increment the normal page refcount field at least once,
+		 * Increment the normal page refcount field at least once,
 		 * so that the page really is pinned.
 		 */
 		if (folio_test_large(folio)) {
-			folio_ref_add(folio, 1);
-			atomic_add(1, &folio->_pincount);
+			folio_ref_add(folio, refs);
+			atomic_add(refs, &folio->_pincount);
 		} else {
-			folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+			folio_ref_add(folio,
+					refs * GUP_PIN_COUNTING_BIAS);
 		}
 
-		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
 	}
 
 	return 0;
@@ -584,7 +584,7 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  */
 static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz,
 		       unsigned long addr, unsigned long end, unsigned int flags,
-		       struct page **pages, int *nr)
+		       struct page **pages, int *nr, bool fast)
 {
 	unsigned long pte_end;
 	struct page *page;
@@ -607,9 +607,15 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
 	page = pte_page(pte);
 	refs = record_subpages(page, sz, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
-	if (!folio)
-		return 0;
+	if (fast) {
+		folio = try_grab_folio_fast(page, refs, flags);
+		if (!folio)
+			return 0;
+	} else {
+		folio = page_folio(page);
+		if (try_grab_folio(folio, refs, flags))
+			return 0;
+	}
 
 	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
 		gup_put_folio(folio, refs, flags);
@@ -637,7 +643,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
 static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 		      unsigned long addr, unsigned int pdshift,
 		      unsigned long end, unsigned int flags,
-		      struct page **pages, int *nr)
+		      struct page **pages, int *nr, bool fast)
 {
 	pte_t *ptep;
 	unsigned long sz = 1UL << hugepd_shift(hugepd);
@@ -647,7 +653,7 @@ static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 	ptep = hugepte_offset(hugepd, addr, pdshift);
 	do {
 		next = hugepte_addr_end(addr, end, sz);
-		ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr);
+		ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast);
 		if (ret != 1)
 			return ret;
 	} while (ptep++, addr = next, addr != end);
@@ -674,7 +680,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 	ptep = hugepte_offset(hugepd, addr, pdshift);
 	ptl = huge_pte_lock(h, vma->vm_mm, ptep);
 	ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE,
-			 flags, &page, &nr);
+			 flags, &page, &nr, false);
 	spin_unlock(ptl);
 
 	if (ret == 1) {
@@ -691,7 +697,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 			     unsigned long addr, unsigned int pdshift,
 			     unsigned long end, unsigned int flags,
-			     struct page **pages, int *nr)
+			     struct page **pages, int *nr, bool fast)
 {
 	return 0;
 }
@@ -778,7 +784,7 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
 	    gup_must_unshare(vma, flags, page))
 		return ERR_PTR(-EMLINK);
 
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		page = ERR_PTR(ret);
 	else
@@ -855,7 +861,7 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 			!PageAnonExclusive(page), page);
 
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -1017,8 +1023,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 		       !PageAnonExclusive(page), page);
 
-	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
-	ret = try_grab_page(page, flags);
+	/* try_grab_folio() does nothing unless FOLL_GET or FOLL_PIN is set. */
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (unlikely(ret)) {
 		page = ERR_PTR(ret);
 		goto out;
@@ -1282,7 +1288,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 			goto unmap;
 		*page = pte_page(entry);
 	}
-	ret = try_grab_page(*page, gup_flags);
+	ret = try_grab_folio(page_folio(*page), 1, gup_flags);
 	if (unlikely(ret))
 		goto unmap;
 out:
@@ -1685,20 +1691,19 @@ static long __get_user_pages(struct mm_struct *mm,
 			 * pages.
 			 */
 			if (page_increm > 1) {
-				struct folio *folio;
+				struct folio *folio = page_folio(page);
 
 				/*
 				 * Since we already hold refcount on the
 				 * large folio, this should never fail.
 				 */
-				folio = try_grab_folio(page, page_increm - 1,
-						       foll_flags);
-				if (WARN_ON_ONCE(!folio)) {
+				if (try_grab_folio(folio, page_increm - 1,
+						   foll_flags)) {
 					/*
 					 * Release the 1st page ref if the
 					 * folio is problematic, fail hard.
 					 */
-					gup_put_folio(page_folio(page), 1,
+					gup_put_folio(folio, 1,
 						      foll_flags);
 					ret = -EFAULT;
 					goto out;
@@ -3041,7 +3046,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		folio = try_grab_folio(page, 1, flags);
+		folio = try_grab_folio_fast(page, 1, flags);
 		if (!folio)
 			goto pte_unmap;
 
@@ -3128,7 +3133,7 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
 			break;
 		}
 
-		folio = try_grab_folio(page, 1, flags);
+		folio = try_grab_folio_fast(page, 1, flags);
 		if (!folio) {
 			gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
 			break;
@@ -3217,7 +3222,7 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	page = pmd_page(orig);
 	refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3261,7 +3266,7 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
 	page = pud_page(orig);
 	refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3301,7 +3306,7 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = pgd_page(orig);
 	refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
+	folio = try_grab_folio_fast(page, refs, flags);
 	if (!folio)
 		return 0;
 
@@ -3355,7 +3360,7 @@ static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
 			 * pmd format and THP pmd format
 			 */
 			if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr,
-				       PMD_SHIFT, next, flags, pages, nr) != 1)
+				       PMD_SHIFT, next, flags, pages, nr, true) != 1)
 				return 0;
 		} else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags,
 					       pages, nr))
@@ -3385,7 +3390,7 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
 				return 0;
 		} else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) {
 			if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr,
-				       PUD_SHIFT, next, flags, pages, nr) != 1)
+				       PUD_SHIFT, next, flags, pages, nr, true) != 1)
 				return 0;
 		} else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags,
 					       pages, nr))
@@ -3412,7 +3417,7 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr,
 		BUILD_BUG_ON(p4d_leaf(p4d));
 		if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
 			if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr,
-				       P4D_SHIFT, next, flags, pages, nr) != 1)
+				       P4D_SHIFT, next, flags, pages, nr, true) != 1)
 				return 0;
 		} else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags,
 					       pages, nr))
@@ -3441,7 +3446,7 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end,
 				return;
 		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
 			if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr,
-				       PGDIR_SHIFT, next, flags, pages, nr) != 1)
+				       PGDIR_SHIFT, next, flags, pages, nr, true) != 1)
 				return;
 		} else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags,
 					       pages, nr))
@@ -3842,13 +3847,14 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 				    next_idx != folio_index(fbatch.folios[i]))
 					continue;
 
-				folio = try_grab_folio(&fbatch.folios[i]->page,
-						       1, FOLL_PIN);
-				if (!folio) {
+				if (try_grab_folio(fbatch.folios[i],
+						       1, FOLL_PIN)) {
 					folio_batch_release(&fbatch);
 					goto err;
 				}
 
+				folio = fbatch.folios[i];
+
 				if (nr_folios == 0)
 					*offset = offset_in_folio(folio, start);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e49f402d7c7..b6280a01c5fd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1331,7 +1331,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	ret = try_grab_page(page, flags);
+	ret = try_grab_folio(page_folio(page), 1, flags);
 	if (ret)
 		page = ERR_PTR(ret);
 
diff --git a/mm/internal.h b/mm/internal.h
index 3419c329b3bc..539d5f550102 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1215,8 +1215,8 @@ int migrate_device_coherent_page(struct page *page);
 /*
  * mm/gup.c
  */
-struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
-int __must_check try_grab_page(struct page *page, unsigned int flags);
+struct folio *try_grab_folio_fast(struct page *page, int refs, unsigned int flags);
+int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags);
 
 /*
  * mm/huge_memory.c

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
       [not found]   ` <890e5a79-8574-4a24-90ab-b9888968d5e5@redhat.com>
  2024-05-31 18:07     ` Yang Shi
@ 2024-05-31 23:24     ` Peter Xu
  2024-06-01  0:01       ` Yang Shi
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-05-31 23:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yang Shi, kernel test robot, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> 
> Is called (mm-unstable) from:
> 
> (1) gup_fast function, here IRQs are disable
> (2) gup_hugepte(), possibly problematic
> (3) memfd_pin_folios(), possibly problematic
> (4) __get_user_pages(), likely problematic
> 
> (1) should be fine.
> 
> (2) is possibly problematic on the !fast path. If so, due to commit
>     a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> 
> (3) is possibly wrong. CCing Vivek.
> 
> (4) is what we hit here

I guess it was overlooked because try_grab_folio() didn't have any comment
or other hint that RCU/IRQ-dependent internal helpers are being used, hence
it is a bit confusing.  E.g. it has a different context requirement from
try_grab_page(), even though they look like sister functions.  It might be
helpful to have a better name, something like try_grab_folio_rcu(), in this
case.

Btw, none of the above cases (2-4) is a real bug; we're just looking for some
way to avoid triggering the sanity check, am I right?  I hope that besides
the host splash I didn't overlook any other side effect this issue would
cause, and the splash IIUC should so far be benign, as both gup slow (2,4)
and the newly added memfd_pin_folios() (3) appear to have the refcount
stabilized anyway.

Yang's patch in the other email looks sane to me; it's just that we'll then
add quite some code only to avoid this sanity check in paths 2-4, which
seems like slight overkill.

One thing I'm wondering is whether folio_ref_try_add_rcu() can get rid of
its RCU limitation. It boils down to whether we can use atomic_add_unless()
on a TINY_RCU / UP setup too.  I mean, we do plenty of similar things
(get_page_unless_zero(), etc.) in generic code, and I don't see why
folio_ref_try_add_rcu() needs to be treated specially here.

IOW, the assertion we added here:

	VM_BUG_ON(!in_atomic() && !irqs_disabled());

is because we need atomicity of the sequence below:

	VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
	folio_ref_add(folio, count);

But an atomic op avoids that.
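
Conceptually (just a sketch, the real implementation lives in page_ref.h and
the arch atomics), folio_ref_add_unless(folio, count, 0) boils down to a
cmpxchg loop, so the zero check and the add happen as one atomic step:

	int old = atomic_read(&folio->_refcount);

	do {
		if (old == 0)
			return false;	/* folio is being / has been freed */
	} while (!atomic_try_cmpxchg(&folio->_refcount, &old, old + count));

	return true;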

This chunk of code was (mostly) originally added in 2008 in commit
e286781d5f2e ("mm: speculative page references").

In short, I'm wondering whether something like the below would make sense
and be easier:

===8<===
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 1acf5bac7f50..c89a67d239d1 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -258,26 +258,9 @@ static inline bool folio_try_get(struct folio *folio)
 	return folio_ref_add_unless(folio, 1, 0);
 }
 
-static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
-{
-#ifdef CONFIG_TINY_RCU
-	/*
-	 * The caller guarantees the folio will not be freed from interrupt
-	 * context, so (on !SMP) we only need preemption to be disabled
-	 * and TINY_RCU does that for us.
-	 */
-# ifdef CONFIG_PREEMPT_COUNT
-	VM_BUG_ON(!in_atomic() && !irqs_disabled());
-# endif
-	VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
-	folio_ref_add(folio, count);
-#else
-	if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
-		/* Either the folio has been freed, or will be freed. */
-		return false;
-	}
-#endif
-	return true;
+static inline bool folio_ref_try_add(struct folio *folio, int count)
+{
+	return folio_ref_add_unless(folio, count, 0);
 }
 
 /**
@@ -305,7 +288,7 @@ static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
  */
 static inline bool folio_try_get_rcu(struct folio *folio)
 {
-	return folio_ref_try_add_rcu(folio, 1);
+	return folio_ref_try_add(folio, 1);
 }
 
 static inline int page_ref_freeze(struct page *page, int count)
diff --git a/mm/gup.c b/mm/gup.c
index e17466fd62bb..17f89e8d31f1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -78,7 +78,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 	folio = page_folio(page);
 	if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
 		return NULL;
-	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
+	if (unlikely(!folio_ref_try_add(folio, refs)))
 		return NULL;
 
 	/*
===8<===

So instead of adding new code, we fix it by removing some.  There might be
some implication for the TINY_RCU / UP config from using atomic_add_unless()
to replace one atomic_add(), but I'm not sure whether that's a major issue.

The benefit is that try_get_folio() then doesn't need a renaming, because
the RCU implication just went away.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31 23:24     ` Peter Xu
@ 2024-06-01  0:01       ` Yang Shi
  2024-06-01  0:59         ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-06-01  0:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, kernel test robot, Jason Gunthorpe,
	Vivek Kasireddy, Rik van Riel, oe-lkp, lkp, linux-kernel,
	Andrew Morton, Matthew Wilcox, Christopher Lameter, linux-mm

On Fri, May 31, 2024 at 4:25 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> >
> > Is called (mm-unstable) from:
> >
> > (1) gup_fast function, here IRQs are disable
> > (2) gup_hugepte(), possibly problematic
> > (3) memfd_pin_folios(), possibly problematic
> > (4) __get_user_pages(), likely problematic
> >
> > (1) should be fine.
> >
> > (2) is possibly problematic on the !fast path. If so, due to commit
> >     a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> >
> > (3) is possibly wrong. CCing Vivek.
> >
> > (4) is what we hit here
>
> I guess it was overlooked because try_grab_folio() didn't have any comment
> or implication on RCU or IRQ internal helpers being used, hence a bit
> confusing.  E.g. it has different context requirement on try_grab_page(),
> even though they look like sister functions.  It might be helpful to have a
> better name, something like try_grab_folio_rcu() in this case.
>
> Btw, none of above cases (2-4) have real bug, but we're looking at some way
> to avoid triggering the sanity check, am I right?  I hope besides the host
> splash I didn't overlook any other side effect this issue would cause, and
> the splash IIUC should so far be benign, as either gup slow (2,4) or the
> newly added memfd_pin_folios() (3) look like to have the refcount stablized
> anyway.
>
> Yang's patch in the other email looks sane to me, just that then we'll add
> quite some code just to avoid this sanity check in paths 2-4 which seems
> like an slight overkill.
>
> One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
> its RCU limitation. It boils down to whether we can use atomic_add_unless()
> on TINY_RCU / UP setup too?  I mean, we do plenty of similar things
> (get_page_unless_zero, etc.) in generic code and I don't understand why
> here we need to treat folio_ref_try_add_rcu() specially.
>
> IOW, the assertions here we added:
>
>         VM_BUG_ON(!in_atomic() && !irqs_disabled());
>
> Is because we need atomicity of below sequences:
>
>         VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
>         folio_ref_add(folio, count);
>
> But atomic ops avoids it.

Yeah, I can't think of why atomics couldn't do it either. But is it
written this way because we want to catch the refcount == 0 case,
since that would mean a severe bug? Did we miss something?

>
> This chunk of code was (mostly) originally added in 2008 in commit
> e286781d5f2e ("mm: speculative page references").
>
> In short, I'm wondering whether something like below would make sense and
> easier:
>
> ===8<===
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 1acf5bac7f50..c89a67d239d1 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -258,26 +258,9 @@ static inline bool folio_try_get(struct folio *folio)
>         return folio_ref_add_unless(folio, 1, 0);
>  }
>
> -static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> -{
> -#ifdef CONFIG_TINY_RCU
> -       /*
> -        * The caller guarantees the folio will not be freed from interrupt
> -        * context, so (on !SMP) we only need preemption to be disabled
> -        * and TINY_RCU does that for us.
> -        */
> -# ifdef CONFIG_PREEMPT_COUNT
> -       VM_BUG_ON(!in_atomic() && !irqs_disabled());
> -# endif
> -       VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> -       folio_ref_add(folio, count);
> -#else
> -       if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
> -               /* Either the folio has been freed, or will be freed. */
> -               return false;
> -       }
> -#endif
> -       return true;
> +static inline bool folio_ref_try_add(struct folio *folio, int count)
> +{
> +       return folio_ref_add_unless(folio, count, 0);
>  }
>
>  /**
> @@ -305,7 +288,7 @@ static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
>   */
>  static inline bool folio_try_get_rcu(struct folio *folio)
>  {
> -       return folio_ref_try_add_rcu(folio, 1);
> +       return folio_ref_try_add(folio, 1);
>  }
>
>  static inline int page_ref_freeze(struct page *page, int count)
> diff --git a/mm/gup.c b/mm/gup.c
> index e17466fd62bb..17f89e8d31f1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -78,7 +78,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>         folio = page_folio(page);
>         if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
>                 return NULL;
> -       if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
> +       if (unlikely(!folio_ref_try_add(folio, refs)))
>                 return NULL;
>
>         /*
> ===8<===
>
> So instead of adding new code, we fix it by removing some.  There might be
> some implication on TINY_RCU / UP config on using the atomic_add_unless()
> to replace one atomic_add(), but I'm not sure whether that's a major issue.
>
> The benefit is try_get_folio() then don't need a renaming then, because the
> rcu implication just went away.
>
> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-01  0:01       ` Yang Shi
@ 2024-06-01  0:59         ` Yang Shi
       [not found]           ` <0edfcfed-e8c4-4c46-bbce-528c07084792@redhat.com>
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-06-01  0:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, kernel test robot, Jason Gunthorpe,
	Vivek Kasireddy, Rik van Riel, oe-lkp, lkp, linux-kernel,
	Andrew Morton, Matthew Wilcox, Christopher Lameter, linux-mm

On Fri, May 31, 2024 at 5:01 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, May 31, 2024 at 4:25 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> > > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> > >
> > > Is called (mm-unstable) from:
> > >
> > > (1) gup_fast function, here IRQs are disable
> > > (2) gup_hugepte(), possibly problematic
> > > (3) memfd_pin_folios(), possibly problematic
> > > (4) __get_user_pages(), likely problematic
> > >
> > > (1) should be fine.
> > >
> > > (2) is possibly problematic on the !fast path. If so, due to commit
> > >     a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> > >
> > > (3) is possibly wrong. CCing Vivek.
> > >
> > > (4) is what we hit here
> >
> > I guess it was overlooked because try_grab_folio() didn't have any comment
> > or implication on RCU or IRQ internal helpers being used, hence a bit
> > confusing.  E.g. it has different context requirement on try_grab_page(),
> > even though they look like sister functions.  It might be helpful to have a
> > better name, something like try_grab_folio_rcu() in this case.
> >
> > Btw, none of above cases (2-4) have real bug, but we're looking at some way
> > to avoid triggering the sanity check, am I right?  I hope besides the host
> > splash I didn't overlook any other side effect this issue would cause, and
> > the splash IIUC should so far be benign, as either gup slow (2,4) or the
> > newly added memfd_pin_folios() (3) look like to have the refcount stablized
> > anyway.
> >
> > Yang's patch in the other email looks sane to me, just that then we'll add
> > quite some code just to avoid this sanity check in paths 2-4 which seems
> > like an slight overkill.
> >
> > One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
> > its RCU limitation. It boils down to whether we can use atomic_add_unless()
> > on TINY_RCU / UP setup too?  I mean, we do plenty of similar things
> > (get_page_unless_zero, etc.) in generic code and I don't understand why
> > here we need to treat folio_ref_try_add_rcu() specially.
> >
> > IOW, the assertions here we added:
> >
> >         VM_BUG_ON(!in_atomic() && !irqs_disabled());
> >
> > Is because we need atomicity of below sequences:
> >
> >         VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> >         folio_ref_add(folio, count);
> >
> > But atomic ops avoids it.
>
> Yeah, I didn't think of why atomic can't do it either. But is it
> written in this way because we want to catch the refcount == 0 case
> since it means a severe bug? Did we miss something?

Thought about it some more and disassembled the code. IIUC, this is an
optimization for non-SMP kernels. When in an RCU critical section or with
irqs disabled, we just need a plain atomic add instruction.
folio_ref_add_unless() would yield more instructions, including a branch
instruction. But I'm wondering how useful that is nowadays. Is it
really worth the complexity? AFAIK, ARM64, for example, has not
supported non-SMP kernels for years.

My patch actually replaced all folio_ref_add_unless() calls with
folio_ref_add() for the slow paths, so it is supposed to run faster, but
since we are already on the slow path it may not be measurable at all. So
having simpler and more readable code may outweigh the potential slight
performance gain in this case?
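
In other words, on UP the difference is roughly the following (illustrative
only): one unconditional atomic add versus an add-unless-zero that needs an
extra read and a conditional branch:

	/* current TINY_RCU branch: unconditional, no branch */
	folio_ref_add(folio, count);

	/* proposed replacement: may fail if the refcount already hit zero */
	if (!folio_ref_add_unless(folio, count, 0))
		return false;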

>
> >
> > This chunk of code was (mostly) originally added in 2008 in commit
> > e286781d5f2e ("mm: speculative page references").
> >
> > In short, I'm wondering whether something like below would make sense and
> > easier:
> >
> > ===8<===
> > diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > index 1acf5bac7f50..c89a67d239d1 100644
> > --- a/include/linux/page_ref.h
> > +++ b/include/linux/page_ref.h
> > @@ -258,26 +258,9 @@ static inline bool folio_try_get(struct folio *folio)
> >         return folio_ref_add_unless(folio, 1, 0);
> >  }
> >
> > -static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> > -{
> > -#ifdef CONFIG_TINY_RCU
> > -       /*
> > -        * The caller guarantees the folio will not be freed from interrupt
> > -        * context, so (on !SMP) we only need preemption to be disabled
> > -        * and TINY_RCU does that for us.
> > -        */
> > -# ifdef CONFIG_PREEMPT_COUNT
> > -       VM_BUG_ON(!in_atomic() && !irqs_disabled());
> > -# endif
> > -       VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> > -       folio_ref_add(folio, count);
> > -#else
> > -       if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
> > -               /* Either the folio has been freed, or will be freed. */
> > -               return false;
> > -       }
> > -#endif
> > -       return true;
> > +static inline bool folio_ref_try_add(struct folio *folio, int count)
> > +{
> > +       return folio_ref_add_unless(folio, count, 0);
> >  }
> >
> >  /**
> > @@ -305,7 +288,7 @@ static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> >   */
> >  static inline bool folio_try_get_rcu(struct folio *folio)
> >  {
> > -       return folio_ref_try_add_rcu(folio, 1);
> > +       return folio_ref_try_add(folio, 1);
> >  }
> >
> >  static inline int page_ref_freeze(struct page *page, int count)
> > diff --git a/mm/gup.c b/mm/gup.c
> > index e17466fd62bb..17f89e8d31f1 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -78,7 +78,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >         folio = page_folio(page);
> >         if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
> >                 return NULL;
> > -       if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
> > +       if (unlikely(!folio_ref_try_add(folio, refs)))
> >                 return NULL;
> >
> >         /*
> > ===8<===
> >
> > So instead of adding new code, we fix it by removing some.  There might be
> > some implication on TINY_RCU / UP config on using the atomic_add_unless()
> > to replace one atomic_add(), but I'm not sure whether that's a major issue.
> >
> > The benefit is try_get_folio() then don't need a renaming then, because the
> > rcu implication just went away.
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-05-31 20:57                 ` Yang Shi
@ 2024-06-03 14:02                   ` Oliver Sang
  2024-06-03 16:54                     ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Oliver Sang @ 2024-06-03 14:02 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm, oliver.sang

hi, Yang Shi,

On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> Hi Oliver,
> 
> I just came up with a quick patch (just build test) per the discussion
> and attached, can you please to give it a try? Once it is verified, I
> will refine the patch and submit for review.

what's the base of this patch? I tried to apply it on top of efa7df3e3b and
v6.10-rc2; both failed.

> 
> Thanks,
> Yang
> 
> >
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
       [not found]           ` <0edfcfed-e8c4-4c46-bbce-528c07084792@redhat.com>
@ 2024-06-03 15:08             ` Peter Xu
       [not found]               ` <8da12503-839d-459f-a2fa-4abd6d21935d@redhat.com>
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-06-03 15:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yang Shi, kernel test robot, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Sat, Jun 01, 2024 at 08:22:21AM +0200, David Hildenbrand wrote:
> On 01.06.24 02:59, Yang Shi wrote:
> > On Fri, May 31, 2024 at 5:01 PM Yang Shi <shy828301@gmail.com> wrote:
> > > 
> > > On Fri, May 31, 2024 at 4:25 PM Peter Xu <peterx@redhat.com> wrote:
> > > > 
> > > > On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> > > > > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> > > > > 
> > > > > Is called (mm-unstable) from:
> > > > > 
> > > > > (1) gup_fast function, here IRQs are disable
> > > > > (2) gup_hugepte(), possibly problematic
> > > > > (3) memfd_pin_folios(), possibly problematic
> > > > > (4) __get_user_pages(), likely problematic
> > > > > 
> > > > > (1) should be fine.
> > > > > 
> > > > > (2) is possibly problematic on the !fast path. If so, due to commit
> > > > >      a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> > > > > 
> > > > > (3) is possibly wrong. CCing Vivek.
> > > > > 
> > > > > (4) is what we hit here
> > > > 
> > > > I guess it was overlooked because try_grab_folio() didn't have any comment
> > > > or implication on RCU or IRQ internal helpers being used, hence a bit
> > > > confusing.  E.g. it has different context requirement on try_grab_page(),
> > > > even though they look like sister functions.  It might be helpful to have a
> > > > better name, something like try_grab_folio_rcu() in this case.
> > > > 
> > > > Btw, none of above cases (2-4) have real bug, but we're looking at some way
> > > > to avoid triggering the sanity check, am I right?  I hope besides the host
> > > > splash I didn't overlook any other side effect this issue would cause, and
> > > > the splash IIUC should so far be benign, as either gup slow (2,4) or the
> > > > newly added memfd_pin_folios() (3) look like to have the refcount stablized
> > > > anyway.
> > > > 
> > > > Yang's patch in the other email looks sane to me, just that then we'll add
> > > > quite some code just to avoid this sanity check in paths 2-4 which seems
> > > > like an slight overkill.
> > > > 
> > > > One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
> > > > its RCU limitation. It boils down to whether we can use atomic_add_unless()
> > > > on TINY_RCU / UP setup too?  I mean, we do plenty of similar things
> > > > (get_page_unless_zero, etc.) in generic code and I don't understand why
> > > > here we need to treat folio_ref_try_add_rcu() specially.
> > > > 
> > > > IOW, the assertions here we added:
> > > > 
> > > >          VM_BUG_ON(!in_atomic() && !irqs_disabled());
> > > > 
> > > > Is because we need atomicity of below sequences:
> > > > 
> > > >          VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> > > >          folio_ref_add(folio, count);
> > > > 
> > > > But atomic ops avoids it.
> > > 
> > > Yeah, I didn't think of why atomic can't do it either. But is it
> > > written in this way because we want to catch the refcount == 0 case
> > > since it means a severe bug? Did we miss something?
> > 
> > Thought more about it and disassembled the code. IIUC, this is an
> > optimization for non-SMP kernel. When in rcu critical section or irq
> > is disabled, we just need an atomic add instruction.
> > folio_ref_add_unless() would yield more instructions, including branch
> > instruction. But I'm wondering how useful it would be nowadays. Is it
> > really worth the complexity? AFAIK, for example, ARM64 has not
> > supported non-SMP kernel for years.
> > 
> > My patch actually replaced all folio_ref_add_unless() to
> > folio_ref_add() for slow paths, so it is supposed to run faster, but
> > we are already in slow path, it may be not measurable at all. So
> > having more simple and readable code may outweigh the potential slight
> > performance gain in this case?
> 
> Yes, we don't want to use atomic RMW that return values where we can use
> atomic RMW that don't return values. The former is slower and implies a
> memory barrier, that can be optimized out on some arcitectures (arm64 IIRC)
> 
> We should clean that up here, and make it clearer that the old function is
> only for grabbing a folio if it can be freed concurrently -- GUP-fast.

Note again that this only affects TINY_RCU, which mostly implies
!PREEMPTION and UP.  It's a matter of whether we prefer adding this bunch
of code to optimize for that.

Also, we haven't yet measured in a real workload how that "unless" plays
out when buried in other paths; that would need a special kernel build
first, and again I'm not sure whether it would be worthwhile.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-03 14:02                   ` Oliver Sang
@ 2024-06-03 16:54                     ` Yang Shi
  2024-06-04 23:53                       ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-06-03 16:54 UTC (permalink / raw)
  To: Oliver Sang
  Cc: David Hildenbrand, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yang Shi,
>
> On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > Hi Oliver,
> >
> > I just came up with a quick patch (just build test) per the discussion
> > and attached, can you please to give it a try? Once it is verified, I
> > will refine the patch and submit for review.
>
> what's the base of this patch? I tried to apply it upon efa7df3e3b or
> v6.10-rc2. both failed.

Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
have mentioned this.

>
> >
> > Thanks,
> > Yang
> >
> > >
> > > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
       [not found]                 ` <Zl4m-sAhZknHOHdb@x1n>
@ 2024-06-03 20:37                   ` David Hildenbrand
  2024-06-03 20:44                     ` Yang Shi
       [not found]                     ` <Zl4vlGJsbHiahYil@x1n>
  0 siblings, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-06-03 20:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yang Shi, kernel test robot, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

>> try_get_folio() is all about grabbing a folio that might get freed
>> concurrently. That's why it calls folio_ref_try_add_rcu() and does
>> complicated stuff.
> 
> IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
> 
> If we want to be crystal clear on that and if we think that's what we want,
> again I would suggest we rename it differently from try_get_page() to avoid
> future misuses, then add documents. We may want to also even assert the

Yes, absolutely.

> rcu/irq implications in try_get_folio() at entrance, then that'll be
> detected even without TINY_RCU config.
> 
>>
>> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
>> essentially an atomic_add_unless(), which in the worst case ends up being a
>> cmpxchg loop.
>>
>>
>> Stating that we should be using try_get_folio() in paths where we are sure
>> the folio refcount is not 0 is the same as using folio_try_get() where
>> folio_get() would be sufficient.
>>
>> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
>> using a function in the wrong context, although in our case, it is safe to
>> use (there is no BUG). Which is true, because we know we have a folio
>> reference and can simply use a simple folio_ref_add().
>>
>> Again, just like we have folio_get() and folio_try_get(), we should
>> distinguish in GUP whether we are adding more reference to a folio (and
>> effectively do what folio_get() would), or whether we are actually grabbing
>> a folio that could be freed concurrently (what folio_try_get() would do).
> 
> Yes we can.  Again, IMHO it's a matter of whether it will be worth it.
> 
> Note that even with SMP and even if we keep this code, the
> atomic_add_unless only affects gup slow on THP only, and even with that
> overhead it is much faster than before when that path was introduced.. and
> per my previous experience we don't care too much there, really.
> 
> So it's literally only three paths that are relevant here on the "unless"
> overhead:
> 
>    - gup slow on THP (I just added it; used to be even slower..)
> 
>    - Vivek's new path
> 
>    - hugepd (which may be gone for good in a few months..)
>    
> IMHO none of them has perf concerns.  The real perf-concern path is
> gup-fast when the pgtable entry exists, but that must use atomic_add_unless()
> anyway.  Even the gup-slow !THP case won't regress, as that uses try_get_page().

My point is primarily that we should be clear that the one thing is 
GUP-fast, and the other is for GUP-slow.

Sooner or later we'll see more code that uses try_grab_page() to be 
converted to folios, and people might naturally use try_grab_folio(), 
just like we did with Vivek's code.

And I don't think we'll want to make GUP-slow in general use 
try_grab_folio() in the future.

So ...

> 
> So again, IMHO the easiest way to fix this WARN is to drop the TINY_RCU bit,
> if nobody worries on UP perf.
> 
> I don't have a strong opinion, if any of us really worry about above three
> use cases on "unless" overhead, and think it worthwhile to add the code to
> support it, I won't object. But to me it's adding pain with no real benefit
> we could ever measure, and adding complexity to backport too since we'll
> need a fix for as old as 6.6.

... for the sake of fixing this WARN, I don't primarily care. Adjusting 
the TINY_RCU check feels wrong because I suspect somebody had good reasons to 
do it like that, and it actually reported something valuable (using the 
wrong function for the job).

In any case, if we take the easy route to fix the WARN, I'll come back 
and clean the functions here up properly.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-03 20:37                   ` David Hildenbrand
@ 2024-06-03 20:44                     ` Yang Shi
  2024-06-03 21:01                       ` David Hildenbrand
       [not found]                     ` <Zl4vlGJsbHiahYil@x1n>
  1 sibling, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-06-03 20:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, kernel test robot, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Mon, Jun 3, 2024 at 1:38 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> try_get_folio() is all about grabbing a folio that might get freed
> >> concurrently. That's why it calls folio_ref_try_add_rcu() and does
> >> complicated stuff.
> >
> > IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
> >
> > If we want to be crystal clear on that and if we think that's what we want,
> > again I would suggest we rename it differently from try_get_page() to avoid
> > future misuses, then add documents. We may want to also even assert the
>
> Yes, absolutely.
>
> > rcu/irq implications in try_get_folio() at entrance, then that'll be
> > detected even without TINY_RCU config.
> >
> >>
> >> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
> >> essentially a atomic_add_unless(), which in the worst case ends up being a
> >> cmpxchg loop.
> >>
> >>
> >> Stating that we should be using try_get_folio() in paths where we are sure
> >> the folio refcount is not 0 is the same as using folio_try_get() where
> >> folio_get() would be sufficient.
> >>
> >> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
> >> using a function in the wrong context, although in our case, it is safe to
> >> use (there is now BUG). Which is true, because we know we have a folio
> >> reference and can simply use a simple folio_ref_add().
> >>
> >> Again, just like we have folio_get() and folio_try_get(), we should
> >> distinguish in GUP whether we are adding more reference to a folio (and
> >> effectively do what folio_get() would), or whether we are actually grabbing
> >> a folio that could be freed concurrently (what folio_try_get() would do).
> >
> > Yes we can.  Again, IMHO it's a matter of whether it will worth it.
> >
> > Note that even with SMP and even if we keep this code, the
> > atomic_add_unless only affects gup slow on THP only, and even with that
> > overhead it is much faster than before when that path was introduced.. and
> > per my previous experience we don't care too much there, really.
> >
> > So it's literally only three paths that are relevant here on the "unless"
> > overhead:
> >
> >    - gup slow on THP (I just added it; used to be even slower..)
> >
> >    - vivik's new path
> >
> >    - hugepd (which may be gone for good in a few months..)
> >
> > IMHO none of them has perf concerns.  The real perf concern paths is
> > gup-fast when pgtable entry existed, but that must use atomic_add_unless()
> > anyway.  Even gup-slow !thp case won't regress as that uses try_get_page().
>
> My point is primarily that we should be clear that the one thing is
> GUP-fast, and the other is for GUP-slow.
>
> Sooner or later we'll see more code that uses try_grab_page() to be
> converted to folios, and people might naturally use try_grab_folio(),
> just like we did with Vivik's code.
>
> And I don't think we'll want to make GUP-slow in general using
> try_grab_folio() in the future.
>
> So ...
>
> >
> > So again, IMHO the easist way to fix this WARN is we drop the TINY_RCU bit,
> > if nobody worries on UP perf.
> >
> > I don't have a strong opinion, if any of us really worry about above three
> > use cases on "unless" overhead, and think it worthwhile to add the code to
> > support it, I won't object. But to me it's adding pain with no real benefit
> > we could ever measure, and adding complexity to backport too since we'll
> > need a fix for as old as 6.6.
>
> ... for the sake of fixing this WARN, I don't primarily care. Adjusting
> the TINY_RCU feels wrong because I suspect somebody had good reasons to
> do it like that, and it actually reported something valuable (using the
> wrong function for the job).

I think this is the major concern about which fix we should take. If that
TINY_RCU optimization still makes sense and is useful, we'd better
keep it. But I can't tell. Leaving it as is may be safer.

>
> In any case, if we take the easy route to fix the WARN, I'll come back
> and clean the functions here up properly.
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-03 20:44                     ` Yang Shi
@ 2024-06-03 21:01                       ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-06-03 21:01 UTC (permalink / raw)
  To: Yang Shi
  Cc: Peter Xu, kernel test robot, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On 03.06.24 22:44, Yang Shi wrote:
> On Mon, Jun 3, 2024 at 1:38 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> try_get_folio() is all about grabbing a folio that might get freed
>>>> concurrently. That's why it calls folio_ref_try_add_rcu() and does
>>>> complicated stuff.
>>>
>>> IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
>>>
>>> If we want to be crystal clear on that and if we think that's what we want,
>>> again I would suggest we rename it differently from try_get_page() to avoid
>>> future misuses, then add documents. We may want to also even assert the
>>
>> Yes, absolutely.
>>
>>> rcu/irq implications in try_get_folio() at entrance, then that'll be
>>> detected even without TINY_RCU config.
>>>
>>>>
>>>> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
>>>> essentially a atomic_add_unless(), which in the worst case ends up being a
>>>> cmpxchg loop.
>>>>
>>>>
>>>> Stating that we should be using try_get_folio() in paths where we are sure
>>>> the folio refcount is not 0 is the same as using folio_try_get() where
>>>> folio_get() would be sufficient.
>>>>
>>>> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
>>>> using a function in the wrong context, although in our case, it is safe to
>>>> use (there is now BUG). Which is true, because we know we have a folio
>>>> reference and can simply use a simple folio_ref_add().
>>>>
>>>> Again, just like we have folio_get() and folio_try_get(), we should
>>>> distinguish in GUP whether we are adding more reference to a folio (and
>>>> effectively do what folio_get() would), or whether we are actually grabbing
>>>> a folio that could be freed concurrently (what folio_try_get() would do).
>>>
>>> Yes we can.  Again, IMHO it's a matter of whether it will worth it.
>>>
>>> Note that even with SMP and even if we keep this code, the
>>> atomic_add_unless only affects gup slow on THP only, and even with that
>>> overhead it is much faster than before when that path was introduced.. and
>>> per my previous experience we don't care too much there, really.
>>>
>>> So it's literally only three paths that are relevant here on the "unless"
>>> overhead:
>>>
>>>     - gup slow on THP (I just added it; used to be even slower..)
>>>
>>>     - vivik's new path
>>>
>>>     - hugepd (which may be gone for good in a few months..)
>>>
>>> IMHO none of them has perf concerns.  The real perf concern paths is
>>> gup-fast when pgtable entry existed, but that must use atomic_add_unless()
>>> anyway.  Even gup-slow !thp case won't regress as that uses try_get_page().
>>
>> My point is primarily that we should be clear that the one thing is
>> GUP-fast, and the other is for GUP-slow.
>>
>> Sooner or later we'll see more code that uses try_grab_page() to be
>> converted to folios, and people might naturally use try_grab_folio(),
>> just like we did with Vivik's code.
>>
>> And I don't think we'll want to make GUP-slow in general using
>> try_grab_folio() in the future.
>>
>> So ...
>>
>>>
>>> So again, IMHO the easist way to fix this WARN is we drop the TINY_RCU bit,
>>> if nobody worries on UP perf.
>>>
>>> I don't have a strong opinion, if any of us really worry about above three
>>> use cases on "unless" overhead, and think it worthwhile to add the code to
>>> support it, I won't object. But to me it's adding pain with no real benefit
>>> we could ever measure, and adding complexity to backport too since we'll
>>> need a fix for as old as 6.6.
>>
>> ... for the sake of fixing this WARN, I don't primarily care. Adjusting
>> the TINY_RCU feels wrong because I suspect somebody had good reasons to
>> do it like that, and it actually reported something valuable (using the
>> wrong function for the job).
> 
> I think this is the major concern about what fix we should do. If that
> tiny rcu optimization still makes sense and is useful, we'd better
> keep it. But I can't tell. Leaving it as is may be safer.

Willy moved that code in 020853b6f5e and I think it dates back to e286781d5f2e.

That contained:

+       /*
+        * Preempt must be disabled here - we rely on rcu_read_lock doing
+        * this for us.
+        *
+        * Pagecache won't be truncated from interrupt context, so if we have
+        * found a page in the radix tree here, we have pinned its refcount by
+        * disabling preempt, and hence no need for the "speculative get" that
+        * SMP requires.
+        */
+       VM_BUG_ON(page_count(page) == 0);
+       atomic_inc(&page->_count);
+


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
       [not found]                     ` <Zl4vlGJsbHiahYil@x1n>
@ 2024-06-03 21:05                       ` David Hildenbrand
  2024-06-03 22:43                         ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-06-03 21:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yang Shi, kernel test robot, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm, Paul E . McKenney

On 03.06.24 23:03, Peter Xu wrote:
> On Mon, Jun 03, 2024 at 10:37:55PM +0200, David Hildenbrand wrote:
>>>> try_get_folio() is all about grabbing a folio that might get freed
>>>> concurrently. That's why it calls folio_ref_try_add_rcu() and does
>>>> complicated stuff.
>>>
>>> IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
>>>
>>> If we want to be crystal clear on that and if we think that's what we want,
>>> again I would suggest we rename it differently from try_get_page() to avoid
>>> future misuses, then add documents. We may want to also even assert the
>>
>> Yes, absolutely.
>>
>>> rcu/irq implications in try_get_folio() at entrance, then that'll be
>>> detected even without TINY_RCU config.
>>>
>>>>
>>>> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
>>>> essentially a atomic_add_unless(), which in the worst case ends up being a
>>>> cmpxchg loop.
>>>>
>>>>
>>>> Stating that we should be using try_get_folio() in paths where we are sure
>>>> the folio refcount is not 0 is the same as using folio_try_get() where
>>>> folio_get() would be sufficient.
>>>>
>>>> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
>>>> using a function in the wrong context, although in our case, it is safe to
>>>> use (there is now BUG). Which is true, because we know we have a folio
>>>> reference and can simply use a simple folio_ref_add().
>>>>
>>>> Again, just like we have folio_get() and folio_try_get(), we should
>>>> distinguish in GUP whether we are adding more reference to a folio (and
>>>> effectively do what folio_get() would), or whether we are actually grabbing
>>>> a folio that could be freed concurrently (what folio_try_get() would do).
>>>
>>> Yes we can.  Again, IMHO it's a matter of whether it will worth it.
>>>
>>> Note that even with SMP and even if we keep this code, the
>>> atomic_add_unless only affects gup slow on THP only, and even with that
>>> overhead it is much faster than before when that path was introduced.. and
>>> per my previous experience we don't care too much there, really.
>>>
>>> So it's literally only three paths that are relevant here on the "unless"
>>> overhead:
>>>
>>>     - gup slow on THP (I just added it; used to be even slower..)
>>>
>>>     - vivik's new path
>>>
>>>     - hugepd (which may be gone for good in a few months..)
>>> IMHO none of them has perf concerns.  The real perf concern paths is
>>> gup-fast when pgtable entry existed, but that must use atomic_add_unless()
>>> anyway.  Even gup-slow !thp case won't regress as that uses try_get_page().
>>
>> My point is primarily that we should be clear that the one thing is
>> GUP-fast, and the other is for GUP-slow.
> 
> Yes, understood.
> 
>>
>> Sooner or later we'll see more code that uses try_grab_page() to be
>> converted to folios, and people might naturally use try_grab_folio(), just
>> like we did with Vivik's code.
>>
>> And I don't think we'll want to make GUP-slow in general using
>> try_grab_folio() in the future.
>>
>> So ...
>>
>>>
>>> So again, IMHO the easist way to fix this WARN is we drop the TINY_RCU bit,
>>> if nobody worries on UP perf.
>>>
>>> I don't have a strong opinion, if any of us really worry about above three
>>> use cases on "unless" overhead, and think it worthwhile to add the code to
>>> support it, I won't object. But to me it's adding pain with no real benefit
>>> we could ever measure, and adding complexity to backport too since we'll
>>> need a fix for as old as 6.6.
>>
>> ... for the sake of fixing this WARN, I don't primarily care. Adjusting the
>> TINY_RCU feels wrong because I suspect somebody had good reasons to do it
>> like that, and it actually reported something valuable (using the wrong
>> function for the job).
>>
>> In any case, if we take the easy route to fix the WARN, I'll come back and
>> clean the functions here up properly.
> 
> Definitely, then there can also be some measurements which will be even
> better.  I mean, if the diff is minimal, we can be clearer on the path we
> choose; while if it shows improvements we have more solid results than
> predictions and discussions.
> 
> Yes I do worry about the UP change too, hence I was sincerely trying to
> collect some feedback.  My current guess is that UP was still important in
> 2008 when the code was first written, and maybe that has changed over the 16
> years. I just noticed Nicholas wrote it; I know he's still active so I've
> added him into the loop too.
> 
> Just for easier reference, the commit that introduced the UP change is:
> 
> commit e286781d5f2e9c846e012a39653a166e9d31777d
> Author: Nicholas Piggin <npiggin@gmail.com>
> Date:   Fri Jul 25 19:45:30 2008 -0700
> 
>      mm: speculative page references
> 
> +static inline int page_cache_get_speculative(struct page *page)
> +{
> +       VM_BUG_ON(in_interrupt());
> +
> +#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
> +# ifdef CONFIG_PREEMPT
> +       VM_BUG_ON(!in_atomic());
> +# endif
> +       /*
> +        * Preempt must be disabled here - we rely on rcu_read_lock doing
> +        * this for us.
> +        *
> +        * Pagecache won't be truncated from interrupt context, so if we have
> +        * found a page in the radix tree here, we have pinned its refcount by
> +        * disabling preempt, and hence no need for the "speculative get" that
> +        * SMP requires.
> +        */
> +       VM_BUG_ON(page_count(page) == 0);
> +       atomic_inc(&page->_count);
> +
> +#else
> +       if (unlikely(!get_page_unless_zero(page))) {
> +               /*
> +                * Either the page has been freed, or will be freed.
> +                * In either case, retry here and the caller should
> +                * do the right thing (see comments above).
> +                */
> +               return 0;
> +       }
> +#endif
> +       VM_BUG_ON(PageTail(page));
> +
> +       return 1;
> +}
> 
> Thanks,
> 

I chased it further to:

commit 8375ad98cc1defc36adf4a77d9ea1e71db51a371
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Mon Apr 29 15:06:13 2013 -0700

     vm: adjust ifdef for TINY_RCU
     
     There is an ifdef in page_cache_get_speculative() that checks for !SMP
     and TREE_RCU, which has been an impossible combination since the advent
     of TINY_RCU.  The ifdef enables a fastpath that is valid when preemption
     is disabled by rcu_read_lock() in UP systems, which is the case when
     TINY_RCU is enabled.  This commit therefore adjusts the ifdef to
     generate the fastpath when TINY_RCU is enabled.


Where Paul explicitly restored that fastpath for TINY_RCU instead of removing that code.

So maybe Paul can comment if that is still worth having. CCing him.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-03 21:05                       ` David Hildenbrand
@ 2024-06-03 22:43                         ` Paul E. McKenney
  2024-06-04 17:35                           ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2024-06-03 22:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Yang Shi, kernel test robot, Jason Gunthorpe,
	Vivek Kasireddy, Rik van Riel, oe-lkp, lkp, linux-kernel,
	Andrew Morton, Matthew Wilcox, Christopher Lameter, linux-mm

On Mon, Jun 03, 2024 at 11:05:44PM +0200, David Hildenbrand wrote:
> On 03.06.24 23:03, Peter Xu wrote:
> > On Mon, Jun 03, 2024 at 10:37:55PM +0200, David Hildenbrand wrote:
> > > > > try_get_folio() is all about grabbing a folio that might get freed
> > > > > concurrently. That's why it calls folio_ref_try_add_rcu() and does
> > > > > complicated stuff.
> > > > 
> > > > IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
> > > > 
> > > > If we want to be crystal clear on that and if we think that's what we want,
> > > > again I would suggest we rename it differently from try_get_page() to avoid
> > > > future misuses, then add documents. We may want to also even assert the
> > > 
> > > Yes, absolutely.
> > > 
> > > > rcu/irq implications in try_get_folio() at entrance, then that'll be
> > > > detected even without TINY_RCU config.
> > > > 
> > > > > 
> > > > > On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
> > > > > essentially a atomic_add_unless(), which in the worst case ends up being a
> > > > > cmpxchg loop.
> > > > > 
> > > > > 
> > > > > Stating that we should be using try_get_folio() in paths where we are sure
> > > > > the folio refcount is not 0 is the same as using folio_try_get() where
> > > > > folio_get() would be sufficient.
> > > > > 
> > > > > The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
> > > > > using a function in the wrong context, although in our case, it is safe to
> > > > > use (there is now BUG). Which is true, because we know we have a folio
> > > > > reference and can simply use a simple folio_ref_add().
> > > > > 
> > > > > Again, just like we have folio_get() and folio_try_get(), we should
> > > > > distinguish in GUP whether we are adding more reference to a folio (and
> > > > > effectively do what folio_get() would), or whether we are actually grabbing
> > > > > a folio that could be freed concurrently (what folio_try_get() would do).
> > > > 
> > > > Yes we can.  Again, IMHO it's a matter of whether it will worth it.
> > > > 
> > > > Note that even with SMP and even if we keep this code, the
> > > > atomic_add_unless only affects gup slow on THP only, and even with that
> > > > overhead it is much faster than before when that path was introduced.. and
> > > > per my previous experience we don't care too much there, really.
> > > > 
> > > > So it's literally only three paths that are relevant here on the "unless"
> > > > overhead:
> > > > 
> > > >     - gup slow on THP (I just added it; used to be even slower..)
> > > > 
> > > >     - vivik's new path
> > > > 
> > > >     - hugepd (which may be gone for good in a few months..)
> > > > IMHO none of them has perf concerns.  The real perf concern paths is
> > > > gup-fast when pgtable entry existed, but that must use atomic_add_unless()
> > > > anyway.  Even gup-slow !thp case won't regress as that uses try_get_page().
> > > 
> > > My point is primarily that we should be clear that the one thing is
> > > GUP-fast, and the other is for GUP-slow.
> > 
> > Yes, understood.
> > 
> > > 
> > > Sooner or later we'll see more code that uses try_grab_page() to be
> > > converted to folios, and people might naturally use try_grab_folio(), just
> > > like we did with Vivik's code.
> > > 
> > > And I don't think we'll want to make GUP-slow in general using
> > > try_grab_folio() in the future.
> > > 
> > > So ...
> > > 
> > > > 
> > > > So again, IMHO the easist way to fix this WARN is we drop the TINY_RCU bit,
> > > > if nobody worries on UP perf.
> > > > 
> > > > I don't have a strong opinion, if any of us really worry about above three
> > > > use cases on "unless" overhead, and think it worthwhile to add the code to
> > > > support it, I won't object. But to me it's adding pain with no real benefit
> > > > we could ever measure, and adding complexity to backport too since we'll
> > > > need a fix for as old as 6.6.
> > > 
> > > ... for the sake of fixing this WARN, I don't primarily care. Adjusting the
> > > TINY_RCU feels wrong because I suspect somebody had good reasons to do it
> > > like that, and it actually reported something valuable (using the wrong
> > > function for the job).
> > > 
> > > In any case, if we take the easy route to fix the WARN, I'll come back and
> > > clean the functions here up properly.
> > 
> > Definitely, then there can also be some measurements which will be even
> > better.  I mean, if the diff is minimal, we can be clearer on the path we
> > choose; while if it shows improvements we have more solid results than
> > predictions and discussions.
> > 
> > Yes I do worry about the UP change too, hence I sincerely was trying to
> > collect some feedback.  My current guess is the UP was still important in
> > 2008 when the code first wrote, and maybe it changed over the 16 years. I
> > just notice Nicolas wrote it; I know he's still active so I've added him
> > into the loop too.
> > 
> > Just for easier reference, the commit introduced the UP change is:
> > 
> > commit e286781d5f2e9c846e012a39653a166e9d31777d
> > Author: Nicholas Piggin <npiggin@gmail.com>
> > Date:   Fri Jul 25 19:45:30 2008 -0700
> > 
> >      mm: speculative page references
> > 
> > +static inline int page_cache_get_speculative(struct page *page)
> > +{
> > +       VM_BUG_ON(in_interrupt());
> > +
> > +#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
> > +# ifdef CONFIG_PREEMPT
> > +       VM_BUG_ON(!in_atomic());
> > +# endif
> > +       /*
> > +        * Preempt must be disabled here - we rely on rcu_read_lock doing
> > +        * this for us.
> > +        *
> > +        * Pagecache won't be truncated from interrupt context, so if we have
> > +        * found a page in the radix tree here, we have pinned its refcount by
> > +        * disabling preempt, and hence no need for the "speculative get" that
> > +        * SMP requires.
> > +        */
> > +       VM_BUG_ON(page_count(page) == 0);
> > +       atomic_inc(&page->_count);
> > +
> > +#else
> > +       if (unlikely(!get_page_unless_zero(page))) {
> > +               /*
> > +                * Either the page has been freed, or will be freed.
> > +                * In either case, retry here and the caller should
> > +                * do the right thing (see comments above).
> > +                */
> > +               return 0;
> > +       }
> > +#endif
> > +       VM_BUG_ON(PageTail(page));
> > +
> > +       return 1;
> > +}
> > 
> > Thanks,
> > 
> 
> I chased it further to:
> 
> commit 8375ad98cc1defc36adf4a77d9ea1e71db51a371
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Mon Apr 29 15:06:13 2013 -0700
> 
>     vm: adjust ifdef for TINY_RCU
>     There is an ifdef in page_cache_get_speculative() that checks for !SMP
>     and TREE_RCU, which has been an impossible combination since the advent
>     of TINY_RCU.  The ifdef enables a fastpath that is valid when preemption
>     is disabled by rcu_read_lock() in UP systems, which is the case when
>     TINY_RCU is enabled.  This commit therefore adjusts the ifdef to
>     generate the fastpath when TINY_RCU is enabled.
> 
> 
> Where Paul explicitly restored that fastpath for TINY_RCU instead of removing that code.
> 
> So maybe Paul can comment if that is still worth having. CCing him.

It is currently an atomic operation either way, though the folio_ref_add()
avoids full ordering, but that is immaterial on x86.  Some say that it is
in the noise on server-class ARMv8 as well, though they have also said
a great many things in the past.  But if that is true, the big benefit
of the TINY_RCU check is that folio_ref_try_add_rcu() is guaranteed not
to fail in that case (single CPU with preemption disabled).  Except that
everyone has to check the return value anyway, right?
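
Paraphrasing the current shape of that helper, based on the code quoted
earlier in the thread (a sketch, not the literal source -- the real
function also asserts it is called in a non-preemptible/RCU context,
which is the VM_BUG_ON this report trips):

	static inline bool ref_try_add_rcu(struct folio *folio, int count)
	{
	#ifdef CONFIG_TINY_RCU
		/*
		 * Single CPU with preemption disabled: the refcount cannot
		 * drop to zero under us, so a plain add suffices and this
		 * path can never fail.
		 */
		VM_BUG_ON(folio_ref_count(folio) == 0);
		folio_ref_add(folio, count);
		return true;
	#else
		/* SMP: the folio may be freed concurrently, so this can fail. */
		return folio_ref_add_unless(folio, count, 0);
	#endif
	}

	/* ... and yet every caller still has to handle a false return: */
	if (!ref_try_add_rcu(folio, 1))
		goto retry;	/* or bail out */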

So the usual advice, unsatisfying though it might be, is to remove that
#ifdef and see if anyone notices.

After all, both 2013 and 2008 were quite some time ago.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-03 22:43                         ` Paul E. McKenney
@ 2024-06-04 17:35                           ` Yang Shi
  0 siblings, 0 replies; 28+ messages in thread
From: Yang Shi @ 2024-06-04 17:35 UTC (permalink / raw)
  To: paulmck
  Cc: David Hildenbrand, Peter Xu, kernel test robot, Jason Gunthorpe,
	Vivek Kasireddy, Rik van Riel, oe-lkp, lkp, linux-kernel,
	Andrew Morton, Matthew Wilcox, Christopher Lameter, linux-mm

> >
> > I chased it further to:
> >
> > commit 8375ad98cc1defc36adf4a77d9ea1e71db51a371
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Mon Apr 29 15:06:13 2013 -0700
> >
> >     vm: adjust ifdef for TINY_RCU
> >     There is an ifdef in page_cache_get_speculative() that checks for !SMP
> >     and TREE_RCU, which has been an impossible combination since the advent
> >     of TINY_RCU.  The ifdef enables a fastpath that is valid when preemption
> >     is disabled by rcu_read_lock() in UP systems, which is the case when
> >     TINY_RCU is enabled.  This commit therefore adjusts the ifdef to
> >     generate the fastpath when TINY_RCU is enabled.
> >
> >
> > Where Paul explicitly restored that fastpath for TINY_RCU instead of removing that code.
> >
> > So maybe Paul can comment if that is still worth having. CCing him.
>
> It is currently an atomic operation either way, though the folio_ref_add()
> avoids full ordering, but that is immaterial on x86.  Some say that it is
> in the noise on server-class ARMv8 as well, though they have also said
> a great many things in the past.  But if that is true, the big benefit
> of the TINY_RCU check is that folio_ref_try_add_rcu() is guaranted not
> to fail in that case (single CPU with preemption disabled).  Except that
> everyone has to check the return value anyway, right?
>
> So the usual advice, unsatisfying though it might be, is to remove that
> #ifdef and see if anyone notices.
>
> After all, both 2013 and 2008 were quite some time ago.  ;-)

Thanks, Paul.

I will submit a patch to remove the #ifdef as the fix for the bug
report, and do the cleanup in a separate patch, which is preferred by
David.
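
Removing the #ifdef would leave only the speculative path for all
configs, roughly along these lines (a sketch based on the code quoted
above, not the actual patch):

	static inline bool ref_try_add_rcu(struct folio *folio, int count)
	{
		/* Works the same on UP and SMP; the TINY_RCU fastpath is gone. */
		if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
			/* Either the folio has been freed, or will be freed. */
			return false;
		}
		return true;
	}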

>
>                                                         Thanx, Paul


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-03 16:54                     ` Yang Shi
@ 2024-06-04 23:53                       ` Yang Shi
  2024-06-06  2:15                         ` Oliver Sang
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-06-04 23:53 UTC (permalink / raw)
  To: Oliver Sang
  Cc: David Hildenbrand, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm

On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <oliver.sang@intel.com> wrote:
> >
> > hi, Yang Shi,
> >
> > On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > > Hi Oliver,
> > >
> > > I just came up with a quick patch (just build test) per the discussion
> > > and attached, can you please give it a try? Once it is verified, I
> > > will refine the patch and submit for review.
> >
> > what's the base of this patch? I tried to apply it upon efa7df3e3b or
> > v6.10-rc2. both failed.
>
> Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
> swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
> have mentioned this.

I just figured out a bug in the patch. Anyway, we are going to take a
different approach to fix the issue per the discussion. I already sent
the series to the mailing list. Please refer to
https://lore.kernel.org/linux-mm/20240604234858.948986-1-yang@os.amperecomputing.com/

>
> >
> > >
> > > Thanks,
> > > Yang
> > >
> > > >
> > > > >
> > > > >
> > > > > --
> > > > > Cheers,
> > > > >
> > > > > David / dhildenb
> > > > >
> >
> >


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-04 23:53                       ` Yang Shi
@ 2024-06-06  2:15                         ` Oliver Sang
  2024-06-06  3:44                           ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Oliver Sang @ 2024-06-06  2:15 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand, Peter Xu, Jason Gunthorpe, Vivek Kasireddy,
	Rik van Riel, oe-lkp, lkp, linux-kernel, Andrew Morton,
	Matthew Wilcox, Christopher Lameter, linux-mm, oliver.sang

hi, Yang Shi,

On Tue, Jun 04, 2024 at 04:53:56PM -0700, Yang Shi wrote:
> On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <oliver.sang@intel.com> wrote:
> > >
> > > hi, Yang Shi,
> > >
> > > On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > > > Hi Oliver,
> > > >
> > > > I just came up with a quick patch (just build test) per the discussion
> > > > and attached, can you please to give it a try? Once it is verified, I
> > > > will refine the patch and submit for review.
> > >
> > > what's the base of this patch? I tried to apply it upon efa7df3e3b or
> > > v6.10-rc2. both failed.
> >
> > Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
> > swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
> > have mentioned this.
> 
> I just figured out a bug in the patch. Anyway, we are going to take a
> different approach to fix the issue per the discussion. I already sent
> the series to the mailing list. Please refer to
> https://lore.kernel.org/linux-mm/20240604234858.948986-1-yang@os.amperecomputing.com/

Got it. Seems you will submit a v2? Should we wait for v2 to do the tests?

Sorry that due to resource constraints, we cannot respond to test requests very
quickly now.

> 
> >
> > >
> > > >
> > > > Thanks,
> > > > Yang
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Cheers,
> > > > > >
> > > > > > David / dhildenb
> > > > > >
> > >
> > >
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-06  2:15                         ` Oliver Sang
@ 2024-06-06  3:44                           ` Yang Shi
  2024-06-12  6:01                             ` Oliver Sang
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-06-06  3:44 UTC (permalink / raw)
  To: Oliver Sang
  Cc: Andrew Morton, Christopher Lameter, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Peter Xu, Rik van Riel,
	Vivek Kasireddy, linux-kernel, linux-mm, lkp, oe-lkp

[-- Attachment #1: Type: text/plain, Size: 2074 bytes --]

Oliver Sang <oliver.sang@intel.com> wrote on Wed, Jun 5, 2024 at 19:16:

> hi, Yang Shi,
>
> On Tue, Jun 04, 2024 at 04:53:56PM -0700, Yang Shi wrote:
> > On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <oliver.sang@intel.com>
> wrote:
> > > >
> > > > hi, Yang Shi,
> > > >
> > > > On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > > > > Hi Oliver,
> > > > >
> > > > > I just came up with a quick patch (just build test) per the
> discussion
> > > > > and attached, can you please to give it a try? Once it is
> verified, I
> > > > > will refine the patch and submit for review.
> > > >
> > > > what's the base of this patch? I tried to apply it upon efa7df3e3b or
> > > > v6.10-rc2. both failed.
> > >
> > > Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
> > > swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
> > > have mentioned this.
> >
> > I just figured out a bug in the patch. Anyway, we are going to take a
> > different approach to fix the issue per the discussion. I already sent
> > the series to the mailing list. Please refer to
> >
> https://lore.kernel.org/linux-mm/20240604234858.948986-1-yang@os.amperecomputing.com/
>
> got it. seems you will submit v2? should we wait v2 to do the tests?


The real fix is patch #1, which doesn’t need a v2. So you just need to test
that.

For patch #2, I haven’t received any comments yet and I’m going to travel, so
I’m not going to submit v2 soon.

And I heard hugepd is going to be gone soon, so I may wait for that and then
rebase on top of it. Anyway, it is just a cleanup.



>
> sorry that due to resource constraint, we cannot respond test request very
> quickly now.
>
> >
> > >
> > > >
> > > > >
> > > > > Thanks,
> > > > > Yang
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Cheers,
> > > > > > >
> > > > > > > David / dhildenb
> > > > > > >
> > > >
> > > >
> >
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-06  3:44                           ` Yang Shi
@ 2024-06-12  6:01                             ` Oliver Sang
  2024-06-25 20:34                               ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Oliver Sang @ 2024-06-12  6:01 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Christopher Lameter, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Peter Xu, Rik van Riel,
	Vivek Kasireddy, linux-kernel, linux-mm, lkp, oe-lkp,
	oliver.sang

hi, Yang Shi,

On Wed, Jun 05, 2024 at 08:44:37PM -0700, Yang Shi wrote:
> Oliver Sang <oliver.sang@intel.com> wrote on Wed, Jun 5, 2024 at 19:16:
> 
> > hi, Yang Shi,
> >
> > On Tue, Jun 04, 2024 at 04:53:56PM -0700, Yang Shi wrote:
> > > On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <oliver.sang@intel.com>
> > wrote:
> > > > >
> > > > > hi, Yang Shi,
> > > > >
> > > > > On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > > > > > Hi Oliver,
> > > > > >
> > > > > > I just came up with a quick patch (just build test) per the
> > discussion
> > > > > > and attached, can you please to give it a try? Once it is
> > verified, I
> > > > > > will refine the patch and submit for review.
> > > > >
> > > > > what's the base of this patch? I tried to apply it upon efa7df3e3b or
> > > > > v6.10-rc2. both failed.
> > > >
> > > > Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
> > > > swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
> > > > have mentioned this.
> > >
> > > I just figured out a bug in the patch. Anyway, we are going to take a
> > > different approach to fix the issue per the discussion. I already sent
> > > the series to the mailing list. Please refer to
> > >
> > https://lore.kernel.org/linux-mm/20240604234858.948986-1-yang@os.amperecomputing.com/
> >
> > got it. seems you will submit v2? should we wait v2 to do the tests?
> 
> 
> The real fix is patch #1, that doesn’t need v2. So you just need to test
> that.

We've finished the tests and confirmed patch #1 fixed the issue.
We also tested on top of patch #2; still clean.

Our bot applied your patches on top of 306dde9ce5c951 as below:

5d45cc9b1beb57 mm: gup: do not call try_grab_folio() in slow path
fd3fc964468925 mm: page_ref: remove folio_try_get_rcu()
306dde9ce5c951 foo

On 306dde9ce5c951, we still observed the issue we reported; clean on both patch
#1 and #2.

306dde9ce5c9516d fd3fc96446892528af48d6271a3 5d45cc9b1beb57386992c005669
---------------- --------------------------- ---------------------------
       fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
           |             |             |             |             |
         47:50         -94%            :50         -94%            :50    dmesg.Kernel_panic-not_syncing:Fatal_exception
         47:50         -94%            :50         -94%            :50    dmesg.Oops:invalid_opcode:#[##]KASAN
         47:50         -94%            :50         -94%            :50    dmesg.RIP:try_get_folio
         47:50         -94%            :50         -94%            :50    dmesg.kernel_BUG_at_include/linux/page_ref.h


> 
> For patch #2, I haven’t received any comment yet and I’m going to travel so
> I’m not going to submit v2 soon .
> 
> And I heard if hugepd is going to be gone soon, so I may wait for that then
> rebase on top of it. Anyway it is just a clean up.
> 
> 
> 
> >
> > sorry that due to resource constraint, we cannot respond test request very
> > quickly now.
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Yang
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > David / dhildenb
> > > > > > > >
> > > > >
> > > > >
> > >
> >


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-12  6:01                             ` Oliver Sang
@ 2024-06-25 20:34                               ` Yang Shi
  2024-06-25 20:41                                 ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-06-25 20:34 UTC (permalink / raw)
  To: Oliver Sang
  Cc: Andrew Morton, Christopher Lameter, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Peter Xu, Rik van Riel,
	Vivek Kasireddy, linux-kernel, linux-mm, lkp, oe-lkp

On Tue, Jun 11, 2024 at 11:01 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yang Shi,
>
> On Wed, Jun 05, 2024 at 08:44:37PM -0700, Yang Shi wrote:
> > Oliver Sang <oliver.sang@intel.com> wrote on Wed, Jun 5, 2024 at 19:16:
> >
> > > hi, Yang Shi,
> > >
> > > On Tue, Jun 04, 2024 at 04:53:56PM -0700, Yang Shi wrote:
> > > > On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <oliver.sang@intel.com>
> > > wrote:
> > > > > >
> > > > > > hi, Yang Shi,
> > > > > >
> > > > > > On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > > > > > > Hi Oliver,
> > > > > > >
> > > > > > > I just came up with a quick patch (just build test) per the
> > > discussion
> > > > > > > and attached, can you please to give it a try? Once it is
> > > verified, I
> > > > > > > will refine the patch and submit for review.
> > > > > >
> > > > > > what's the base of this patch? I tried to apply it upon efa7df3e3b or
> > > > > > v6.10-rc2. both failed.
> > > > >
> > > > > Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
> > > > > swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
> > > > > have mentioned this.
> > > >
> > > > I just figured out a bug in the patch. Anyway, we are going to take a
> > > > different approach to fix the issue per the discussion. I already sent
> > > > the series to the mailing list. Please refer to
> > > >
> > > https://lore.kernel.org/linux-mm/20240604234858.948986-1-yang@os.amperecomputing.com/
> > >
> > > got it. seems you will submit v2? should we wait v2 to do the tests?
> >
> >
> > The real fix is patch #1, that doesn’t need v2. So you just need to test
> > that.
>
> we've finished tests and confirmed patch #1 fixed the issue.
> we also tested upon patch #2, still clean.

Thanks for testing. Sorry for the late reply, just came back from
vacation. It seems like Andrew didn't take the fix yet. I will resend
the patch with your tested-by tag. And I will drop patch #2 since
it is just a cleanup and I didn't receive any review comments. In
addition, the ongoing hugepd cleanup may make this cleanup
easier, so I will put the cleanup on the back burner for now.

>
> our bot applied your patch upon 306dde9ce5c951 as below
>
> 5d45cc9b1beb57 mm: gup: do not call try_grab_folio() in slow path
> fd3fc964468925 mm: page_ref: remove folio_try_get_rcu()
> 306dde9ce5c951 foo
>
> on 306dde9ce5c951, we still observed the issue we reported. clean on both patch
> #1 and #2
>
> 306dde9ce5c9516d fd3fc96446892528af48d6271a3 5d45cc9b1beb57386992c005669
> ---------------- --------------------------- ---------------------------
>        fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
>            |             |             |             |             |
>          47:50         -94%            :50         -94%            :50    dmesg.Kernel_panic-not_syncing:Fatal_exception
>          47:50         -94%            :50         -94%            :50    dmesg.Oops:invalid_opcode:#[##]KASAN
>          47:50         -94%            :50         -94%            :50    dmesg.RIP:try_get_folio
>          47:50         -94%            :50         -94%            :50    dmesg.kernel_BUG_at_include/linux/page_ref.h
>
>
> >
> > For patch #2, I haven’t received any comment yet and I’m going to travel so
> > I’m not going to submit v2 soon .
> >
> > And I heard if hugepd is going to be gone soon, so I may wait for that then
> > rebase on top of it. Anyway it is just a clean up.
> >
> >
> >
> > >
> > > sorry that due to resource constraint, we cannot respond test request very
> > > quickly now.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yang
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > David / dhildenb
> > > > > > > > >
> > > > > >
> > > > > >
> > > >
> > >


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-25 20:34                               ` Yang Shi
@ 2024-06-25 20:41                                 ` David Hildenbrand
  2024-06-25 20:53                                   ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-06-25 20:41 UTC (permalink / raw)
  To: Yang Shi, Oliver Sang
  Cc: Andrew Morton, Christopher Lameter, Jason Gunthorpe,
	Matthew Wilcox, Peter Xu, Rik van Riel, Vivek Kasireddy,
	linux-kernel, linux-mm, lkp, oe-lkp

On 25.06.24 22:34, Yang Shi wrote:
> On Tue, Jun 11, 2024 at 11:01 PM Oliver Sang <oliver.sang@intel.com> wrote:
>>
>> hi, Yang Shi,
>>
>> On Wed, Jun 05, 2024 at 08:44:37PM -0700, Yang Shi wrote:
>>> Oliver Sang <oliver.sang@intel.com> wrote on Wed, Jun 5, 2024 at 19:16:
>>>
>>>> hi, Yang Shi,
>>>>
>>>> On Tue, Jun 04, 2024 at 04:53:56PM -0700, Yang Shi wrote:
>>>>> On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
>>>>>>
>>>>>> On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <oliver.sang@intel.com>
>>>> wrote:
>>>>>>>
>>>>>>> hi, Yang Shi,
>>>>>>>
>>>>>>> On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
>>>>>>>> Hi Oliver,
>>>>>>>>
>>>>>>>> I just came up with a quick patch (just build test) per the
>>>> discussion
>>>>>>>> and attached, can you please to give it a try? Once it is
>>>> verified, I
>>>>>>>> will refine the patch and submit for review.
>>>>>>>
>>>>>>> what's the base of this patch? I tried to apply it upon efa7df3e3b or
>>>>>>> v6.10-rc2. both failed.
>>>>>>
>>>>>> Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
>>>>>> swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
>>>>>> have mentioned this.
>>>>>
>>>>> I just figured out a bug in the patch. Anyway, we are going to take a
>>>>> different approach to fix the issue per the discussion. I already sent
>>>>> the series to the mailing list. Please refer to
>>>>>
>>>> https://lore.kernel.org/linux-mm/20240604234858.948986-1-yang@os.amperecomputing.com/
>>>>
>>>> got it. seems you will submit v2? should we wait v2 to do the tests?
>>>
>>>
>>> The real fix is patch #1, that doesn’t need v2. So you just need to test
>>> that.
>>
>> we've finished tests and confirmed patch #1 fixed the issue.
>> we also tested upon patch #2, still clean.
> 
> Thanks for testing. Sorry for the late reply, just came back from
> vacation. It seems like Andrew didn't take the fix yet. I will resend
> the patch with your tested-by tag. And I will drop the patch #2 since
> it is just a clean up and I didn't receive any review comments. In
> addition, the undergoing hugepd clean up may make this clean up
> easier, so I will put the clean up on the back burner for now.

Sorry, I was expecting a v2 of the second patch. But agreed that posting 
the second patch separately is reasonable -- possibly after the hugepd 
stuff is gone for good.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h
  2024-06-25 20:41                                 ` David Hildenbrand
@ 2024-06-25 20:53                                   ` Yang Shi
  0 siblings, 0 replies; 28+ messages in thread
From: Yang Shi @ 2024-06-25 20:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oliver Sang, Andrew Morton, Christopher Lameter, Jason Gunthorpe,
	Matthew Wilcox, Peter Xu, Rik van Riel, Vivek Kasireddy,
	linux-kernel, linux-mm, lkp, oe-lkp

On Tue, Jun 25, 2024 at 1:41 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 25.06.24 22:34, Yang Shi wrote:
> > On Tue, Jun 11, 2024 at 11:01 PM Oliver Sang <oliver.sang@intel.com> wrote:
> >>
> >> hi, Yang Shi,
> >>
> >> On Wed, Jun 05, 2024 at 08:44:37PM -0700, Yang Shi wrote:
> >>> Oliver Sang <oliver.sang@intel.com> wrote on Wed, Jun 5, 2024 at 19:16:
> >>>
> >>>> hi, Yang Shi,
> >>>>
> >>>> On Tue, Jun 04, 2024 at 04:53:56PM -0700, Yang Shi wrote:
> >>>>> On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
> >>>>>>
> >>>>>> On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <oliver.sang@intel.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>> hi, Yang Shi,
> >>>>>>>
> >>>>>>> On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> >>>>>>>> Hi Oliver,
> >>>>>>>>
> >>>>>>>> I just came up with a quick patch (just build test) per the
> >>>> discussion
> >>>>>>>> and attached, can you please to give it a try? Once it is
> >>>> verified, I
> >>>>>>>> will refine the patch and submit for review.
> >>>>>>>
> >>>>>>> what's the base of this patch? I tried to apply it upon efa7df3e3b or
> >>>>>>> v6.10-rc2. both failed.
> >>>>>>
> >>>>>> Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
> >>>>>> swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
> >>>>>> have mentioned this.
> >>>>>
> >>>>> I just figured out a bug in the patch. Anyway, we are going to take a
> >>>>> different approach to fix the issue per the discussion. I already sent
> >>>>> the series to the mailing list. Please refer to
> >>>>>
> >>>> https://lore.kernel.org/linux-mm/20240604234858.948986-1-yang@os.amperecomputing.com/
> >>>>
> >>>> got it. seems you will submit v2? should we wait v2 to do the tests?
> >>>
> >>>
> >>> The real fix is patch #1, that doesn’t need v2. So you just need to test
> >>> that.
> >>
> >> we've finished tests and confirmed patch #1 fixed the issue.
> >> we also tested upon patch #2, still clean.
> >
> > Thanks for testing. Sorry for the late reply, just came back from
> > vacation. It seems like Andrew didn't take the fix yet. I will resend
> > the patch with your tested-by tag. And I will drop the patch #2 since
> > it is just a clean up and I didn't receive any review comments. In
> > addition, the undergoing hugepd clean up may make this clean up
> > easier, so I will put the clean up on the back burner for now.
>
> Sorry, was expecting a v2 of the second patch. But agreed that posting
> the second patch separately is reasonable -- possibly after the hugepd
> stuff is gone for good.

I can revisit the cleanup once the hugepd stuff is done.

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-06-25 20:53 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-31  8:24 [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h kernel test robot
2024-05-31 16:50 ` Yang Shi
     [not found]   ` <890e5a79-8574-4a24-90ab-b9888968d5e5@redhat.com>
2024-05-31 18:07     ` Yang Shi
2024-05-31 18:13       ` Yang Shi
2024-05-31 18:24         ` David Hildenbrand
2024-05-31 18:30           ` Yang Shi
2024-05-31 18:38             ` David Hildenbrand
2024-05-31 19:06               ` Yang Shi
2024-05-31 20:57                 ` Yang Shi
2024-06-03 14:02                   ` Oliver Sang
2024-06-03 16:54                     ` Yang Shi
2024-06-04 23:53                       ` Yang Shi
2024-06-06  2:15                         ` Oliver Sang
2024-06-06  3:44                           ` Yang Shi
2024-06-12  6:01                             ` Oliver Sang
2024-06-25 20:34                               ` Yang Shi
2024-06-25 20:41                                 ` David Hildenbrand
2024-06-25 20:53                                   ` Yang Shi
2024-05-31 23:24     ` Peter Xu
2024-06-01  0:01       ` Yang Shi
2024-06-01  0:59         ` Yang Shi
     [not found]           ` <0edfcfed-e8c4-4c46-bbce-528c07084792@redhat.com>
2024-06-03 15:08             ` Peter Xu
     [not found]               ` <8da12503-839d-459f-a2fa-4abd6d21935d@redhat.com>
     [not found]                 ` <Zl4m-sAhZknHOHdb@x1n>
2024-06-03 20:37                   ` David Hildenbrand
2024-06-03 20:44                     ` Yang Shi
2024-06-03 21:01                       ` David Hildenbrand
     [not found]                     ` <Zl4vlGJsbHiahYil@x1n>
2024-06-03 21:05                       ` David Hildenbrand
2024-06-03 22:43                         ` Paul E. McKenney
2024-06-04 17:35                           ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox