* [PATCH] mm/memfd: reserve hugetlb folios before allocation
@ 2025-01-07 7:25 Vivek Kasireddy
2025-01-07 9:36 ` David Hildenbrand
2025-01-07 17:12 ` Steven Sistare
0 siblings, 2 replies; 8+ messages in thread
From: Vivek Kasireddy @ 2025-01-07 7:25 UTC (permalink / raw)
To: linux-mm
Cc: Vivek Kasireddy, syzbot+a504cb5bae4fe117ba94, Steve Sistare,
Muchun Song, David Hildenbrand, Andrew Morton
There are cases when we try to pin a folio but discover that it has
not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
but there is a chance that we might encounter a crash/failure
(VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
at that instant. This issue was reported by syzbot:
kernel BUG at mm/hugetlb.c:2403!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Therefore, to avoid this situation and fix this issue, we just need
to make a reservation before we try to allocate the folio. While at
it, also remove the VM_BUG_ON() as there is no need to crash the
system in this scenario and instead we could just fail the allocation.
Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
mm/hugetlb.c | 9 ++++++---
mm/memfd.c | 5 +++++
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c498874a7170..e46c461210a4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
struct folio *folio;
spin_lock_irq(&hugetlb_lock);
+ if (!h->resv_huge_pages) {
+ spin_unlock_irq(&hugetlb_lock);
+ return NULL;
+ }
+
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
nmask);
- if (folio) {
- VM_BUG_ON(!h->resv_huge_pages);
+ if (folio)
h->resv_huge_pages--;
- }
spin_unlock_irq(&hugetlb_lock);
return folio;
diff --git a/mm/memfd.c b/mm/memfd.c
index 35a370d75c9a..a3012c444285 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
idx >>= huge_page_order(h);
+ if (!hugetlb_reserve_pages(file_inode(memfd),
+ idx, idx + 1, NULL, 0))
+ return ERR_PTR(-ENOMEM);
+
folio = alloc_hugetlb_folio_reserve(h,
numa_node_id(),
NULL,
@@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
folio_unlock(folio);
return folio;
}
+ hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
return ERR_PTR(-ENOMEM);
}
#endif
--
2.47.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
2025-01-07 7:25 [PATCH] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy
@ 2025-01-07 9:36 ` David Hildenbrand
2025-01-08 6:59 ` Kasireddy, Vivek
2025-01-07 17:12 ` Steven Sistare
1 sibling, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2025-01-07 9:36 UTC (permalink / raw)
To: Vivek Kasireddy, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Steve Sistare, Muchun Song, Andrew Morton
On 07.01.25 08:25, Vivek Kasireddy wrote:
> There are cases when we try to pin a folio but discover that it has
> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> but there is a chance that we might encounter a crash/failure
> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> at that instant. This issue was reported by syzbot:
>
> kernel BUG at mm/hugetlb.c:2403!
> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> vfs_ioctl fs/ioctl.c:51 [inline]
> __do_sys_ioctl fs/ioctl.c:906 [inline]
> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Therefore, to avoid this situation and fix this issue, we just need
> to make a reservation before we try to allocate the folio. While at
> it, also remove the VM_BUG_ON() as there is no need to crash the
> system in this scenario and instead we could just fail the allocation.
>
> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
> Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Cc: Steve Sistare <steven.sistare@oracle.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> mm/hugetlb.c | 9 ++++++---
> mm/memfd.c | 5 +++++
> 2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c498874a7170..e46c461210a4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> struct folio *folio;
>
> spin_lock_irq(&hugetlb_lock);
> + if (!h->resv_huge_pages) {
Should this be a "if (WARN_ON_ONCE(!h->resv_huge_pages)) {", because the
"_reserve" in the function indicates that this is indeed something that
must never happen?
> + spin_unlock_irq(&hugetlb_lock);
> + return NULL;
> + }
> +
> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
> nmask);
> - if (folio) {
> - VM_BUG_ON(!h->resv_huge_pages);
> + if (folio)
> h->resv_huge_pages--;
> - }
>
> spin_unlock_irq(&hugetlb_lock);
> return folio;
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 35a370d75c9a..a3012c444285 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> idx >>= huge_page_order(h);
>
> + if (!hugetlb_reserve_pages(file_inode(memfd),
> + idx, idx + 1, NULL, 0))
> + return ERR_PTR(-ENOMEM);
> +
> folio = alloc_hugetlb_folio_reserve(h,
> numa_node_id(),
> NULL,
> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> folio_unlock(folio);
> return folio;
> }
> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
> return ERR_PTR(-ENOMEM);
Staring at hugetlb_reserve_pages() I assume this will also work as
expected if already reserved.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
2025-01-07 7:25 [PATCH] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy
2025-01-07 9:36 ` David Hildenbrand
@ 2025-01-07 17:12 ` Steven Sistare
2025-01-08 7:24 ` Kasireddy, Vivek
1 sibling, 1 reply; 8+ messages in thread
From: Steven Sistare @ 2025-01-07 17:12 UTC (permalink / raw)
To: Vivek Kasireddy, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand,
Andrew Morton
On 1/7/2025 2:25 AM, Vivek Kasireddy wrote:
> There are cases when we try to pin a folio but discover that it has
> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> but there is a chance that we might encounter a crash/failure
> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> at that instant. This issue was reported by syzbot:
>
> kernel BUG at mm/hugetlb.c:2403!
> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> vfs_ioctl fs/ioctl.c:51 [inline]
> __do_sys_ioctl fs/ioctl.c:906 [inline]
> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Therefore, to avoid this situation and fix this issue, we just need
> to make a reservation before we try to allocate the folio. While at
> it, also remove the VM_BUG_ON() as there is no need to crash the
> system in this scenario and instead we could just fail the allocation.
>
> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
> Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Cc: Steve Sistare <steven.sistare@oracle.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> mm/hugetlb.c | 9 ++++++---
> mm/memfd.c | 5 +++++
> 2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c498874a7170..e46c461210a4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> struct folio *folio;
>
> spin_lock_irq(&hugetlb_lock);
> + if (!h->resv_huge_pages) {
> + spin_unlock_irq(&hugetlb_lock);
> + return NULL;
> + }
This should be the entire fix, plus deleting the VM_BUG_ON. See below.
> +
> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
> nmask);
> - if (folio) {
> - VM_BUG_ON(!h->resv_huge_pages);
> + if (folio)
> h->resv_huge_pages--;
> - }
>
> spin_unlock_irq(&hugetlb_lock);
> return folio;
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 35a370d75c9a..a3012c444285 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> idx >>= huge_page_order(h);
>
> + if (!hugetlb_reserve_pages(file_inode(memfd),
> + idx, idx + 1, NULL, 0))
> + return ERR_PTR(-ENOMEM);
I believe it is wrong to force a reservation here. Pages should have already been
reserved at this point, eg by calls from hugetlbfs_file_mmap or hugetlb_file_setup.
syzcaller has forced its way down this path without calling those pre-requisites,
doing weird stuff as it should.
To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.
- Steve
> +
> folio = alloc_hugetlb_folio_reserve(h,
> numa_node_id(),
> NULL,
> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> folio_unlock(folio);
> return folio;
> }
> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
> return ERR_PTR(-ENOMEM);
> }
> #endif
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH] mm/memfd: reserve hugetlb folios before allocation
2025-01-07 9:36 ` David Hildenbrand
@ 2025-01-08 6:59 ` Kasireddy, Vivek
0 siblings, 0 replies; 8+ messages in thread
From: Kasireddy, Vivek @ 2025-01-08 6:59 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Steve Sistare, Muchun Song, Andrew Morton
Hi David,
> > There are cases when we try to pin a folio but discover that it has
> > not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> > but there is a chance that we might encounter a crash/failure
> > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> > at that instant. This issue was reported by syzbot:
> >
> > kernel BUG at mm/hugetlb.c:2403!
> > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> > FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> > knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <TASK>
> > memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> > memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> > udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> > udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> > udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> > udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> > vfs_ioctl fs/ioctl.c:51 [inline]
> > __do_sys_ioctl fs/ioctl.c:906 [inline]
> > __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> > do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Therefore, to avoid this situation and fix this issue, we just need
> > to make a reservation before we try to allocate the folio. While at
> > it, also remove the VM_BUG_ON() as there is no need to crash the
> > system in this scenario and instead we could just fail the allocation.
> >
> > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages
> leak")
> > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > Cc: Steve Sistare <steven.sistare@oracle.com>
> > Cc: Muchun Song <muchun.song@linux.dev>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > mm/hugetlb.c | 9 ++++++---
> > mm/memfd.c | 5 +++++
> > 2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c498874a7170..e46c461210a4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct
> hstate *h, int preferred_nid,
> > struct folio *folio;
> >
> > spin_lock_irq(&hugetlb_lock);
> > + if (!h->resv_huge_pages) {
>
> Should this be a "if (WARN_ON_ONCE(!h->resv_huge_pages)) {", because the
> "_reserve" in the function indicates that this is indeed something that
> must never happen?
Yes, adding a WARN_ON_ONCE makes sense in this case. I'll add it in v2.
>
> > + spin_unlock_irq(&hugetlb_lock);
> > + return NULL;
> > + }
> > +
> > folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
> preferred_nid,
> > nmask);
> > - if (folio) {
> > - VM_BUG_ON(!h->resv_huge_pages);
> > + if (folio)
> > h->resv_huge_pages--;
> > - }
> >
> > spin_unlock_irq(&hugetlb_lock);
> > return folio;
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 35a370d75c9a..a3012c444285 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> > gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> > idx >>= huge_page_order(h);
> >
> > + if (!hugetlb_reserve_pages(file_inode(memfd),
> > + idx, idx + 1, NULL, 0))
> > + return ERR_PTR(-ENOMEM);
> > +
> > folio = alloc_hugetlb_folio_reserve(h,
> > numa_node_id(),
> > NULL,
> > @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> > folio_unlock(folio);
> > return folio;
> > }
> > + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
> > return ERR_PTR(-ENOMEM);
>
> Staring at hugetlb_reserve_pages() I assume this will also work as
> expected if already reserved.
IIUC, if a range is already reserved, and when hugetlb_reserve_pages()
gets called for the same range, it would mostly become a no-op as
region_chg() and region_add() return 0 (based on my light testing).
Thanks,
Vivek
>
>
> --
> Cheers,
>
> David / dhildenb
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH] mm/memfd: reserve hugetlb folios before allocation
2025-01-07 17:12 ` Steven Sistare
@ 2025-01-08 7:24 ` Kasireddy, Vivek
2025-01-09 19:01 ` Steven Sistare
0 siblings, 1 reply; 8+ messages in thread
From: Kasireddy, Vivek @ 2025-01-08 7:24 UTC (permalink / raw)
To: Steven Sistare, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand,
Andrew Morton
Hi Steve,
> > There are cases when we try to pin a folio but discover that it has
> > not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> > but there is a chance that we might encounter a crash/failure
> > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> > at that instant. This issue was reported by syzbot:
> >
> > kernel BUG at mm/hugetlb.c:2403!
> > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> > FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> > knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <TASK>
> > memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> > memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> > udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> > udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> > udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> > udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> > vfs_ioctl fs/ioctl.c:51 [inline]
> > __do_sys_ioctl fs/ioctl.c:906 [inline]
> > __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> > do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Therefore, to avoid this situation and fix this issue, we just need
> > to make a reservation before we try to allocate the folio. While at
> > it, also remove the VM_BUG_ON() as there is no need to crash the
> > system in this scenario and instead we could just fail the allocation.
> >
> > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages
> leak")
> > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > Cc: Steve Sistare <steven.sistare@oracle.com>
> > Cc: Muchun Song <muchun.song@linux.dev>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > mm/hugetlb.c | 9 ++++++---
> > mm/memfd.c | 5 +++++
> > 2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c498874a7170..e46c461210a4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct
> hstate *h, int preferred_nid,
> > struct folio *folio;
> >
> > spin_lock_irq(&hugetlb_lock);
> > + if (!h->resv_huge_pages) {
> > + spin_unlock_irq(&hugetlb_lock);
> > + return NULL;
> > + }
>
> This should be the entire fix, plus deleting the VM_BUG_ON. See below.
>
> > +
> > folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
> preferred_nid,
> > nmask);
> > - if (folio) {
> > - VM_BUG_ON(!h->resv_huge_pages);
> > + if (folio)
> > h->resv_huge_pages--;
> > - }
> >
> > spin_unlock_irq(&hugetlb_lock);
> > return folio;
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 35a370d75c9a..a3012c444285 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> > gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> > idx >>= huge_page_order(h);
> >
> > + if (!hugetlb_reserve_pages(file_inode(memfd),
> > + idx, idx + 1, NULL, 0))
> > + return ERR_PTR(-ENOMEM);
>
> I believe it is wrong to force a reservation here.
Is there any particular reason why you believe a reservation here would be wrong?
AFAICS, at the moment, we are not doing any region/subpool accounting before
our folio allocation and this gets flagged in the form of elevated resv_huge_pages
value (hugetlb_acct_memory() elevates it based on the return value of region_del())
when hugetlb_unreserve_pages() eventually gets called.
> Pages should have already been
> reserved at this point, eg by calls from hugetlbfs_file_mmap or hugetlb_file_setup.
hugetlb_file_setup() does not reserve any pages as it passes in VM_NORESERVE.
And, the case we are trying to address is exactly when hugetlbfs_file_mmap() does
not get called before pinning. So, when hugetlbfs_file_mmap() does eventually
get called, I don't see any problem if it calls hugetlb_reserve_pages() again for the
same range or overlapping ranges.
> syzcaller has forced its way down this path without calling those pre-requisites,
> doing weird stuff as it should.
This issue is not very hard to reproduce. If we have free_huge_pages > 0 and
resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/
hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we
should be able to allocate a folio in this scenario (as available_huge_pages > 0),
which we would not be able to do if we don't call hugetlb_reserve_pages().
Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in
this case and kind of gives a go-ahead for the allocation.
I have used a slightly modified udmabuf selftest to reproduce this issue which
I'll send out as part of v2.
Thanks,
Vivek
>
> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.
>
> - Steve
>
> > +
> > folio = alloc_hugetlb_folio_reserve(h,
> > numa_node_id(),
> > NULL,
> > @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> > folio_unlock(folio);
> > return folio;
> > }
> > + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
> > return ERR_PTR(-ENOMEM);
> > }
> > #endif
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
2025-01-08 7:24 ` Kasireddy, Vivek
@ 2025-01-09 19:01 ` Steven Sistare
2025-01-10 6:17 ` Kasireddy, Vivek
0 siblings, 1 reply; 8+ messages in thread
From: Steven Sistare @ 2025-01-09 19:01 UTC (permalink / raw)
To: Kasireddy, Vivek, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand,
Andrew Morton
On 1/8/2025 2:24 AM, Kasireddy, Vivek wrote:
> Hi Steve,
>
>>> There are cases when we try to pin a folio but discover that it has
>>> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
>>> but there is a chance that we might encounter a crash/failure
>>> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
>>> at that instant. This issue was reported by syzbot:
>>>
>>> kernel BUG at mm/hugetlb.c:2403!
>>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
>>> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
>>> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>>> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
>>> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
>>> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
>>> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
>>> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
>>> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
>>> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
>>> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
>>> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
>>> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
>>> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
>>> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
>>> knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>> <TASK>
>>> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
>>> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
>>> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
>>> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
>>> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
>>> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
>>> vfs_ioctl fs/ioctl.c:51 [inline]
>>> __do_sys_ioctl fs/ioctl.c:906 [inline]
>>> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
>>> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>>> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>>> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>>
>>> Therefore, to avoid this situation and fix this issue, we just need
>>> to make a reservation before we try to allocate the folio. While at
>>> it, also remove the VM_BUG_ON() as there is no need to crash the
>>> system in this scenario and instead we could just fail the allocation.
>>>
>>> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages
>> leak")
>>> Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
>>> Cc: Steve Sistare <steven.sistare@oracle.com>
>>> Cc: Muchun Song <muchun.song@linux.dev>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> ---
>>> mm/hugetlb.c | 9 ++++++---
>>> mm/memfd.c | 5 +++++
>>> 2 files changed, 11 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index c498874a7170..e46c461210a4 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct
>> hstate *h, int preferred_nid,
>>> struct folio *folio;
>>>
>>> spin_lock_irq(&hugetlb_lock);
>>> + if (!h->resv_huge_pages) {
>>> + spin_unlock_irq(&hugetlb_lock);
>>> + return NULL;
>>> + }
>>
>> This should be the entire fix, plus deleting the VM_BUG_ON. See below.
>>
>>> +
>>> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
>> preferred_nid,
>>> nmask);
>>> - if (folio) {
>>> - VM_BUG_ON(!h->resv_huge_pages);
>>> + if (folio)
>>> h->resv_huge_pages--;
>>> - }
>>>
>>> spin_unlock_irq(&hugetlb_lock);
>>> return folio;
>>> diff --git a/mm/memfd.c b/mm/memfd.c
>>> index 35a370d75c9a..a3012c444285 100644
>>> --- a/mm/memfd.c
>>> +++ b/mm/memfd.c
>>> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd,
>> pgoff_t idx)
>>> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
>>> idx >>= huge_page_order(h);
>>>
>>> + if (!hugetlb_reserve_pages(file_inode(memfd),
>>> + idx, idx + 1, NULL, 0))
>>> + return ERR_PTR(-ENOMEM);
>>
>> I believe it is wrong to force a reservation here.
> Is there any particular reason why you believe a reservation here would be wrong?
> AFAICS, at the moment, we are not doing any region/subpool accounting before
> our folio allocation and this gets flagged in the form of elevated resv_huge_pages
> value (hugetlb_acct_memory() elevates it based on the return value of region_del())
> when hugetlb_unreserve_pages() eventually gets called.
>
>> Pages should have already been
>> reserved at this point, eg by calls from hugetlbfs_file_mmap or hugetlb_file_setup.
> hugetlb_file_setup() does not reserve any pages as it passes in VM_NORESERVE.
> And, the case we are trying to address is exactly when hugetlbfs_file_mmap() does
> not get called before pinning.
But you must not break the case where hugetlbfs_file_mmap was called first, which
reserves, then memfd_alloc_folio is called, which reserves again with your fix. Does
that work correctly, or do the counts go bad?
> So, when hugetlbfs_file_mmap() does eventually
> get called, I don't see any problem if it calls hugetlb_reserve_pages() again for the
> same range or overlapping ranges.
Does that work correctly, or do the counts go bad?
Please try those scenarios with your test program: mmap + memfd_alloc_folio, and
memfd_alloc_folio + mmap.
- Steve
>> syzcaller has forced its way down this path without calling those pre-requisites,
>> doing weird stuff as it should.
> This issue is not very hard to reproduce. If we have free_huge_pages > 0 and
> resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/
> hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we
> should be able to allocate a folio in this scenario (as available_huge_pages > 0),
> which we would not be able to do if we don't call hugetlb_reserve_pages().
> Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in
> this case and kind of gives a go-ahead for the allocation.
>
> I have used a slightly modified udmabuf selftest to reproduce this issue which
> I'll send out as part of v2.
>
> Thanks,
> Vivek
>
>>
>> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.
>>
>> - Steve
>>
>>> +
>>> folio = alloc_hugetlb_folio_reserve(h,
>>> numa_node_id(),
>>> NULL,
>>> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd,
>> pgoff_t idx)
>>> folio_unlock(folio);
>>> return folio;
>>> }
>>> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
>>> return ERR_PTR(-ENOMEM);
>>> }
>>> #endif
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH] mm/memfd: reserve hugetlb folios before allocation
2025-01-09 19:01 ` Steven Sistare
@ 2025-01-10 6:17 ` Kasireddy, Vivek
2025-01-10 15:22 ` Steven Sistare
0 siblings, 1 reply; 8+ messages in thread
From: Kasireddy, Vivek @ 2025-01-10 6:17 UTC (permalink / raw)
To: Steven Sistare, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand,
Andrew Morton
Hi Steve,
> Subject: Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
>
> >>> There are cases when we try to pin a folio but discover that it has
> >>> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> >>> but there is a chance that we might encounter a crash/failure
> >>> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> >>> at that instant. This issue was reported by syzbot:
> >>>
> >>> kernel BUG at mm/hugetlb.c:2403!
> >>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> >>> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> >>> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> >>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> >>> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> >>> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> >>> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> >>> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> >>> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> >>> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> >>> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> >>> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> >>> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> >>> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> >>> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> >>> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> >>> knlGS:0000000000000000
> >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4:
> 0000000000352ef0
> >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>> Call Trace:
> >>> <TASK>
> >>> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> >>> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> >>> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> >>> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> >>> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> >>> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> >>> vfs_ioctl fs/ioctl.c:51 [inline]
> >>> __do_sys_ioctl fs/ioctl.c:906 [inline]
> >>> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> >>> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >>> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> >>> entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >>>
> >>> Therefore, to avoid this situation and fix this issue, we just need
> >>> to make a reservation before we try to allocate the folio. While at
> >>> it, also remove the VM_BUG_ON() as there is no need to crash the
> >>> system in this scenario and instead we could just fail the allocation.
> >>>
> >>> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios
> resv_huge_pages
> >> leak")
> >>> Reported-by:
> syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> >>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> >>> Cc: Steve Sistare <steven.sistare@oracle.com>
> >>> Cc: Muchun Song <muchun.song@linux.dev>
> >>> Cc: David Hildenbrand <david@redhat.com>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> ---
> >>> mm/hugetlb.c | 9 ++++++---
> >>> mm/memfd.c | 5 +++++
> >>> 2 files changed, 11 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>> index c498874a7170..e46c461210a4 100644
> >>> --- a/mm/hugetlb.c
> >>> +++ b/mm/hugetlb.c
> >>> @@ -2397,12 +2397,15 @@ struct folio
> *alloc_hugetlb_folio_reserve(struct
> >> hstate *h, int preferred_nid,
> >>> struct folio *folio;
> >>>
> >>> spin_lock_irq(&hugetlb_lock);
> >>> + if (!h->resv_huge_pages) {
> >>> + spin_unlock_irq(&hugetlb_lock);
> >>> + return NULL;
> >>> + }
> >>
> >> This should be the entire fix, plus deleting the VM_BUG_ON. See below.
> >>
> >>> +
> >>> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
> >> preferred_nid,
> >>> nmask);
> >>> - if (folio) {
> >>> - VM_BUG_ON(!h->resv_huge_pages);
> >>> + if (folio)
> >>> h->resv_huge_pages--;
> >>> - }
> >>>
> >>> spin_unlock_irq(&hugetlb_lock);
> >>> return folio;
> >>> diff --git a/mm/memfd.c b/mm/memfd.c
> >>> index 35a370d75c9a..a3012c444285 100644
> >>> --- a/mm/memfd.c
> >>> +++ b/mm/memfd.c
> >>> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file
> *memfd,
> >> pgoff_t idx)
> >>> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> >>> idx >>= huge_page_order(h);
> >>>
> >>> + if (!hugetlb_reserve_pages(file_inode(memfd),
> >>> + idx, idx + 1, NULL, 0))
> >>> + return ERR_PTR(-ENOMEM);
> >>
> >> I believe it is wrong to force a reservation here.
> > Is there any particular reason why you believe a reservation here would be
> wrong?
> > AFAICS, at the moment, we are not doing any region/subpool accounting
> before
> > our folio allocation and this gets flagged in the form of elevated
> resv_huge_pages
> > value (hugetlb_acct_memory() elevates it based on the return value of
> region_del())
> > when hugetlb_unreserve_pages() eventually gets called.
> >
> >> Pages should have already been
> >> reserved at this point, eg by calls from hugetlbfs_file_mmap or
> hugetlb_file_setup.
> > hugetlb_file_setup() does not reserve any pages as it passes in
> VM_NORESERVE.
> > And, the case we are trying to address is exactly when
> hugetlbfs_file_mmap() does
> > not get called before pinning.
>
> But you must not break the case where hugetlbfs_file_mmap was called first, which
AFAICS, this patch does not break the primary use-case where hugetlbfs_file_mmap()
gets called first.
> reserves, then memfd_alloc_folio is called, which reserves again with your fix. Does
> that work correctly, or do the counts go bad?
Yes it works correctly and ` cat /proc/meminfo|grep Huge` does not show any
unexpected counts.
>
> > So, when hugetlbfs_file_mmap() does eventually
> > get called, I don't see any problem if it calls hugetlb_reserve_pages() again
> for the
> > same range or overlapping ranges.
>
> Does that work correctly, or do the counts go bad?
It works correctly as expected. And, as mentioned to David in another email,
if a range is already reserved, and when hugetlb_reserve_pages() gets called
again for the same range (in this use-case), it would mostly become a no-op
as region_chg() and region_add() return 0 (based on my light testing).
>
> Please try those scenarios with your test program: mmap +
> memfd_alloc_folio, and
> memfd_alloc_folio + mmap.
Here are the scenarios I tried (with the calls in the exact order as below):
1) hugetlbfs_file_mmap()
memfd_alloc_folio()
hugetlb_fault()
2) memfd_alloc_folio()
hugetlbfs_file_mmap()
hugetlb_fault()
3) hugetlbfs_file_mmap()
hugetlb_fault()
alloc_hugetlb_folio()
memfd_alloc_folio() does not get called in the last case as the folio is already
allocated (and is present in the page cache). While testing all these cases, I
did not notice any splats or messed-up counts.
If it is not too much trouble, could you please try this patch with your test-cases?
Thanks,
Vivek
>
> - Steve
>
> >> syzcaller has forced its way down this path without calling those pre-
> requisites,
> >> doing weird stuff as it should.
> > This issue is not very hard to reproduce. If we have free_huge_pages > 0
> and
> > resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/
> > hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we
> > should be able to allocate a folio in this scenario (as available_huge_pages
> > 0),
> > which we would not be able to do if we don't call hugetlb_reserve_pages().
> > Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in
> > this case and kind of gives a go-ahead for the allocation.
> >
> > I have used a slightly modified udmabuf selftest to reproduce this issue
> which
> > I'll send out as part of v2.
> >
> > Thanks,
> > Vivek
> >
> >>
> >> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.
> >>
> >> - Steve
> >>
> >>> +
> >>> folio = alloc_hugetlb_folio_reserve(h,
> >>> numa_node_id(),
> >>> NULL,
> >>> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file
> *memfd,
> >> pgoff_t idx)
> >>> folio_unlock(folio);
> >>> return folio;
> >>> }
> >>> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
> >>> return ERR_PTR(-ENOMEM);
> >>> }
> >>> #endif
> >
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
2025-01-10 6:17 ` Kasireddy, Vivek
@ 2025-01-10 15:22 ` Steven Sistare
0 siblings, 0 replies; 8+ messages in thread
From: Steven Sistare @ 2025-01-10 15:22 UTC (permalink / raw)
To: Kasireddy, Vivek, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand,
Andrew Morton
On 1/10/2025 1:17 AM, Kasireddy, Vivek wrote:
> Hi Steve,
>
>> Subject: Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
>>
>>>>> There are cases when we try to pin a folio but discover that it has
>>>>> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
>>>>> but there is a chance that we might encounter a crash/failure
>>>>> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
>>>>> at that instant. This issue was reported by syzbot:
>>>>>
>>>>> kernel BUG at mm/hugetlb.c:2403!
>>>>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
>>>>> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
>>>>> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
>>>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>>>>> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
>>>>> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
>>>>> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
>>>>> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
>>>>> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
>>>>> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
>>>>> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
>>>>> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
>>>>> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
>>>>> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
>>>>> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
>>>>> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
>>>>> knlGS:0000000000000000
>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4:
>> 0000000000352ef0
>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> Call Trace:
>>>>> <TASK>
>>>>> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
>>>>> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
>>>>> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
>>>>> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
>>>>> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
>>>>> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
>>>>> vfs_ioctl fs/ioctl.c:51 [inline]
>>>>> __do_sys_ioctl fs/ioctl.c:906 [inline]
>>>>> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
>>>>> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>>>>> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>>>>> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>>>>
>>>>> Therefore, to avoid this situation and fix this issue, we just need
>>>>> to make a reservation before we try to allocate the folio. While at
>>>>> it, also remove the VM_BUG_ON() as there is no need to crash the
>>>>> system in this scenario and instead we could just fail the allocation.
>>>>>
>>>>> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios
>> resv_huge_pages
>>>> leak")
>>>>> Reported-by:
>> syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
>>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
>>>>> Cc: Steve Sistare <steven.sistare@oracle.com>
>>>>> Cc: Muchun Song <muchun.song@linux.dev>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>> ---
>>>>> mm/hugetlb.c | 9 ++++++---
>>>>> mm/memfd.c | 5 +++++
>>>>> 2 files changed, 11 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index c498874a7170..e46c461210a4 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -2397,12 +2397,15 @@ struct folio
>> *alloc_hugetlb_folio_reserve(struct
>>>> hstate *h, int preferred_nid,
>>>>> struct folio *folio;
>>>>>
>>>>> spin_lock_irq(&hugetlb_lock);
>>>>> + if (!h->resv_huge_pages) {
>>>>> + spin_unlock_irq(&hugetlb_lock);
>>>>> + return NULL;
>>>>> + }
>>>>
>>>> This should be the entire fix, plus deleting the VM_BUG_ON. See below.
>>>>
>>>>> +
>>>>> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
>>>> preferred_nid,
>>>>> nmask);
>>>>> - if (folio) {
>>>>> - VM_BUG_ON(!h->resv_huge_pages);
>>>>> + if (folio)
>>>>> h->resv_huge_pages--;
>>>>> - }
>>>>>
>>>>> spin_unlock_irq(&hugetlb_lock);
>>>>> return folio;
>>>>> diff --git a/mm/memfd.c b/mm/memfd.c
>>>>> index 35a370d75c9a..a3012c444285 100644
>>>>> --- a/mm/memfd.c
>>>>> +++ b/mm/memfd.c
>>>>> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file
>> *memfd,
>>>> pgoff_t idx)
>>>>> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
>>>>> idx >>= huge_page_order(h);
>>>>>
>>>>> + if (!hugetlb_reserve_pages(file_inode(memfd),
>>>>> + idx, idx + 1, NULL, 0))
>>>>> + return ERR_PTR(-ENOMEM);
>>>>
>>>> I believe it is wrong to force a reservation here.
>>> Is there any particular reason why you believe a reservation here would be
>> wrong?
>>> AFAICS, at the moment, we are not doing any region/subpool accounting
>> before
>>> our folio allocation and this gets flagged in the form of elevated
>> resv_huge_pages
>>> value (hugetlb_acct_memory() elevates it based on the return value of
>> region_del())
>>> when hugetlb_unreserve_pages() eventually gets called.
>>>
>>>> Pages should have already been
>>>> reserved at this point, eg by calls from hugetlbfs_file_mmap or
>> hugetlb_file_setup.
>>> hugetlb_file_setup() does not reserve any pages as it passes in
>> VM_NORESERVE.
>>> And, the case we are trying to address is exactly when
>> hugetlbfs_file_mmap() does
>>> not get called before pinning.
>>
>> But you must not break the case where hugetlbfs_file_mmap was called first, which
> AFAICS, this patch does not break the primary use-case where hugetlbfs_file_mmap()
> gets called first.
>
>> reserves, then memfd_alloc_folio is called, which reserves again with your fix. Does
>> that work correctly, or do the counts go bad?
> Yes it works correctly and ` cat /proc/meminfo|grep Huge` does not show any
> unexpected counts.
>
>>
>>> So, when hugetlbfs_file_mmap() does eventually
>>> get called, I don't see any problem if it calls hugetlb_reserve_pages() again
>> for the
>>> same range or overlapping ranges.
>>
>> Does that work correctly, or do the counts go bad?
> It works correctly as expected. And, as mentioned to David in another email,
> if a range is already reserved, and when hugetlb_reserve_pages() gets called
> again for the same range (in this use-case), it would mostly become a no-op
> as region_chg() and region_add() return 0 (based on my light testing).
Great, thanks.
>> Please try those scenarios with your test program: mmap +
>> memfd_alloc_folio, and
>> memfd_alloc_folio + mmap.
> Here are the scenarios I tried (with the calls in the exact order as below):
> 1) hugetlbfs_file_mmap()
> memfd_alloc_folio()
> hugetlb_fault()
>
> 2) memfd_alloc_folio()
> hugetlbfs_file_mmap()
> hugetlb_fault()
>
> 3) hugetlbfs_file_mmap()
> hugetlb_fault()
> alloc_hugetlb_folio()
>
> memfd_alloc_folio() does not get called in the last case as the folio is already
> allocated (and is present in the page cache). While testing all these cases, I
> did not notice any splats or messed-up counts.
>
> If it is not too much trouble, could you please try this patch with your test-cases?
Done, and all tests pass. I will review your V2 patch.
Thanks for working on this bug.
- Steve
>>>> syzcaller has forced its way down this path without calling those pre-
>> requisites,
>>>> doing weird stuff as it should.
>>> This issue is not very hard to reproduce. If we have free_huge_pages > 0
>> and
>>> resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/
>>> hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we
>>> should be able to allocate a folio in this scenario (as available_huge_pages
>>> 0),
>>> which we would not be able to do if we don't call hugetlb_reserve_pages().
>>> Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in
>>> this case and kind of gives a go-ahead for the allocation.
>>>
>>> I have used a slightly modified udmabuf selftest to reproduce this issue
>> which
>>> I'll send out as part of v2.
>>>
>>> Thanks,
>>> Vivek
>>>
>>>>
>>>> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.
>>>>
>>>> - Steve
>>>>
>>>>> +
>>>>> folio = alloc_hugetlb_folio_reserve(h,
>>>>> numa_node_id(),
>>>>> NULL,
>>>>> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file
>> *memfd,
>>>> pgoff_t idx)
>>>>> folio_unlock(folio);
>>>>> return folio;
>>>>> }
>>>>> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
>>>>> return ERR_PTR(-ENOMEM);
>>>>> }
>>>>> #endif
>>>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-01-10 15:23 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-07 7:25 [PATCH] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy
2025-01-07 9:36 ` David Hildenbrand
2025-01-08 6:59 ` Kasireddy, Vivek
2025-01-07 17:12 ` Steven Sistare
2025-01-08 7:24 ` Kasireddy, Vivek
2025-01-09 19:01 ` Steven Sistare
2025-01-10 6:17 ` Kasireddy, Vivek
2025-01-10 15:22 ` Steven Sistare
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox