* [PATCH] mm/memfd: reserve hugetlb folios before allocation
@ 2025-01-07 7:25 Vivek Kasireddy
2025-01-07 9:36 ` David Hildenbrand
2025-01-07 17:12 ` Steven Sistare
0 siblings, 2 replies; 8+ messages in thread
From: Vivek Kasireddy @ 2025-01-07 7:25 UTC (permalink / raw)
To: linux-mm
Cc: Vivek Kasireddy, syzbot+a504cb5bae4fe117ba94, Steve Sistare,
Muchun Song, David Hildenbrand, Andrew Morton
There are cases when we try to pin a folio but discover that it has
not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
but there is a chance that we might encounter a crash/failure
(VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
at that instant. This issue was reported by syzbot:
kernel BUG at mm/hugetlb.c:2403!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Therefore, to avoid this situation and fix this issue, we just need
to make a reservation before we try to allocate the folio. While at
it, also remove the VM_BUG_ON() as there is no need to crash the
system in this scenario and instead we could just fail the allocation.
Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
mm/hugetlb.c | 9 ++++++---
mm/memfd.c | 5 +++++
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c498874a7170..e46c461210a4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
struct folio *folio;
spin_lock_irq(&hugetlb_lock);
+ if (!h->resv_huge_pages) {
+ spin_unlock_irq(&hugetlb_lock);
+ return NULL;
+ }
+
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
nmask);
- if (folio) {
- VM_BUG_ON(!h->resv_huge_pages);
+ if (folio)
h->resv_huge_pages--;
- }
spin_unlock_irq(&hugetlb_lock);
return folio;
diff --git a/mm/memfd.c b/mm/memfd.c
index 35a370d75c9a..a3012c444285 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
idx >>= huge_page_order(h);
+ if (!hugetlb_reserve_pages(file_inode(memfd),
+ idx, idx + 1, NULL, 0))
+ return ERR_PTR(-ENOMEM);
+
folio = alloc_hugetlb_folio_reserve(h,
numa_node_id(),
NULL,
@@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
folio_unlock(folio);
return folio;
}
+ hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
return ERR_PTR(-ENOMEM);
}
#endif
--
2.47.1
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation 2025-01-07 7:25 [PATCH] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy @ 2025-01-07 9:36 ` David Hildenbrand 2025-01-08 6:59 ` Kasireddy, Vivek 2025-01-07 17:12 ` Steven Sistare 1 sibling, 1 reply; 8+ messages in thread From: David Hildenbrand @ 2025-01-07 9:36 UTC (permalink / raw) To: Vivek Kasireddy, linux-mm Cc: syzbot+a504cb5bae4fe117ba94, Steve Sistare, Muchun Song, Andrew Morton On 07.01.25 08:25, Vivek Kasireddy wrote: > There are cases when we try to pin a folio but discover that it has > not been faulted-in. So, we try to allocate it in memfd_alloc_folio() > but there is a chance that we might encounter a crash/failure > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations > at that instant. This issue was reported by syzbot: > > kernel BUG at mm/hugetlb.c:2403! > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 > FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 > memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 > udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] > udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 > udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] > udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 > vfs_ioctl fs/ioctl.c:51 [inline] > __do_sys_ioctl fs/ioctl.c:906 [inline] > __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Therefore, to avoid this situation and fix this issue, we just need > to make a reservation before we try to allocate the folio. While at > it, also remove the VM_BUG_ON() as there is no need to crash the > system in this scenario and instead we could just fail the allocation. > > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak") > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > Cc: Steve Sistare <steven.sistare@oracle.com> > Cc: Muchun Song <muchun.song@linux.dev> > Cc: David Hildenbrand <david@redhat.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > mm/hugetlb.c | 9 ++++++--- > mm/memfd.c | 5 +++++ > 2 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index c498874a7170..e46c461210a4 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid, > struct folio *folio; > > spin_lock_irq(&hugetlb_lock); > + if (!h->resv_huge_pages) { Should this be a "if (WARN_ON_ONCE(!h->resv_huge_pages)) {", because the "_reserve" in the function indicates that this is indeed something that must never happen? > + spin_unlock_irq(&hugetlb_lock); > + return NULL; > + } > + > folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid, > nmask); > - if (folio) { > - VM_BUG_ON(!h->resv_huge_pages); > + if (folio) > h->resv_huge_pages--; > - } > > spin_unlock_irq(&hugetlb_lock); > return folio; > diff --git a/mm/memfd.c b/mm/memfd.c > index 35a370d75c9a..a3012c444285 100644 > --- a/mm/memfd.c > +++ b/mm/memfd.c > @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); > idx >>= huge_page_order(h); > > + if (!hugetlb_reserve_pages(file_inode(memfd), > + idx, idx + 1, NULL, 0)) > + return ERR_PTR(-ENOMEM); > + > folio = alloc_hugetlb_folio_reserve(h, > numa_node_id(), > NULL, > @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > folio_unlock(folio); > return folio; > } > + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1); > return ERR_PTR(-ENOMEM); Staring at hugetlb_reserve_pages() I assume this will also work as expected if already reserved. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH] mm/memfd: reserve hugetlb folios before allocation 2025-01-07 9:36 ` David Hildenbrand @ 2025-01-08 6:59 ` Kasireddy, Vivek 0 siblings, 0 replies; 8+ messages in thread From: Kasireddy, Vivek @ 2025-01-08 6:59 UTC (permalink / raw) To: David Hildenbrand, linux-mm Cc: syzbot+a504cb5bae4fe117ba94, Steve Sistare, Muchun Song, Andrew Morton Hi David, > > There are cases when we try to pin a folio but discover that it has > > not been faulted-in. So, we try to allocate it in memfd_alloc_folio() > > but there is a chance that we might encounter a crash/failure > > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations > > at that instant. This issue was reported by syzbot: > > > > kernel BUG at mm/hugetlb.c:2403! > > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI > > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted > > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 > > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 > > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 > > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f > > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 > > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 > > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed > > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 > > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 > > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 > > FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) > > knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > Call Trace: > > <TASK> > > memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 > > memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 > > udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] > > udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 > > udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] > > udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 > > vfs_ioctl fs/ioctl.c:51 [inline] > > __do_sys_ioctl fs/ioctl.c:906 [inline] > > __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > Therefore, to avoid this situation and fix this issue, we just need > > to make a reservation before we try to allocate the folio. While at > > it, also remove the VM_BUG_ON() as there is no need to crash the > > system in this scenario and instead we could just fail the allocation. > > > > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages > leak") > > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > > Cc: Steve Sistare <steven.sistare@oracle.com> > > Cc: Muchun Song <muchun.song@linux.dev> > > Cc: David Hildenbrand <david@redhat.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > --- > > mm/hugetlb.c | 9 ++++++--- > > mm/memfd.c | 5 +++++ > > 2 files changed, 11 insertions(+), 3 deletions(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index c498874a7170..e46c461210a4 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct > hstate *h, int preferred_nid, > > struct folio *folio; > > > > spin_lock_irq(&hugetlb_lock); > > + if (!h->resv_huge_pages) { > > Should this be a "if (WARN_ON_ONCE(!h->resv_huge_pages)) {", because the > "_reserve" in the function indicates that this is indeed something that > must never happen? Yes, adding a WARN_ON_ONCE makes sense in this case. I'll add it in v2. > > > + spin_unlock_irq(&hugetlb_lock); > > + return NULL; > > + } > > + > > folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, > preferred_nid, > > nmask); > > - if (folio) { > > - VM_BUG_ON(!h->resv_huge_pages); > > + if (folio) > > h->resv_huge_pages--; > > - } > > > > spin_unlock_irq(&hugetlb_lock); > > return folio; > > diff --git a/mm/memfd.c b/mm/memfd.c > > index 35a370d75c9a..a3012c444285 100644 > > --- a/mm/memfd.c > > +++ b/mm/memfd.c > > @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, > pgoff_t idx) > > gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); > > idx >>= huge_page_order(h); > > > > + if (!hugetlb_reserve_pages(file_inode(memfd), > > + idx, idx + 1, NULL, 0)) > > + return ERR_PTR(-ENOMEM); > > + > > folio = alloc_hugetlb_folio_reserve(h, > > numa_node_id(), > > NULL, > > @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, > pgoff_t idx) > > folio_unlock(folio); > > return folio; > > } > > + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1); > > return ERR_PTR(-ENOMEM); > > Staring at hugetlb_reserve_pages() I assume this will also work as > expected if already reserved. IIUC, if a range is already reserved, and when hugetlb_reserve_pages() gets called for the same range, it would mostly become a no-op as region_chg() and region_add() return 0 (based on my light testing). Thanks, Vivek > > > -- > Cheers, > > David / dhildenb ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation 2025-01-07 7:25 [PATCH] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy 2025-01-07 9:36 ` David Hildenbrand @ 2025-01-07 17:12 ` Steven Sistare 2025-01-08 7:24 ` Kasireddy, Vivek 1 sibling, 1 reply; 8+ messages in thread From: Steven Sistare @ 2025-01-07 17:12 UTC (permalink / raw) To: Vivek Kasireddy, linux-mm Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand, Andrew Morton On 1/7/2025 2:25 AM, Vivek Kasireddy wrote: > There are cases when we try to pin a folio but discover that it has > not been faulted-in. So, we try to allocate it in memfd_alloc_folio() > but there is a chance that we might encounter a crash/failure > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations > at that instant. This issue was reported by syzbot: > > kernel BUG at mm/hugetlb.c:2403! > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 > FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 > memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 > udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] > udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 > udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] > udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 > vfs_ioctl fs/ioctl.c:51 [inline] > __do_sys_ioctl fs/ioctl.c:906 [inline] > __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Therefore, to avoid this situation and fix this issue, we just need > to make a reservation before we try to allocate the folio. While at > it, also remove the VM_BUG_ON() as there is no need to crash the > system in this scenario and instead we could just fail the allocation. > > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak") > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > Cc: Steve Sistare <steven.sistare@oracle.com> > Cc: Muchun Song <muchun.song@linux.dev> > Cc: David Hildenbrand <david@redhat.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > mm/hugetlb.c | 9 ++++++--- > mm/memfd.c | 5 +++++ > 2 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index c498874a7170..e46c461210a4 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid, > struct folio *folio; > > spin_lock_irq(&hugetlb_lock); > + if (!h->resv_huge_pages) { > + spin_unlock_irq(&hugetlb_lock); > + return NULL; > + } This should be the entire fix, plus deleting the VM_BUG_ON. See below. > + > folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid, > nmask); > - if (folio) { > - VM_BUG_ON(!h->resv_huge_pages); > + if (folio) > h->resv_huge_pages--; > - } > > spin_unlock_irq(&hugetlb_lock); > return folio; > diff --git a/mm/memfd.c b/mm/memfd.c > index 35a370d75c9a..a3012c444285 100644 > --- a/mm/memfd.c > +++ b/mm/memfd.c > @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); > idx >>= huge_page_order(h); > > + if (!hugetlb_reserve_pages(file_inode(memfd), > + idx, idx + 1, NULL, 0)) > + return ERR_PTR(-ENOMEM); I believe it is wrong to force a reservation here. Pages should have already been reserved at this point, eg by calls from hugetlbfs_file_mmap or hugetlb_file_setup. syzcaller has forced its way down this path without calling those pre-requisites, doing weird stuff as it should. To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above. - Steve > + > folio = alloc_hugetlb_folio_reserve(h, > numa_node_id(), > NULL, > @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > folio_unlock(folio); > return folio; > } > + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1); > return ERR_PTR(-ENOMEM); > } > #endif ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH] mm/memfd: reserve hugetlb folios before allocation 2025-01-07 17:12 ` Steven Sistare @ 2025-01-08 7:24 ` Kasireddy, Vivek 2025-01-09 19:01 ` Steven Sistare 0 siblings, 1 reply; 8+ messages in thread From: Kasireddy, Vivek @ 2025-01-08 7:24 UTC (permalink / raw) To: Steven Sistare, linux-mm Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand, Andrew Morton Hi Steve, > > There are cases when we try to pin a folio but discover that it has > > not been faulted-in. So, we try to allocate it in memfd_alloc_folio() > > but there is a chance that we might encounter a crash/failure > > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations > > at that instant. This issue was reported by syzbot: > > > > kernel BUG at mm/hugetlb.c:2403! > > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI > > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted > > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 > > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 > > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 > > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f > > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 > > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 > > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed > > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 > > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 > > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 > > FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) > > knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > Call Trace: > > <TASK> > > memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 > > memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 > > udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] > > udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 > > udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] > > udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 > > vfs_ioctl fs/ioctl.c:51 [inline] > > __do_sys_ioctl fs/ioctl.c:906 [inline] > > __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > Therefore, to avoid this situation and fix this issue, we just need > > to make a reservation before we try to allocate the folio. While at > > it, also remove the VM_BUG_ON() as there is no need to crash the > > system in this scenario and instead we could just fail the allocation. > > > > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages > leak") > > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > > Cc: Steve Sistare <steven.sistare@oracle.com> > > Cc: Muchun Song <muchun.song@linux.dev> > > Cc: David Hildenbrand <david@redhat.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > --- > > mm/hugetlb.c | 9 ++++++--- > > mm/memfd.c | 5 +++++ > > 2 files changed, 11 insertions(+), 3 deletions(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index c498874a7170..e46c461210a4 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct > hstate *h, int preferred_nid, > > struct folio *folio; > > > > spin_lock_irq(&hugetlb_lock); > > + if (!h->resv_huge_pages) { > > + spin_unlock_irq(&hugetlb_lock); > > + return NULL; > > + } > > This should be the entire fix, plus deleting the VM_BUG_ON. See below. > > > + > > folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, > preferred_nid, > > nmask); > > - if (folio) { > > - VM_BUG_ON(!h->resv_huge_pages); > > + if (folio) > > h->resv_huge_pages--; > > - } > > > > spin_unlock_irq(&hugetlb_lock); > > return folio; > > diff --git a/mm/memfd.c b/mm/memfd.c > > index 35a370d75c9a..a3012c444285 100644 > > --- a/mm/memfd.c > > +++ b/mm/memfd.c > > @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, > pgoff_t idx) > > gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); > > idx >>= huge_page_order(h); > > > > + if (!hugetlb_reserve_pages(file_inode(memfd), > > + idx, idx + 1, NULL, 0)) > > + return ERR_PTR(-ENOMEM); > > I believe it is wrong to force a reservation here. Is there any particular reason why you believe a reservation here would be wrong? AFAICS, at the moment, we are not doing any region/subpool accounting before our folio allocation and this gets flagged in the form of elevated resv_huge_pages value (hugetlb_acct_memory() elevates it based on the return value of region_del()) when hugetlb_unreserve_pages() eventually gets called. > Pages should have already been > reserved at this point, eg by calls from hugetlbfs_file_mmap or hugetlb_file_setup. hugetlb_file_setup() does not reserve any pages as it passes in VM_NORESERVE. And, the case we are trying to address is exactly when hugetlbfs_file_mmap() does not get called before pinning. So, when hugetlbfs_file_mmap() does eventually get called, I don't see any problem if it calls hugetlb_reserve_pages() again for the same range or overlapping ranges. > syzcaller has forced its way down this path without calling those pre-requisites, > doing weird stuff as it should. This issue is not very hard to reproduce. If we have free_huge_pages > 0 and resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/ hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we should be able to allocate a folio in this scenario (as available_huge_pages > 0), which we would not be able to do if we don't call hugetlb_reserve_pages(). Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in this case and kind of gives a go-ahead for the allocation. I have used a slightly modified udmabuf selftest to reproduce this issue which I'll send out as part of v2. Thanks, Vivek > > To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above. > > - Steve > > > + > > folio = alloc_hugetlb_folio_reserve(h, > > numa_node_id(), > > NULL, > > @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, > pgoff_t idx) > > folio_unlock(folio); > > return folio; > > } > > + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1); > > return ERR_PTR(-ENOMEM); > > } > > #endif ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation 2025-01-08 7:24 ` Kasireddy, Vivek @ 2025-01-09 19:01 ` Steven Sistare 2025-01-10 6:17 ` Kasireddy, Vivek 0 siblings, 1 reply; 8+ messages in thread From: Steven Sistare @ 2025-01-09 19:01 UTC (permalink / raw) To: Kasireddy, Vivek, linux-mm Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand, Andrew Morton On 1/8/2025 2:24 AM, Kasireddy, Vivek wrote: > Hi Steve, > >>> There are cases when we try to pin a folio but discover that it has >>> not been faulted-in. So, we try to allocate it in memfd_alloc_folio() >>> but there is a chance that we might encounter a crash/failure >>> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations >>> at that instant. This issue was reported by syzbot: >>> >>> kernel BUG at mm/hugetlb.c:2403! >>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI >>> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted >>> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 >>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS >>> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 >>> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 >>> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 >>> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 >>> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f >>> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 >>> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 >>> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed >>> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 >>> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 >>> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 >>> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) >>> knlGS:0000000000000000 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> Call Trace: >>> <TASK> >>> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 >>> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 >>> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] >>> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 >>> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] >>> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 >>> vfs_ioctl fs/ioctl.c:51 [inline] >>> __do_sys_ioctl fs/ioctl.c:906 [inline] >>> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 >>> do_syscall_x64 arch/x86/entry/common.c:52 [inline] >>> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 >>> entry_SYSCALL_64_after_hwframe+0x77/0x7f >>> >>> Therefore, to avoid this situation and fix this issue, we just need >>> to make a reservation before we try to allocate the folio. While at >>> it, also remove the VM_BUG_ON() as there is no need to crash the >>> system in this scenario and instead we could just fail the allocation. >>> >>> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages >> leak") >>> Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com >>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> >>> Cc: Steve Sistare <steven.sistare@oracle.com> >>> Cc: Muchun Song <muchun.song@linux.dev> >>> Cc: David Hildenbrand <david@redhat.com> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> --- >>> mm/hugetlb.c | 9 ++++++--- >>> mm/memfd.c | 5 +++++ >>> 2 files changed, 11 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>> index c498874a7170..e46c461210a4 100644 >>> --- a/mm/hugetlb.c >>> +++ b/mm/hugetlb.c >>> @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct >> hstate *h, int preferred_nid, >>> struct folio *folio; >>> >>> spin_lock_irq(&hugetlb_lock); >>> + if (!h->resv_huge_pages) { >>> + spin_unlock_irq(&hugetlb_lock); >>> + return NULL; >>> + } >> >> This should be the entire fix, plus deleting the VM_BUG_ON. See below. >> >>> + >>> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, >> preferred_nid, >>> nmask); >>> - if (folio) { >>> - VM_BUG_ON(!h->resv_huge_pages); >>> + if (folio) >>> h->resv_huge_pages--; >>> - } >>> >>> spin_unlock_irq(&hugetlb_lock); >>> return folio; >>> diff --git a/mm/memfd.c b/mm/memfd.c >>> index 35a370d75c9a..a3012c444285 100644 >>> --- a/mm/memfd.c >>> +++ b/mm/memfd.c >>> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, >> pgoff_t idx) >>> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); >>> idx >>= huge_page_order(h); >>> >>> + if (!hugetlb_reserve_pages(file_inode(memfd), >>> + idx, idx + 1, NULL, 0)) >>> + return ERR_PTR(-ENOMEM); >> >> I believe it is wrong to force a reservation here. > Is there any particular reason why you believe a reservation here would be wrong? > AFAICS, at the moment, we are not doing any region/subpool accounting before > our folio allocation and this gets flagged in the form of elevated resv_huge_pages > value (hugetlb_acct_memory() elevates it based on the return value of region_del()) > when hugetlb_unreserve_pages() eventually gets called. > >> Pages should have already been >> reserved at this point, eg by calls from hugetlbfs_file_mmap or hugetlb_file_setup. > hugetlb_file_setup() does not reserve any pages as it passes in VM_NORESERVE. > And, the case we are trying to address is exactly when hugetlbfs_file_mmap() does > not get called before pinning. But you must not break the case where hugetlbfs_file_mmap was called first, which reserves, then memfd_alloc_folio is called, which reserves again with your fix. Does that work correctly, or do the counts go bad? > So, when hugetlbfs_file_mmap() does eventually > get called, I don't see any problem if it calls hugetlb_reserve_pages() again for the > same range or overlapping ranges. Does that work correctly, or do the counts go bad? Please try those scenarios with your test program: mmap + memfd_alloc_folio, and memfd_alloc_folio + mmap. - Steve >> syzcaller has forced its way down this path without calling those pre-requisites, >> doing weird stuff as it should. > This issue is not very hard to reproduce. If we have free_huge_pages > 0 and > resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/ > hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we > should be able to allocate a folio in this scenario (as available_huge_pages > 0), > which we would not be able to do if we don't call hugetlb_reserve_pages(). > Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in > this case and kind of gives a go-ahead for the allocation. > > I have used a slightly modified udmabuf selftest to reproduce this issue which > I'll send out as part of v2. > > Thanks, > Vivek > >> >> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above. >> >> - Steve >> >>> + >>> folio = alloc_hugetlb_folio_reserve(h, >>> numa_node_id(), >>> NULL, >>> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, >> pgoff_t idx) >>> folio_unlock(folio); >>> return folio; >>> } >>> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1); >>> return ERR_PTR(-ENOMEM); >>> } >>> #endif > ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH] mm/memfd: reserve hugetlb folios before allocation 2025-01-09 19:01 ` Steven Sistare @ 2025-01-10 6:17 ` Kasireddy, Vivek 2025-01-10 15:22 ` Steven Sistare 0 siblings, 1 reply; 8+ messages in thread From: Kasireddy, Vivek @ 2025-01-10 6:17 UTC (permalink / raw) To: Steven Sistare, linux-mm Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand, Andrew Morton Hi Steve, > Subject: Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation > > >>> There are cases when we try to pin a folio but discover that it has > >>> not been faulted-in. So, we try to allocate it in memfd_alloc_folio() > >>> but there is a chance that we might encounter a crash/failure > >>> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations > >>> at that instant. This issue was reported by syzbot: > >>> > >>> kernel BUG at mm/hugetlb.c:2403! > >>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI > >>> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted > >>> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 > >>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > >>> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > >>> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 > >>> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 > >>> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 > >>> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f > >>> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 > >>> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 > >>> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed > >>> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 > >>> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 > >>> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 > >>> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) > >>> knlGS:0000000000000000 > >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: > 0000000000352ef0 > >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > >>> Call Trace: > >>> <TASK> > >>> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 > >>> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 > >>> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] > >>> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 > >>> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] > >>> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 > >>> vfs_ioctl fs/ioctl.c:51 [inline] > >>> __do_sys_ioctl fs/ioctl.c:906 [inline] > >>> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 > >>> do_syscall_x64 arch/x86/entry/common.c:52 [inline] > >>> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > >>> entry_SYSCALL_64_after_hwframe+0x77/0x7f > >>> > >>> Therefore, to avoid this situation and fix this issue, we just need > >>> to make a reservation before we try to allocate the folio. While at > >>> it, also remove the VM_BUG_ON() as there is no need to crash the > >>> system in this scenario and instead we could just fail the allocation. > >>> > >>> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios > resv_huge_pages > >> leak") > >>> Reported-by: > syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com > >>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > >>> Cc: Steve Sistare <steven.sistare@oracle.com> > >>> Cc: Muchun Song <muchun.song@linux.dev> > >>> Cc: David Hildenbrand <david@redhat.com> > >>> Cc: Andrew Morton <akpm@linux-foundation.org> > >>> --- > >>> mm/hugetlb.c | 9 ++++++--- > >>> mm/memfd.c | 5 +++++ > >>> 2 files changed, 11 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >>> index c498874a7170..e46c461210a4 100644 > >>> --- a/mm/hugetlb.c > >>> +++ b/mm/hugetlb.c > >>> @@ -2397,12 +2397,15 @@ struct folio > *alloc_hugetlb_folio_reserve(struct > >> hstate *h, int preferred_nid, > >>> struct folio *folio; > >>> > >>> spin_lock_irq(&hugetlb_lock); > >>> + if (!h->resv_huge_pages) { > >>> + spin_unlock_irq(&hugetlb_lock); > >>> + return NULL; > >>> + } > >> > >> This should be the entire fix, plus deleting the VM_BUG_ON. See below. > >> > >>> + > >>> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, > >> preferred_nid, > >>> nmask); > >>> - if (folio) { > >>> - VM_BUG_ON(!h->resv_huge_pages); > >>> + if (folio) > >>> h->resv_huge_pages--; > >>> - } > >>> > >>> spin_unlock_irq(&hugetlb_lock); > >>> return folio; > >>> diff --git a/mm/memfd.c b/mm/memfd.c > >>> index 35a370d75c9a..a3012c444285 100644 > >>> --- a/mm/memfd.c > >>> +++ b/mm/memfd.c > >>> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file > *memfd, > >> pgoff_t idx) > >>> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); > >>> idx >>= huge_page_order(h); > >>> > >>> + if (!hugetlb_reserve_pages(file_inode(memfd), > >>> + idx, idx + 1, NULL, 0)) > >>> + return ERR_PTR(-ENOMEM); > >> > >> I believe it is wrong to force a reservation here. > > Is there any particular reason why you believe a reservation here would be > wrong? > > AFAICS, at the moment, we are not doing any region/subpool accounting > before > > our folio allocation and this gets flagged in the form of elevated > resv_huge_pages > > value (hugetlb_acct_memory() elevates it based on the return value of > region_del()) > > when hugetlb_unreserve_pages() eventually gets called. > > > >> Pages should have already been > >> reserved at this point, eg by calls from hugetlbfs_file_mmap or > hugetlb_file_setup. > > hugetlb_file_setup() does not reserve any pages as it passes in > VM_NORESERVE. > > And, the case we are trying to address is exactly when > hugetlbfs_file_mmap() does > > not get called before pinning. > > But you must not break the case where hugetlbfs_file_mmap was called first, which AFAICS, this patch does not break the primary use-case where hugetlbfs_file_mmap() gets called first. > reserves, then memfd_alloc_folio is called, which reserves again with your fix. Does > that work correctly, or do the counts go bad? Yes it works correctly and ` cat /proc/meminfo|grep Huge` does not show any unexpected counts. > > > So, when hugetlbfs_file_mmap() does eventually > > get called, I don't see any problem if it calls hugetlb_reserve_pages() again > for the > > same range or overlapping ranges. > > Does that work correctly, or do the counts go bad? It works correctly as expected. And, as mentioned to David in another email, if a range is already reserved, and when hugetlb_reserve_pages() gets called again for the same range (in this use-case), it would mostly become a no-op as region_chg() and region_add() return 0 (based on my light testing). > > Please try those scenarios with your test program: mmap + > memfd_alloc_folio, and > memfd_alloc_folio + mmap. Here are the scenarios I tried (with the calls in the exact order as below): 1) hugetlbfs_file_mmap() memfd_alloc_folio() hugetlb_fault() 2) memfd_alloc_folio() hugetlbfs_file_mmap() hugetlb_fault() 3) hugetlbfs_file_mmap() hugetlb_fault() alloc_hugetlb_folio() memfd_alloc_folio() does not get called in the last case as the folio is already allocated (and is present in the page cache). While testing all these cases, I did not notice any splats or messed-up counts. If it is not too much trouble, could you please try this patch with your test-cases? Thanks, Vivek > > - Steve > > >> syzcaller has forced its way down this path without calling those pre- > requisites, > >> doing weird stuff as it should. > > This issue is not very hard to reproduce. If we have free_huge_pages > 0 > and > > resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/ > > hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we > > should be able to allocate a folio in this scenario (as available_huge_pages > > 0), > > which we would not be able to do if we don't call hugetlb_reserve_pages(). > > Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in > > this case and kind of gives a go-ahead for the allocation. > > > > I have used a slightly modified udmabuf selftest to reproduce this issue > which > > I'll send out as part of v2. > > > > Thanks, > > Vivek > > > >> > >> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above. > >> > >> - Steve > >> > >>> + > >>> folio = alloc_hugetlb_folio_reserve(h, > >>> numa_node_id(), > >>> NULL, > >>> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file > *memfd, > >> pgoff_t idx) > >>> folio_unlock(folio); > >>> return folio; > >>> } > >>> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1); > >>> return ERR_PTR(-ENOMEM); > >>> } > >>> #endif > > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation 2025-01-10 6:17 ` Kasireddy, Vivek @ 2025-01-10 15:22 ` Steven Sistare 0 siblings, 0 replies; 8+ messages in thread From: Steven Sistare @ 2025-01-10 15:22 UTC (permalink / raw) To: Kasireddy, Vivek, linux-mm Cc: syzbot+a504cb5bae4fe117ba94, Muchun Song, David Hildenbrand, Andrew Morton On 1/10/2025 1:17 AM, Kasireddy, Vivek wrote: > Hi Steve, > >> Subject: Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation >> >>>>> There are cases when we try to pin a folio but discover that it has >>>>> not been faulted-in. So, we try to allocate it in memfd_alloc_folio() >>>>> but there is a chance that we might encounter a crash/failure >>>>> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations >>>>> at that instant. This issue was reported by syzbot: >>>>> >>>>> kernel BUG at mm/hugetlb.c:2403! >>>>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI >>>>> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted >>>>> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 >>>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS >>>>> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 >>>>> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 >>>>> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 >>>>> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 >>>>> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f >>>>> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 >>>>> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 >>>>> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed >>>>> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 >>>>> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 >>>>> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 >>>>> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) >>>>> knlGS:0000000000000000 >>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: >> 0000000000352ef0 >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: >> 0000000000000000 >>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>>>> Call Trace: >>>>> <TASK> >>>>> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 >>>>> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 >>>>> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] >>>>> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 >>>>> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] >>>>> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 >>>>> vfs_ioctl fs/ioctl.c:51 [inline] >>>>> __do_sys_ioctl fs/ioctl.c:906 [inline] >>>>> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 >>>>> do_syscall_x64 arch/x86/entry/common.c:52 [inline] >>>>> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 >>>>> entry_SYSCALL_64_after_hwframe+0x77/0x7f >>>>> >>>>> Therefore, to avoid this situation and fix this issue, we just need >>>>> to make a reservation before we try to allocate the folio. While at >>>>> it, also remove the VM_BUG_ON() as there is no need to crash the >>>>> system in this scenario and instead we could just fail the allocation. >>>>> >>>>> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios >> resv_huge_pages >>>> leak") >>>>> Reported-by: >> syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com >>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> >>>>> Cc: Steve Sistare <steven.sistare@oracle.com> >>>>> Cc: Muchun Song <muchun.song@linux.dev> >>>>> Cc: David Hildenbrand <david@redhat.com> >>>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>>> --- >>>>> mm/hugetlb.c | 9 ++++++--- >>>>> mm/memfd.c | 5 +++++ >>>>> 2 files changed, 11 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>>>> index c498874a7170..e46c461210a4 100644 >>>>> --- a/mm/hugetlb.c >>>>> +++ b/mm/hugetlb.c >>>>> @@ -2397,12 +2397,15 @@ struct folio >> *alloc_hugetlb_folio_reserve(struct >>>> hstate *h, int preferred_nid, >>>>> struct folio *folio; >>>>> >>>>> spin_lock_irq(&hugetlb_lock); >>>>> + if (!h->resv_huge_pages) { >>>>> + spin_unlock_irq(&hugetlb_lock); >>>>> + return NULL; >>>>> + } >>>> >>>> This should be the entire fix, plus deleting the VM_BUG_ON. See below. >>>> >>>>> + >>>>> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, >>>> preferred_nid, >>>>> nmask); >>>>> - if (folio) { >>>>> - VM_BUG_ON(!h->resv_huge_pages); >>>>> + if (folio) >>>>> h->resv_huge_pages--; >>>>> - } >>>>> >>>>> spin_unlock_irq(&hugetlb_lock); >>>>> return folio; >>>>> diff --git a/mm/memfd.c b/mm/memfd.c >>>>> index 35a370d75c9a..a3012c444285 100644 >>>>> --- a/mm/memfd.c >>>>> +++ b/mm/memfd.c >>>>> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file >> *memfd, >>>> pgoff_t idx) >>>>> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); >>>>> idx >>= huge_page_order(h); >>>>> >>>>> + if (!hugetlb_reserve_pages(file_inode(memfd), >>>>> + idx, idx + 1, NULL, 0)) >>>>> + return ERR_PTR(-ENOMEM); >>>> >>>> I believe it is wrong to force a reservation here. >>> Is there any particular reason why you believe a reservation here would be >> wrong? >>> AFAICS, at the moment, we are not doing any region/subpool accounting >> before >>> our folio allocation and this gets flagged in the form of elevated >> resv_huge_pages >>> value (hugetlb_acct_memory() elevates it based on the return value of >> region_del()) >>> when hugetlb_unreserve_pages() eventually gets called. >>> >>>> Pages should have already been >>>> reserved at this point, eg by calls from hugetlbfs_file_mmap or >> hugetlb_file_setup. >>> hugetlb_file_setup() does not reserve any pages as it passes in >> VM_NORESERVE. >>> And, the case we are trying to address is exactly when >> hugetlbfs_file_mmap() does >>> not get called before pinning. >> >> But you must not break the case where hugetlbfs_file_mmap was called first, which > AFAICS, this patch does not break the primary use-case where hugetlbfs_file_mmap() > gets called first. > >> reserves, then memfd_alloc_folio is called, which reserves again with your fix. Does >> that work correctly, or do the counts go bad? > Yes it works correctly and ` cat /proc/meminfo|grep Huge` does not show any > unexpected counts. > >> >>> So, when hugetlbfs_file_mmap() does eventually >>> get called, I don't see any problem if it calls hugetlb_reserve_pages() again >> for the >>> same range or overlapping ranges. >> >> Does that work correctly, or do the counts go bad? > It works correctly as expected. And, as mentioned to David in another email, > if a range is already reserved, and when hugetlb_reserve_pages() gets called > again for the same range (in this use-case), it would mostly become a no-op > as region_chg() and region_add() return 0 (based on my light testing). Great, thanks. >> Please try those scenarios with your test program: mmap + >> memfd_alloc_folio, and >> memfd_alloc_folio + mmap. > Here are the scenarios I tried (with the calls in the exact order as below): > 1) hugetlbfs_file_mmap() > memfd_alloc_folio() > hugetlb_fault() > > 2) memfd_alloc_folio() > hugetlbfs_file_mmap() > hugetlb_fault() > > 3) hugetlbfs_file_mmap() > hugetlb_fault() > alloc_hugetlb_folio() > > memfd_alloc_folio() does not get called in the last case as the folio is already > allocated (and is present in the page cache). While testing all these cases, I > did not notice any splats or messed-up counts. > > If it is not too much trouble, could you please try this patch with your test-cases? Done, and all tests pass. I will review your V2 patch. Thanks for working on this bug. - Steve >>>> syzcaller has forced its way down this path without calling those pre- >> requisites, >>>> doing weird stuff as it should. >>> This issue is not very hard to reproduce. If we have free_huge_pages > 0 >> and >>> resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/ >>> hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we >>> should be able to allocate a folio in this scenario (as available_huge_pages >>> 0), >>> which we would not be able to do if we don't call hugetlb_reserve_pages(). >>> Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in >>> this case and kind of gives a go-ahead for the allocation. >>> >>> I have used a slightly modified udmabuf selftest to reproduce this issue >> which >>> I'll send out as part of v2. >>> >>> Thanks, >>> Vivek >>> >>>> >>>> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above. >>>> >>>> - Steve >>>> >>>>> + >>>>> folio = alloc_hugetlb_folio_reserve(h, >>>>> numa_node_id(), >>>>> NULL, >>>>> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file >> *memfd, >>>> pgoff_t idx) >>>>> folio_unlock(folio); >>>>> return folio; >>>>> } >>>>> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1); >>>>> return ERR_PTR(-ENOMEM); >>>>> } >>>>> #endif >>> > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-01-10 15:23 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-01-07 7:25 [PATCH] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy 2025-01-07 9:36 ` David Hildenbrand 2025-01-08 6:59 ` Kasireddy, Vivek 2025-01-07 17:12 ` Steven Sistare 2025-01-08 7:24 ` Kasireddy, Vivek 2025-01-09 19:01 ` Steven Sistare 2025-01-10 6:17 ` Kasireddy, Vivek 2025-01-10 15:22 ` Steven Sistare
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox