* [PATCH v2 0/2] mm/memfd: reserve hugetlb folios before allocation
@ 2025-01-14  8:07 Vivek Kasireddy
  2025-01-14  8:08 ` [PATCH v2 1/2] " Vivek Kasireddy
  2025-01-14  8:08 ` [PATCH v2 2/2] selftests/udmabuf: add a test to pin first before writing to memfd Vivek Kasireddy
  0 siblings, 2 replies; 7+ messages in thread
From: Vivek Kasireddy @ 2025-01-14  8:07 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, Gerd Hoffmann, Steve Sistare, Muchun Song,
      David Hildenbrand, Andrew Morton

There are cases when we try to pin a folio but discover that it has
not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
but there is a chance that we might encounter a crash/failure
(VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
at that instant. This issue was reported by syzbot.

Therefore, to avoid this situation and fix this issue, we just need
to make a reservation (by calling hugetlb_reserve_pages()) before
we try to allocate the folio. This will ensure that we are properly
doing region/subpool accounting associated with our allocation.

-----------------------------

Patchset overview:
Patch 1: Fix for VM_BUG_ON(!h->resv_huge_pages) crash reported by syzbot
Patch 2: New udmabuf selftest to invoke memfd_alloc_folio()

This series is tested by running the new udmabuf selftest introduced
in patch #2 along with the other selftests.
Changelog:

v1 -> v2:
- Replace VM_BUG_ON() with WARN_ON_ONCE() in the function
  alloc_hugetlb_folio_reserve() (David)
- Move the inline function subpool_inode() from hugetlb.c into the
  relevant header (hugetlb.h)
- Call hugetlb_unreserve_pages() if the folio cannot be added to the
  page cache as well
- Added a new udmabuf selftest to exercise the same path as that of
  syzbot

Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>

Vivek Kasireddy (2):
  mm/memfd: reserve hugetlb folios before allocation
  selftests/udmabuf: add a test to pin first before writing to memfd

 include/linux/hugetlb.h                       |  5 +++++
 mm/hugetlb.c                                  | 14 ++++++-------
 mm/memfd.c                                    | 14 ++++++++++---
 .../selftests/drivers/dma-buf/udmabuf.c       | 20 ++++++++++++++++++-
 4 files changed, 41 insertions(+), 12 deletions(-)

-- 
2.47.1

^ permalink raw reply	[flat|nested] 7+ messages in thread
* [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
  2025-01-14  8:07 [PATCH v2 0/2] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy
@ 2025-01-14  8:08 ` Vivek Kasireddy
  2025-01-15  3:32   ` Andrew Morton
  2025-01-17 16:54   ` David Hildenbrand
  2025-01-14  8:08 ` [PATCH v2 2/2] selftests/udmabuf: add a test to pin first before writing to memfd Vivek Kasireddy
  1 sibling, 2 replies; 7+ messages in thread
From: Vivek Kasireddy @ 2025-01-14  8:08 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, syzbot+a504cb5bae4fe117ba94, Steve Sistare,
      Muchun Song, David Hildenbrand, Andrew Morton

There are cases when we try to pin a folio but discover that it has
not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
but there is a chance that we might encounter a crash/failure
(VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
at that instant. This issue was reported by syzbot:

kernel BUG at mm/hugetlb.c:2403!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
 memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
 udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
 udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
 udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
 udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:906 [inline]
 __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Therefore, to avoid this situation and fix this issue, we just need
to make a reservation (by calling hugetlb_reserve_pages()) before
we try to allocate the folio. This will ensure that we are properly
doing region/subpool accounting associated with our allocation.

While at it, move subpool_inode() into hugetlb header and also
replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
crash the system in this scenario and instead we could just warn
and fail the allocation.
Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/hugetlb.h |  5 +++++
 mm/hugetlb.c            | 14 ++++++--------
 mm/memfd.c              | 14 +++++++++++---
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ae4fe8615bb6..38c580548564 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -712,6 +712,11 @@ extern unsigned int default_hstate_idx;
 
 #define default_hstate (hstates[default_hstate_idx])
 
+static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
+{
+	return HUGETLBFS_SB(inode->i_sb)->spool;
+}
+
 static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio)
 {
 	return folio->_hugetlb_subpool;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c498874a7170..ef948f56b864 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -251,11 +251,6 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 	return ret;
 }
 
-static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
-{
-	return HUGETLBFS_SB(inode->i_sb)->spool;
-}
-
 static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
 {
 	return subpool_inode(file_inode(vma->vm_file));
@@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
 	struct folio *folio;
 
 	spin_lock_irq(&hugetlb_lock);
+	if (WARN_ON_ONCE(!h->resv_huge_pages)) {
+		spin_unlock_irq(&hugetlb_lock);
+		return NULL;
+	}
+
 	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
 					       nmask);
-	if (folio) {
-		VM_BUG_ON(!h->resv_huge_pages);
+	if (folio)
 		h->resv_huge_pages--;
-	}
 
 	spin_unlock_irq(&hugetlb_lock);
 	return folio;
diff --git a/mm/memfd.c b/mm/memfd.c
index 35a370d75c9a..0d128c44fb78 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -70,7 +70,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 #ifdef CONFIG_HUGETLB_PAGE
 	struct folio *folio;
 	gfp_t gfp_mask;
-	int err;
+	int err = -ENOMEM;
 
 	if (is_file_hugepages(memfd)) {
 		/*
@@ -79,12 +79,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 		 * alloc from. Also, the folio will be pinned for an indefinite
 		 * amount of time, so it is not expected to be migrated away.
 		 */
+		struct inode *inode = file_inode(memfd);
 		struct hstate *h = hstate_file(memfd);
 
 		gfp_mask = htlb_alloc_mask(h);
 		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
 		idx >>= huge_page_order(h);
 
+		if (!hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0))
+			return ERR_PTR(err);
+
 		folio = alloc_hugetlb_folio_reserve(h,
 						    numa_node_id(),
 						    NULL,
@@ -95,12 +99,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 						    idx);
 		if (err) {
 			folio_put(folio);
-			return ERR_PTR(err);
+			goto err;
 		}
+
+		hugetlb_set_folio_subpool(folio, subpool_inode(inode));
 		folio_unlock(folio);
 		return folio;
 	}
-	return ERR_PTR(-ENOMEM);
+err:
+	hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
+	return ERR_PTR(err);
 }
 #endif
 	return shmem_read_folio(memfd->f_mapping, idx);
-- 
2.47.1
* Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
  2025-01-14  8:08 ` [PATCH v2 1/2] " Vivek Kasireddy
@ 2025-01-15  3:32   ` Andrew Morton
  2025-01-16  5:58     ` Kasireddy, Vivek
  2025-01-17 16:54   ` David Hildenbrand
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2025-01-15  3:32 UTC (permalink / raw)
  To: Vivek Kasireddy
  Cc: dri-devel, linux-mm, syzbot+a504cb5bae4fe117ba94, Steve Sistare,
      Muchun Song, David Hildenbrand

On Tue, 14 Jan 2025 00:08:00 -0800 Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:

> There are cases when we try to pin a folio but discover that it has
> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> but there is a chance that we might encounter a crash/failure
> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> at that instant. This issue was reported by syzbot:
> 
> kernel BUG at mm/hugetlb.c:2403!
> 
> ...
> 
> Therefore, to avoid this situation and fix this issue, we just need
> to make a reservation (by calling hugetlb_reserve_pages()) before
> we try to allocate the folio. This will ensure that we are properly
> doing region/subpool accounting associated with our allocation.
> 
> While at it, move subpool_inode() into hugetlb header and also
> replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
> crash the system in this scenario and instead we could just warn
> and fail the allocation.
> 
> ...
> 
> @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>  	struct folio *folio;
>  
>  	spin_lock_irq(&hugetlb_lock);
> +	if (WARN_ON_ONCE(!h->resv_huge_pages)) {
> +		spin_unlock_irq(&hugetlb_lock);
> +		return NULL;
> +	}
> +

What is it that we're warning of here?  Is there any action which
either kernel developers or the user can take to prevent this warning
from being issued?

IOW, maybe the WARN shouldn't be present?
* RE: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
  2025-01-15  3:32   ` Andrew Morton
@ 2025-01-16  5:58     ` Kasireddy, Vivek
  0 siblings, 0 replies; 7+ messages in thread
From: Kasireddy, Vivek @ 2025-01-16  5:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dri-devel, linux-mm, syzbot+a504cb5bae4fe117ba94, Steve Sistare,
      Muchun Song, David Hildenbrand

Hi Andrew,

> Subject: Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before
> allocation
> 
> > There are cases when we try to pin a folio but discover that it has
> > not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> > but there is a chance that we might encounter a crash/failure
> > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> > at that instant. This issue was reported by syzbot:
> >
> > kernel BUG at mm/hugetlb.c:2403!
> >
> > ...
> >
> > Therefore, to avoid this situation and fix this issue, we just need
> > to make a reservation (by calling hugetlb_reserve_pages()) before
> > we try to allocate the folio. This will ensure that we are properly
> > doing region/subpool accounting associated with our allocation.
> >
> > While at it, move subpool_inode() into hugetlb header and also
> > replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
> > crash the system in this scenario and instead we could just warn
> > and fail the allocation.
> >
> > ...
> >
> > @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> >  	struct folio *folio;
> >
> >  	spin_lock_irq(&hugetlb_lock);
> > +	if (WARN_ON_ONCE(!h->resv_huge_pages)) {
> > +		spin_unlock_irq(&hugetlb_lock);
> > +		return NULL;
> > +	}
> > +
> 
> What is it that we're warning of here?

The warning serves two purposes:
1) To flag a situation that is unexpected at that instant
2) To alert the callers (mostly future) that they need to somehow
   reserve their hugetlb folios before calling this function

> Is there any action which
> either kernel developers or the user can take to prevent this warning
> from being issued?

Yeah, the callers of this function need to make a reservation and
ensure that hugetlb_reserve_pages() gets called (probably via
hugetlbfs_file_mmap() or other possible means) before they get their
folios allocated via this function.

> 
> IOW, maybe the WARN shouldn't be present?

Instead of silently failing, warning the caller about the failure mode
seems like the right thing to do in this situation. However, I am OK
if this warning is not present (given that this is not a common
use-case as of now) but do you see any concern if it stays?

Thanks,
Vivek
* Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
  2025-01-14  8:08 ` [PATCH v2 1/2] " Vivek Kasireddy
  2025-01-15  3:32   ` Andrew Morton
@ 2025-01-17 16:54   ` David Hildenbrand
  2025-01-20  8:02     ` Kasireddy, Vivek
  1 sibling, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2025-01-17 16:54 UTC (permalink / raw)
  To: Vivek Kasireddy, dri-devel, linux-mm
  Cc: syzbot+a504cb5bae4fe117ba94, Steve Sistare, Muchun Song, Andrew Morton

On 14.01.25 09:08, Vivek Kasireddy wrote:
> There are cases when we try to pin a folio but discover that it has
> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> but there is a chance that we might encounter a crash/failure
> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> at that instant. This issue was reported by syzbot:
> 
> [...]
> 
> @@ -95,12 +99,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>  						    idx);
>  		if (err) {
>  			folio_put(folio);
> -			return ERR_PTR(err);
> +			goto err;
>  		}
> +
> +		hugetlb_set_folio_subpool(folio, subpool_inode(inode));
>  		folio_unlock(folio);
>  		return folio;
>  	}
> -	return ERR_PTR(-ENOMEM);
> +err:
> +	hugetlb_unreserve_pages(inode, idx, idx + 1, 0);

Hmmm, shouldn't we maybe only un-reserve if we were responsible for the
reservation above?

If it's already reserved before this call, we should probably leave it
as is?

Or maybe we never want to un-reserve at all here?

-- 
Cheers,

David / dhildenb
* RE: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
  2025-01-17 16:54   ` David Hildenbrand
@ 2025-01-20  8:02     ` Kasireddy, Vivek
  0 siblings, 0 replies; 7+ messages in thread
From: Kasireddy, Vivek @ 2025-01-20  8:02 UTC (permalink / raw)
  To: David Hildenbrand, dri-devel, linux-mm
  Cc: syzbot+a504cb5bae4fe117ba94, Steve Sistare, Muchun Song, Andrew Morton

Hi David,

> Subject: Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before
> allocation
> 
> [...]
> 
> > -	return ERR_PTR(-ENOMEM);
> > +err:
> > +	hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
> 
> Hmmm, shouldn't we maybe only un-reserve if we were responsible for the
> reservation above?

Good catch! Yes, I agree that un-reserving only if we were responsible
for the reservation is the right thing to do in this case.

> 
> If it's already reserved before this call, we should probably leave it
> as is?
Yeah, that makes sense. However, there is currently no way to know if a
range has already been reserved or not. One option I can think of is to
have hugetlb_reserve_pages() provide return info (nr_reserved?) about
how many pages were successfully reserved. This way we can know if we
were the ones responsible for the reservation.

> 
> Or maybe we never want to un-reserve at all here?

That also seems to be an option. AFAICS, un-reserving only appears to
happen as part of truncate/hole punch/inode eviction and not if
hugetlb_fault() fails. Unless I am overlooking something, not calling
hugetlb_unreserve_pages() if allocation fails seems to be OK.

Thanks,
Vivek

> 
> --
> Cheers,
> 
> David / dhildenb
* [PATCH v2 2/2] selftests/udmabuf: add a test to pin first before writing to memfd
  2025-01-14  8:07 [PATCH v2 0/2] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy
  2025-01-14  8:08 ` [PATCH v2 1/2] " Vivek Kasireddy
@ 2025-01-14  8:08 ` Vivek Kasireddy
  1 sibling, 0 replies; 7+ messages in thread
From: Vivek Kasireddy @ 2025-01-14  8:08 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, Gerd Hoffmann, Steve Sistare, Muchun Song,
      David Hildenbrand, Andrew Morton

Unlike the existing tests, this new test will create a memfd (backed
by hugetlb) and pin the folios in it (a small subset) before writing/
populating it with data. This is a valid use-case that invokes the
memfd_alloc_folio() kernel API and is expected to work unless there
aren't enough hugetlb folios to satisfy the allocation needs.

Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 .../selftests/drivers/dma-buf/udmabuf.c       | 20 ++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
index 6062723a172e..77aa2897e79f 100644
--- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c
+++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
@@ -138,7 +138,7 @@ int main(int argc, char *argv[])
 	void *addr1, *addr2;
 
 	ksft_print_header();
-	ksft_set_plan(6);
+	ksft_set_plan(7);
 
 	devfd = open("/dev/udmabuf", O_RDWR);
 	if (devfd < 0) {
@@ -248,6 +248,24 @@ int main(int argc, char *argv[])
 	else
 		ksft_test_result_pass("%s: [PASS,test-6]\n", TEST_PREFIX);
 
+	close(buf);
+	close(memfd);
+
+	/* same test as above but we pin first before writing to memfd */
+	page_size = getpagesize() * 512; /* 2 MB */
+	size = MEMFD_SIZE * page_size;
+	memfd = create_memfd_with_seals(size, true);
+	buf = create_udmabuf_list(devfd, memfd, size);
+	addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
+	addr1 = mmap_fd(memfd, size);
+	write_to_memfd(addr1, size, 'a');
+	write_to_memfd(addr1, size, 'b');
+	ret = compare_chunks(addr1, addr2, size);
+	if (ret < 0)
+		ksft_test_result_fail("%s: [FAIL,test-7]\n", TEST_PREFIX);
+	else
+		ksft_test_result_pass("%s: [PASS,test-7]\n", TEST_PREFIX);
+
 	close(buf);
 	close(memfd);
 	close(devfd);
-- 
2.47.1