* [PATCH v2 0/2] mm/memfd: reserve hugetlb folios before allocation
@ 2025-01-14 8:07 Vivek Kasireddy
2025-01-14 8:08 ` [PATCH v2 1/2] " Vivek Kasireddy
2025-01-14 8:08 ` [PATCH v2 2/2] selftests/udmabuf: add a test to pin first before writing to memfd Vivek Kasireddy
0 siblings, 2 replies; 7+ messages in thread
From: Vivek Kasireddy @ 2025-01-14 8:07 UTC (permalink / raw)
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, Gerd Hoffmann, Steve Sistare, Muchun Song,
David Hildenbrand, Andrew Morton
There are cases when we try to pin a folio but discover that it has
not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
but there is a chance that we might encounter a crash/failure
(VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
at that instant. This issue was reported by syzbot.
Therefore, to avoid this situation and fix this issue, we just need
to make a reservation (by calling hugetlb_reserve_pages()) before
we try to allocate the folio. This will ensure that we are properly
doing region/subpool accounting associated with our allocation.
-----------------------------
Patchset overview:
Patch 1: Fix for VM_BUG_ON(!h->resv_huge_pages) crash reported by syzbot
Patch 2: New udmabuf selftest to invoke memfd_alloc_folio()
This series is tested by running the new udmabuf selftest introduced
in patch #2 along with the other selftests.
Changelog:
v1 -> v2:
- Replace VM_BUG_ON() with WARN_ON_ONCE() in the function
alloc_hugetlb_folio_reserve() (David)
- Move the inline function subpool_inode() from hugetlb.c into the
relevant header (hugetlb.h)
- Call hugetlb_unreserve_pages() if the folio cannot be added to
the page cache as well
- Added a new udmabuf selftest to exercise the same path as that
of syzbot
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Vivek Kasireddy (2):
mm/memfd: reserve hugetlb folios before allocation
selftests/udmabuf: add a test to pin first before writing to memfd
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 14 ++++++-------
mm/memfd.c | 14 ++++++++++---
.../selftests/drivers/dma-buf/udmabuf.c | 20 ++++++++++++++++++-
4 files changed, 41 insertions(+), 12 deletions(-)
--
2.47.1
* [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
2025-01-14 8:07 [PATCH v2 0/2] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy
@ 2025-01-14 8:08 ` Vivek Kasireddy
2025-01-15 3:32 ` Andrew Morton
2025-01-17 16:54 ` David Hildenbrand
2025-01-14 8:08 ` [PATCH v2 2/2] selftests/udmabuf: add a test to pin first before writing to memfd Vivek Kasireddy
1 sibling, 2 replies; 7+ messages in thread
From: Vivek Kasireddy @ 2025-01-14 8:08 UTC (permalink / raw)
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, syzbot+a504cb5bae4fe117ba94, Steve Sistare,
Muchun Song, David Hildenbrand, Andrew Morton
There are cases when we try to pin a folio but discover that it has
not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
but there is a chance that we might encounter a crash/failure
(VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
at that instant. This issue was reported by syzbot:
kernel BUG at mm/hugetlb.c:2403!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Therefore, to avoid this situation and fix this issue, we just need
to make a reservation (by calling hugetlb_reserve_pages()) before
we try to allocate the folio. This will ensure that we are properly
doing region/subpool accounting associated with our allocation.
While at it, move subpool_inode() into hugetlb header and also
replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
crash the system in this scenario and instead we could just warn
and fail the allocation.
Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 14 ++++++--------
mm/memfd.c | 14 +++++++++++---
3 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ae4fe8615bb6..38c580548564 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -712,6 +712,11 @@ extern unsigned int default_hstate_idx;
#define default_hstate (hstates[default_hstate_idx])
+static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
+{
+ return HUGETLBFS_SB(inode->i_sb)->spool;
+}
+
static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio)
{
return folio->_hugetlb_subpool;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c498874a7170..ef948f56b864 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -251,11 +251,6 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
return ret;
}
-static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
-{
- return HUGETLBFS_SB(inode->i_sb)->spool;
-}
-
static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
{
return subpool_inode(file_inode(vma->vm_file));
@@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
struct folio *folio;
spin_lock_irq(&hugetlb_lock);
+ if (WARN_ON_ONCE(!h->resv_huge_pages)) {
+ spin_unlock_irq(&hugetlb_lock);
+ return NULL;
+ }
+
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
nmask);
- if (folio) {
- VM_BUG_ON(!h->resv_huge_pages);
+ if (folio)
h->resv_huge_pages--;
- }
spin_unlock_irq(&hugetlb_lock);
return folio;
diff --git a/mm/memfd.c b/mm/memfd.c
index 35a370d75c9a..0d128c44fb78 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -70,7 +70,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
#ifdef CONFIG_HUGETLB_PAGE
struct folio *folio;
gfp_t gfp_mask;
- int err;
+ int err = -ENOMEM;
if (is_file_hugepages(memfd)) {
/*
@@ -79,12 +79,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
* alloc from. Also, the folio will be pinned for an indefinite
* amount of time, so it is not expected to be migrated away.
*/
+ struct inode *inode = file_inode(memfd);
struct hstate *h = hstate_file(memfd);
gfp_mask = htlb_alloc_mask(h);
gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
idx >>= huge_page_order(h);
+ if (!hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0))
+ return ERR_PTR(err);
+
folio = alloc_hugetlb_folio_reserve(h,
numa_node_id(),
NULL,
@@ -95,12 +99,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
idx);
if (err) {
folio_put(folio);
- return ERR_PTR(err);
+ goto err;
}
+
+ hugetlb_set_folio_subpool(folio, subpool_inode(inode));
folio_unlock(folio);
return folio;
}
- return ERR_PTR(-ENOMEM);
+err:
+ hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
+ return ERR_PTR(err);
}
#endif
return shmem_read_folio(memfd->f_mapping, idx);
--
2.47.1
* [PATCH v2 2/2] selftests/udmabuf: add a test to pin first before writing to memfd
2025-01-14 8:07 [PATCH v2 0/2] mm/memfd: reserve hugetlb folios before allocation Vivek Kasireddy
2025-01-14 8:08 ` [PATCH v2 1/2] " Vivek Kasireddy
@ 2025-01-14 8:08 ` Vivek Kasireddy
1 sibling, 0 replies; 7+ messages in thread
From: Vivek Kasireddy @ 2025-01-14 8:08 UTC (permalink / raw)
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, Gerd Hoffmann, Steve Sistare, Muchun Song,
David Hildenbrand, Andrew Morton
Unlike the existing tests, this new test will create a memfd (backed
by hugetlb) and pin the folios in it (a small subset) before writing/
populating it with data. This is a valid use-case that invokes the
memfd_alloc_folio() kernel API and is expected to work unless there
aren't enough hugetlb folios to satisfy the allocation needs.
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
.../selftests/drivers/dma-buf/udmabuf.c | 20 ++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
index 6062723a172e..77aa2897e79f 100644
--- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c
+++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
@@ -138,7 +138,7 @@ int main(int argc, char *argv[])
void *addr1, *addr2;
ksft_print_header();
- ksft_set_plan(6);
+ ksft_set_plan(7);
devfd = open("/dev/udmabuf", O_RDWR);
if (devfd < 0) {
@@ -248,6 +248,24 @@ int main(int argc, char *argv[])
else
ksft_test_result_pass("%s: [PASS,test-6]\n", TEST_PREFIX);
+ close(buf);
+ close(memfd);
+
+ /* same test as above but we pin first before writing to memfd */
+ page_size = getpagesize() * 512; /* 2 MB */
+ size = MEMFD_SIZE * page_size;
+ memfd = create_memfd_with_seals(size, true);
+ buf = create_udmabuf_list(devfd, memfd, size);
+ addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
+ addr1 = mmap_fd(memfd, size);
+ write_to_memfd(addr1, size, 'a');
+ write_to_memfd(addr1, size, 'b');
+ ret = compare_chunks(addr1, addr2, size);
+ if (ret < 0)
+ ksft_test_result_fail("%s: [FAIL,test-7]\n", TEST_PREFIX);
+ else
+ ksft_test_result_pass("%s: [PASS,test-7]\n", TEST_PREFIX);
+
close(buf);
close(memfd);
close(devfd);
--
2.47.1
* Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
2025-01-14 8:08 ` [PATCH v2 1/2] " Vivek Kasireddy
@ 2025-01-15 3:32 ` Andrew Morton
2025-01-16 5:58 ` Kasireddy, Vivek
2025-01-17 16:54 ` David Hildenbrand
1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2025-01-15 3:32 UTC (permalink / raw)
To: Vivek Kasireddy
Cc: dri-devel, linux-mm, syzbot+a504cb5bae4fe117ba94, Steve Sistare,
Muchun Song, David Hildenbrand
On Tue, 14 Jan 2025 00:08:00 -0800 Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> There are cases when we try to pin a folio but discover that it has
> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> but there is a chance that we might encounter a crash/failure
> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> at that instant. This issue was reported by syzbot:
>
> kernel BUG at mm/hugetlb.c:2403!
>
> ...
>
> Therefore, to avoid this situation and fix this issue, we just need
> to make a reservation (by calling hugetlb_reserve_pages()) before
> we try to allocate the folio. This will ensure that we are properly
> doing region/subpool accounting associated with our allocation.
>
> While at it, move subpool_inode() into hugetlb header and also
> replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
> crash the system in this scenario and instead we could just warn
> and fail the allocation.
>
> ...
>
> @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> struct folio *folio;
>
> spin_lock_irq(&hugetlb_lock);
> + if (WARN_ON_ONCE(!h->resv_huge_pages)) {
> + spin_unlock_irq(&hugetlb_lock);
> + return NULL;
> + }
> +
What is it that we're warning of here? Is there any action which
either kernel developers or the user can take to prevent this warning
from being issued?
IOW, maybe the WARN shouldn't be present?
* RE: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
2025-01-15 3:32 ` Andrew Morton
@ 2025-01-16 5:58 ` Kasireddy, Vivek
0 siblings, 0 replies; 7+ messages in thread
From: Kasireddy, Vivek @ 2025-01-16 5:58 UTC (permalink / raw)
To: Andrew Morton
Cc: dri-devel, linux-mm, syzbot+a504cb5bae4fe117ba94, Steve Sistare,
Muchun Song, David Hildenbrand
Hi Andrew,
> Subject: Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before
> allocation
>
>
> > There are cases when we try to pin a folio but discover that it has
> > not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> > but there is a chance that we might encounter a crash/failure
> > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> > at that instant. This issue was reported by syzbot:
> >
> > kernel BUG at mm/hugetlb.c:2403!
> >
> > ...
> >
> > Therefore, to avoid this situation and fix this issue, we just need
> > to make a reservation (by calling hugetlb_reserve_pages()) before
> > we try to allocate the folio. This will ensure that we are properly
> > doing region/subpool accounting associated with our allocation.
> >
> > While at it, move subpool_inode() into hugetlb header and also
> > replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
> > crash the system in this scenario and instead we could just warn
> > and fail the allocation.
> >
> > ...
> >
> > @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct
> hstate *h, int preferred_nid,
> > struct folio *folio;
> >
> > spin_lock_irq(&hugetlb_lock);
> > + if (WARN_ON_ONCE(!h->resv_huge_pages)) {
> > + spin_unlock_irq(&hugetlb_lock);
> > + return NULL;
> > + }
> > +
>
> What is it that we're warning of here?
The warning serves two purposes:
1) To flag a situation that is unexpected at that instant
2) To alert the callers (mostly future) that they need to somehow reserve
their hugetlb folios before calling this function
> Is there any action which
> either kernel developers or the user can take to prevent this warning
> from being issued?
Yeah, the callers of this function need to make a reservation and ensure that
hugetlb_reserve_pages() gets called (probably via hugetlbfs_file_mmap() or
other possible means) before they get their folios allocated via this function.
>
> IOW, maybe the WARN shouldn't be present?
Instead of silently failing, warning the caller about the failure mode seems
like the right thing to do in this situation. However, I am OK if this warning is
not present (given that this is not a common use-case as of now) but do you
see any concern if it stays?
Thanks,
Vivek
* Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
2025-01-14 8:08 ` [PATCH v2 1/2] " Vivek Kasireddy
2025-01-15 3:32 ` Andrew Morton
@ 2025-01-17 16:54 ` David Hildenbrand
2025-01-20 8:02 ` Kasireddy, Vivek
1 sibling, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2025-01-17 16:54 UTC (permalink / raw)
To: Vivek Kasireddy, dri-devel, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Steve Sistare, Muchun Song, Andrew Morton
On 14.01.25 09:08, Vivek Kasireddy wrote:
> There are cases when we try to pin a folio but discover that it has
> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> but there is a chance that we might encounter a crash/failure
> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> at that instant. This issue was reported by syzbot:
>
> kernel BUG at mm/hugetlb.c:2403!
> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> vfs_ioctl fs/ioctl.c:51 [inline]
> __do_sys_ioctl fs/ioctl.c:906 [inline]
> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Therefore, to avoid this situation and fix this issue, we just need
> to make a reservation (by calling hugetlb_reserve_pages()) before
> we try to allocate the folio. This will ensure that we are properly
> doing region/subpool accounting associated with our allocation.
>
> While at it, move subpool_inode() into hugetlb header and also
> replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
> crash the system in this scenario and instead we could just warn
> and fail the allocation.
>
> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
> Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Cc: Steve Sistare <steven.sistare@oracle.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> include/linux/hugetlb.h | 5 +++++
> mm/hugetlb.c | 14 ++++++--------
> mm/memfd.c | 14 +++++++++++---
> 3 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ae4fe8615bb6..38c580548564 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -712,6 +712,11 @@ extern unsigned int default_hstate_idx;
>
> #define default_hstate (hstates[default_hstate_idx])
>
> +static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
> +{
> + return HUGETLBFS_SB(inode->i_sb)->spool;
> +}
> +
> static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio)
> {
> return folio->_hugetlb_subpool;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c498874a7170..ef948f56b864 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -251,11 +251,6 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
> return ret;
> }
>
> -static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
> -{
> - return HUGETLBFS_SB(inode->i_sb)->spool;
> -}
> -
> static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> {
> return subpool_inode(file_inode(vma->vm_file));
> @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> struct folio *folio;
>
> spin_lock_irq(&hugetlb_lock);
> + if (WARN_ON_ONCE(!h->resv_huge_pages)) {
> + spin_unlock_irq(&hugetlb_lock);
> + return NULL;
> + }
> +
> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
> nmask);
> - if (folio) {
> - VM_BUG_ON(!h->resv_huge_pages);
> + if (folio)
> h->resv_huge_pages--;
> - }
>
> spin_unlock_irq(&hugetlb_lock);
> return folio;
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 35a370d75c9a..0d128c44fb78 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -70,7 +70,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> #ifdef CONFIG_HUGETLB_PAGE
> struct folio *folio;
> gfp_t gfp_mask;
> - int err;
> + int err = -ENOMEM;
>
> if (is_file_hugepages(memfd)) {
> /*
> @@ -79,12 +79,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> * alloc from. Also, the folio will be pinned for an indefinite
> * amount of time, so it is not expected to be migrated away.
> */
> + struct inode *inode = file_inode(memfd);
> struct hstate *h = hstate_file(memfd);
>
> gfp_mask = htlb_alloc_mask(h);
> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> idx >>= huge_page_order(h);
>
> + if (!hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0))
> + return ERR_PTR(err);
> +
> folio = alloc_hugetlb_folio_reserve(h,
> numa_node_id(),
> NULL,
> @@ -95,12 +99,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> idx);
> if (err) {
> folio_put(folio);
> - return ERR_PTR(err);
> + goto err;
> }
> +
> + hugetlb_set_folio_subpool(folio, subpool_inode(inode));
> folio_unlock(folio);
> return folio;
> }
> - return ERR_PTR(-ENOMEM);
> +err:
> + hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
Hmmm, shouldn't we maybe only un-reserve if we were responsible for the
reservation above?
If it's already reserved before this call, we should probably leave it
as is?
Or maybe we never want to un-reserve at all here?
--
Cheers,
David / dhildenb
* RE: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before allocation
2025-01-17 16:54 ` David Hildenbrand
@ 2025-01-20 8:02 ` Kasireddy, Vivek
0 siblings, 0 replies; 7+ messages in thread
From: Kasireddy, Vivek @ 2025-01-20 8:02 UTC (permalink / raw)
To: David Hildenbrand, dri-devel, linux-mm
Cc: syzbot+a504cb5bae4fe117ba94, Steve Sistare, Muchun Song, Andrew Morton
Hi David,
> Subject: Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before
> allocation
>
> On 14.01.25 09:08, Vivek Kasireddy wrote:
> > There are cases when we try to pin a folio but discover that it has
> > not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> > but there is a chance that we might encounter a crash/failure
> > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> > at that instant. This issue was reported by syzbot:
> >
> > kernel BUG at mm/hugetlb.c:2403!
> > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> > FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> > knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <TASK>
> > memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> > memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> > udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> > udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> > udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> > udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> > vfs_ioctl fs/ioctl.c:51 [inline]
> > __do_sys_ioctl fs/ioctl.c:906 [inline]
> > __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> > do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Therefore, to avoid this situation and fix this issue, we just need
> > to make a reservation (by calling hugetlb_reserve_pages()) before
> > we try to allocate the folio. This will ensure that we are properly
> > doing region/subpool accounting associated with our allocation.
> >
> > While at it, move subpool_inode() into hugetlb header and also
> > replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
> > crash the system in this scenario and instead we could just warn
> > and fail the allocation.
> >
> > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages
> leak")
> > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > Cc: Steve Sistare <steven.sistare@oracle.com>
> > Cc: Muchun Song <muchun.song@linux.dev>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > include/linux/hugetlb.h | 5 +++++
> > mm/hugetlb.c | 14 ++++++--------
> > mm/memfd.c | 14 +++++++++++---
> > 3 files changed, 22 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index ae4fe8615bb6..38c580548564 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -712,6 +712,11 @@ extern unsigned int default_hstate_idx;
> >
> > #define default_hstate (hstates[default_hstate_idx])
> >
> > +static inline struct hugepage_subpool *subpool_inode(struct inode
> *inode)
> > +{
> > + return HUGETLBFS_SB(inode->i_sb)->spool;
> > +}
> > +
> > static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio
> *folio)
> > {
> > return folio->_hugetlb_subpool;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c498874a7170..ef948f56b864 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -251,11 +251,6 @@ static long hugepage_subpool_put_pages(struct
> hugepage_subpool *spool,
> > return ret;
> > }
> >
> > -static inline struct hugepage_subpool *subpool_inode(struct inode
> *inode)
> > -{
> > - return HUGETLBFS_SB(inode->i_sb)->spool;
> > -}
> > -
> > static inline struct hugepage_subpool *subpool_vma(struct
> vm_area_struct *vma)
> > {
> > return subpool_inode(file_inode(vma->vm_file));
> > @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct
> hstate *h, int preferred_nid,
> > struct folio *folio;
> >
> > spin_lock_irq(&hugetlb_lock);
> > + if (WARN_ON_ONCE(!h->resv_huge_pages)) {
> > + spin_unlock_irq(&hugetlb_lock);
> > + return NULL;
> > + }
> > +
> > folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
> preferred_nid,
> > nmask);
> > - if (folio) {
> > - VM_BUG_ON(!h->resv_huge_pages);
> > + if (folio)
> > h->resv_huge_pages--;
> > - }
> >
> > spin_unlock_irq(&hugetlb_lock);
> > return folio;
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 35a370d75c9a..0d128c44fb78 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -70,7 +70,7 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> > #ifdef CONFIG_HUGETLB_PAGE
> > struct folio *folio;
> > gfp_t gfp_mask;
> > - int err;
> > + int err = -ENOMEM;
> >
> > if (is_file_hugepages(memfd)) {
> > /*
> > @@ -79,12 +79,16 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> > * alloc from. Also, the folio will be pinned for an indefinite
> > * amount of time, so it is not expected to be migrated away.
> > */
> > + struct inode *inode = file_inode(memfd);
> > struct hstate *h = hstate_file(memfd);
> >
> > gfp_mask = htlb_alloc_mask(h);
> > gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> > idx >>= huge_page_order(h);
> >
> > + if (!hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0))
> > + return ERR_PTR(err);
> > +
> > folio = alloc_hugetlb_folio_reserve(h,
> > numa_node_id(),
> > NULL,
> > @@ -95,12 +99,16 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> > idx);
> > if (err) {
> > folio_put(folio);
> > - return ERR_PTR(err);
> > + goto err;
> > }
> > +
> > + hugetlb_set_folio_subpool(folio,
> subpool_inode(inode));
> > folio_unlock(folio);
> > return folio;
> > }
> > - return ERR_PTR(-ENOMEM);
> > +err:
> > + hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
>
> Hmmm, shouldn't we maybe only un-reserve if we were responsible for the
> reservation above?
Good catch! Yes, I agree that un-reserving, only if we were responsible for
the reservation is the right thing to do in this case.
>
> If it's already reserved before this call, we should probably leave it
> as is?
Yeah, that makes sense. However, there is currently no way to know if a range has
already been reserved or not. One option I can think of is to have hugetlb_reserve_pages()
provide return info (nr_reserved?) about how many pages were successfully reserved.
This way we can know if we were the ones responsible for the reservation.
>
> Or maybe we never want to un-reserve at all here?
That also seems to be an option. AFAICS, un-reserving only appears to happen as
part of truncate/hole punch/inode eviction and not if hugetlb_fault() fails. Unless
I am overlooking something, not calling hugetlb_unreserve_pages() if allocation
fails seems to be OK.
Thanks,
Vivek
>
> --
> Cheers,
>
> David / dhildenb