From: Steven Sistare <steven.sistare@oracle.com>
To: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: "syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com"
<syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com>,
Muchun Song <muchun.song@linux.dev>,
David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
Date: Fri, 10 Jan 2025 10:22:47 -0500
Message-ID: <2b428c92-8953-45cf-be48-b73f8efc9261@oracle.com>
In-Reply-To: <IA0PR11MB718540BD7FF194FC135DC5A4F81C2@IA0PR11MB7185.namprd11.prod.outlook.com>
On 1/10/2025 1:17 AM, Kasireddy, Vivek wrote:
> Hi Steve,
>
>> Subject: Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
>>
>>>>> There are cases when we try to pin a folio but discover that it has
>>>>> not been faulted in. So, we try to allocate it in memfd_alloc_folio(),
>>>>> but there is a chance that we encounter a crash
>>>>> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
>>>>> at that instant. This issue was reported by syzbot:
>>>>>
>>>>> kernel BUG at mm/hugetlb.c:2403!
>>>>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
>>>>> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
>>>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
>>>>> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
>>>>> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
>>>>> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
>>>>> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
>>>>> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
>>>>> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
>>>>> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
>>>>> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
>>>>> FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> Call Trace:
>>>>> <TASK>
>>>>> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
>>>>> memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
>>>>> udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
>>>>> udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
>>>>> udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
>>>>> udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
>>>>> vfs_ioctl fs/ioctl.c:51 [inline]
>>>>> __do_sys_ioctl fs/ioctl.c:906 [inline]
>>>>> __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
>>>>> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>>>>> do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>>>>> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>>>>
>>>>> Therefore, to avoid this situation and fix this issue, we just need
>>>>> to make a reservation before we try to allocate the folio. While at
>>>>> it, also remove the VM_BUG_ON(), as there is no need to crash the
>>>>> system in this scenario; we can simply fail the allocation instead.
>>>>>
>>>>> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
>>>>> Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
>>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
>>>>> Cc: Steve Sistare <steven.sistare@oracle.com>
>>>>> Cc: Muchun Song <muchun.song@linux.dev>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>> ---
>>>>> mm/hugetlb.c | 9 ++++++---
>>>>> mm/memfd.c | 5 +++++
>>>>> 2 files changed, 11 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index c498874a7170..e46c461210a4 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>>>>> struct folio *folio;
>>>>>
>>>>> spin_lock_irq(&hugetlb_lock);
>>>>> + if (!h->resv_huge_pages) {
>>>>> + spin_unlock_irq(&hugetlb_lock);
>>>>> + return NULL;
>>>>> + }
>>>>
>>>> This should be the entire fix, plus deleting the VM_BUG_ON. See below.
>>>>
>>>>> +
>>>>> folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
>>>>> nmask);
>>>>> - if (folio) {
>>>>> - VM_BUG_ON(!h->resv_huge_pages);
>>>>> + if (folio)
>>>>> h->resv_huge_pages--;
>>>>> - }
>>>>>
>>>>> spin_unlock_irq(&hugetlb_lock);
>>>>> return folio;
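
For reference, with the early check added and the VM_BUG_ON() dropped, the
whole function reads roughly as below. This is a sketch reconstructed from
the hunk above, not a verbatim copy of the tree:

struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
					  nodemask_t *nmask, gfp_t gfp_mask)
{
	struct folio *folio;

	spin_lock_irq(&hugetlb_lock);

	/* Fail cleanly instead of BUG-ing when no reservation exists. */
	if (!h->resv_huge_pages) {
		spin_unlock_irq(&hugetlb_lock);
		return NULL;
	}

	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid, nmask);
	if (folio)
		h->resv_huge_pages--;

	spin_unlock_irq(&hugetlb_lock);
	return folio;
}
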
>>>>> diff --git a/mm/memfd.c b/mm/memfd.c
>>>>> index 35a370d75c9a..a3012c444285 100644
>>>>> --- a/mm/memfd.c
>>>>> +++ b/mm/memfd.c
>>>>> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>>>>> gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
>>>>> idx >>= huge_page_order(h);
>>>>>
>>>>> + if (!hugetlb_reserve_pages(file_inode(memfd),
>>>>> + idx, idx + 1, NULL, 0))
>>>>> + return ERR_PTR(-ENOMEM);
>>>>
>>>> I believe it is wrong to force a reservation here.
>>> Is there any particular reason why you believe a reservation here would be wrong?
>>> AFAICS, at the moment, we are not doing any region/subpool accounting before
>>> our folio allocation, and this gets flagged in the form of an elevated
>>> resv_huge_pages value (hugetlb_acct_memory() elevates it based on the return
>>> value of region_del()) when hugetlb_unreserve_pages() eventually gets called.
>>>
>>>> Pages should have already been reserved at this point, e.g. by calls from
>>>> hugetlbfs_file_mmap or hugetlb_file_setup.
>>> hugetlb_file_setup() does not reserve any pages as it passes in VM_NORESERVE.
>>> And, the case we are trying to address is exactly when hugetlbfs_file_mmap() does
>>> not get called before pinning.
>>
>> But you must not break the case where hugetlbfs_file_mmap was called first, which
> AFAICS, this patch does not break the primary use-case where hugetlbfs_file_mmap()
> gets called first.
>
>> reserves, then memfd_alloc_folio is called, which reserves again with your fix. Does
>> that work correctly, or do the counts go bad?
> Yes, it works correctly, and `cat /proc/meminfo | grep Huge` does not show any
> unexpected counts.
>
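
Aside: if you want to script that check instead of eyeballing meminfo, the
reserved count is easy to read programmatically. A minimal sketch, nothing
udmabuf-specific:

#include <stdio.h>

/* Return HugePages_Rsvd from /proc/meminfo, or -1 if it cannot be read. */
static long huge_pages_rsvd(void)
{
	char line[128];
	long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "HugePages_Rsvd: %ld", &val) == 1)
			break;
	fclose(f);
	return val;
}

Sampling this before and after each step of a test makes a leaked or
double-counted reservation obvious.
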
>>
>>> So, when hugetlbfs_file_mmap() does eventually get called, I don't see any
>>> problem if it calls hugetlb_reserve_pages() again for the same range or
>>> overlapping ranges.
>>
>> Does that work correctly, or do the counts go bad?
> It works correctly, as expected. And, as mentioned to David in another email,
> if a range is already reserved and hugetlb_reserve_pages() gets called again
> for the same range (as in this use-case), it mostly becomes a no-op, since
> region_chg() and region_add() return 0 (based on my light testing).
Great, thanks.
>> Please try those scenarios with your test program: mmap + memfd_alloc_folio,
>> and memfd_alloc_folio + mmap.
> Here are the scenarios I tried (with the calls in the exact order as below):
> 1) hugetlbfs_file_mmap()
> memfd_alloc_folio()
> hugetlb_fault()
>
> 2) memfd_alloc_folio()
> hugetlbfs_file_mmap()
> hugetlb_fault()
>
> 3) hugetlbfs_file_mmap()
> hugetlb_fault()
> alloc_hugetlb_folio()
>
> memfd_alloc_folio() does not get called in the last case as the folio is already
> allocated (and is present in the page cache). While testing all these cases, I
> did not notice any splats or messed-up counts.
>
> If it is not too much trouble, could you please try this patch with your test-cases?
Done, and all tests pass. I will review your V2 patch.
Thanks for working on this bug.
- Steve
>>>> syzkaller has forced its way down this path without calling those
>>>> prerequisites, doing weird stuff as it should.
>>> This issue is not very hard to reproduce. If we have free_huge_pages > 0 and
>>> resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/
>>> hugetlbfs_file_mmap(), we can easily encounter this issue. Furthermore, we
>>> should be able to allocate a folio in this scenario (as available_huge_pages > 0),
>>> which we would not be able to do if we don't call hugetlb_reserve_pages().
>>> Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in
>>> this case and kind of gives a go-ahead for the allocation.
>>>
>>> I have used a slightly modified udmabuf selftest to reproduce this issue, which
>>> I'll send out as part of v2.
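
For the archives: the problematic ordering is easy to hit from user space,
since udmabuf creation pins the folios before any fault-in. A minimal
reproducer sketch (error checking omitted; it assumes 2 MB default huge
pages, a nonzero vm.nr_hugepages, and access to /dev/udmabuf):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

#define HPAGE_SIZE	(2UL << 20)	/* assumes 2 MB default huge pages */

int main(void)
{
	struct udmabuf_create create = { 0 };
	int devfd = open("/dev/udmabuf", O_RDWR);
	int memfd = memfd_create("repro", MFD_HUGETLB | MFD_ALLOW_SEALING);

	ftruncate(memfd, HPAGE_SIZE);
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);	/* required by udmabuf */

	/*
	 * Pin before any mmap()/fault-in: with free_huge_pages > 0 and
	 * resv_huge_pages == 0, this path used to trip the VM_BUG_ON().
	 */
	create.memfd = memfd;
	create.offset = 0;
	create.size = HPAGE_SIZE;
	ioctl(devfd, UDMABUF_CREATE, &create);
	return 0;
}
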
>>>
>>> Thanks,
>>> Vivek
>>>
>>>>
>>>> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.
>>>>
>>>> - Steve
>>>>
>>>>> +
>>>>> folio = alloc_hugetlb_folio_reserve(h,
>>>>> numa_node_id(),
>>>>> NULL,
>>>>> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>>>>> folio_unlock(folio);
>>>>> return folio;
>>>>> }
>>>>> + hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
>>>>> return ERR_PTR(-ENOMEM);
>>>>> }
>>>>> #endif
>>>
>
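
P.S. For anyone skimming the archive later: with both memfd.c hunks applied,
the hugetlb branch of memfd_alloc_folio() ends up looking roughly like the
sketch below. The page-cache insertion between allocation and unlock is
elided, and details may differ from the final tree:

struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
{
#ifdef CONFIG_HUGETLB_PAGE
	if (is_file_hugepages(memfd)) {
		struct hstate *h = hstate_file(memfd);
		gfp_t gfp_mask = htlb_alloc_mask(h);
		struct folio *folio;

		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
		idx >>= huge_page_order(h);

		/* Added: take a reservation so the dequeue cannot underflow. */
		if (!hugetlb_reserve_pages(file_inode(memfd),
					   idx, idx + 1, NULL, 0))
			return ERR_PTR(-ENOMEM);

		folio = alloc_hugetlb_folio_reserve(h, numa_node_id(),
						    NULL, gfp_mask);
		if (folio) {
			/* ... add the folio to the memfd's page cache ... */
			folio_unlock(folio);
			return folio;
		}

		/* Added: drop the reservation again if allocation failed. */
		hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
		return ERR_PTR(-ENOMEM);
	}
#endif
	return shmem_read_folio(memfd->f_mapping, idx);
}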