From: Mina Almasry <almasrymina@google.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Linux-MM <linux-mm@kvack.org>,
Axel Rasmussen <axelrasmussen@google.com>,
aarcange@redhat.com, peterx@redhat.com
Subject: Re: resv_huge_page underflow with userfaultfd test
Date: Mon, 10 May 2021 17:52:28 -0700 [thread overview]
Message-ID: <CAHS8izOz_ogEjLDQac8iR5O+w1MBanoiJzJ9XaikUzurp+dPXQ@mail.gmail.com> (raw)
In-Reply-To: <b9d15426-9fcf-8133-94bd-4ba9a8fec272@oracle.com>
On Mon, May 10, 2021 at 5:33 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/7/21 2:21 PM, Mina Almasry wrote:
> > Hi folks,
> >
> > I ran into a bug that I'm not sure how to solve so I'm wondering if
> > anyone has suggestions on what the issue could be and how to
> > investigate. I added the WARN_ON_ONCE() here to catch instances of
> > resv_huge_pages underflowing:
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 629aa4c2259c..7d763eed650f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1165,7 +1165,21 @@ static struct page
> > *dequeue_huge_page_vma(struct hstate *h,
> > page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> > if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
> > SetHPageRestoreReserve(page);
> > + WARN_ON_ONCE(!h->resv_huge_pages);
> > h->resv_huge_pages--;
> > }
> >
> > And ran the userfaultfd selftests like so:
> >
> > echo 1024 > /proc/sys/vm/nr_hugepages
> > mkdir -p /mnt/huge
> > mount -t hugetlbfs none /mnt/huge
> > ./tools/testings/selftests/vm/userfaultfd hugetlb_shared 1024 200
> > /mnt/huge/userfaultfd_test
> >
> > And run into this warning indicating this test does discover an underflow:
> >
> > [ 11.163403] ------------[ cut here ]------------
> > [ 11.163404] WARNING: CPU: 0 PID: 237 at mm/hugetlb.c:1178
> > alloc_huge_page+0x558/0x5a0
> > [ 11.163413] Modules linked in:
> > [ 11.163419] CPU: 0 PID: 237 Comm: userfaultfd Not tainted 5.12.0-dbg-DEV #135
> > [ 11.163424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS 1.14.0-2 04/01/2014
> > [ 11.163429] RIP: 0010:alloc_huge_page+0x558/0x5a0
> > [ 11.163432] Code: b0 00 0f 85 3d ff ff ff e9 2a ff ff ff be 01 00
> > 00 00 48 89 df e8 18 e7 ff ff 48 f7 d8 4c 89 ef 48 89 c6 e8 da d7 ff
> > ff eb 8c <0f> 0b 4d 8b 85 c0 00 00 00 e9 95 fd ff ff e8 35 59 84 00 4c
> > 897
> > [ 11.163434] RSP: 0018:ffff94bb0073fc80 EFLAGS: 00010046
> > [ 11.163436] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 5fa252c406a76700
> > [ 11.163438] RDX: c0000000ffff7fff RSI: 0000000000000004 RDI:
> > 0000000000017ffd
> > [ 11.163439] RBP: ffff94bb0073fcf8 R08: 0000000000000000 R09:
> > ffffffff9813ba70
> > [ 11.163440] R10: 00000000ffff7fff R11: 0000000000000000 R12:
> > ffff8ac7800558c8
> > [ 11.163442] R13: ffffffff993f8880 R14: 00007f0dfa200000 R15:
> > ffffed85453e0000
> > [ 11.163443] FS: 00007f0d731fc700(0000) GS:ffff8acba9400000(0000)
> > knlGS:0000000000000000
> > [ 11.163445] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 11.163448] CR2: 00007f0e65e00028 CR3: 0000000108d50003 CR4:
> > 0000000000370ef0
> > [ 11.163452] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [ 11.163453] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [ 11.163455] Call Trace:
> > [ 11.163468] hugetlb_mcopy_atomic_pte+0xcb/0x450
> > [ 11.163477] mcopy_atomic+0xa08/0xd60
> > [ 11.163480] ? __might_fault+0x56/0x80
> > [ 11.163493] userfaultfd_ioctl+0xb18/0xd60
> > [ 11.163502] __se_sys_ioctl+0x77/0xc0
> > [ 11.163507] __x64_sys_ioctl+0x1d/0x20
> > [ 11.163510] do_syscall_64+0x3f/0x80
> > [ 11.163515] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [ 11.163519] RIP: 0033:0x45ec87
> > [ 11.163531] Code: 3c 1c 48 f7 d8 49 39 c4 72 b8 e8 64 63 03 00 85
> > c0 78 bd 48 83 c4 08 4c 89 e0 5b 41 5c c3 0f 1f 44 00 00 b8 10 00 00
> > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89
> > 018
> > [ 11.163532] RSP: 002b:00007f0d731fc248 EFLAGS: 00000206 ORIG_RAX:
> > 0000000000000010
> > [ 11.163534] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045ec87
> > [ 11.163536] RDX: 00007f0d731fc290 RSI: 00000000c028aa03 RDI: 0000000000000004
> > [ 11.163537] RBP: 00007f0d731fc270 R08: 00000000004022b3 R09: 00007f0d731fc700
> > [ 11.163538] R10: 00007f0d731fc9d0 R11: 0000000000000206 R12: 00007fff610cd82e
> > [ 11.163539] R13: 00007fff610cd82f R14: 00007f0d731fc400 R15: 0000000001002000
> > [ 11.163549] irq event stamp: 722
> > [ 11.163550] hardirqs last enabled at (721): [<ffffffff967cd41b>]
> > kmem_cache_alloc_trace+0x1db/0x370
> > [ 11.163558] hardirqs last disabled at (722): [<ffffffff9700c052>]
> > _raw_spin_lock_irq+0x32/0x80
> > [ 11.163560] softirqs last enabled at (130): [<ffffffff9654e0d6>]
> > __irq_exit_rcu+0xf6/0x100
> > [ 11.163564] softirqs last disabled at (125): [<ffffffff9654e0d6>]
> > __irq_exit_rcu+0xf6/0x100
> > [ 11.163567] ---[ end trace 358ac5c76c211ea1 ]---
> >
> > Debugging further I find the resv_huge_pages underflows by 1
> > temporarily during the run of the test multiple times, but a
> > __free_huge_page() is always subsequently called that overflows it
> > back to 0. resv_huge_pages is always 0 at the end of the test. I've
> > initially looked at this as I suspected a problem in the
> > resv_huge_pages accounting, but seems the resv_huge_pages accounting
> > is fine in itself as it correctly decrements resv_huge_pages when a
> > page is allocated from reservation and correctly increments it back up
> > when that page is freed.
> >
> > I'm not that familiar with the userfaultfd/hugetlb code so I was
> > hoping to solicit some suggestions for what the issue could be. Things
> > I've tried so far:
> >
> > - Adding code that prevents resv_huge_pages to underflow causes the
> > test to fail, so it seems in this test the calling code actually
> > expects to be able to temporarily allocate 1 more page than the VMA
> > has reserved, which seems like a bug maybe?
> > - Modifying hugetlb_mcopy_atomic_pte() to not use reserved pages
> > causes the test to fail again. Doin that and overprovisioning
> > /proc/sys/vm/nr_hugepages causes the test to pass again but I'm not
> > sure that's right (not familiar with the code).
> > - The failure gets reproduced as far back as 5.11, so it doesn't seem
> > to be related to any recent changes.
>
> Hi Mina,
>
> I am fairly confident the issue is with hugetlb_mcopy_atomic_pte. It
> does not detect/handle the case where a page cache page already exists
> when in MCOPY_ATOMIC_NORMAL mode. If you add a printk/warning after the
> failure of huge_add_to_page_cache, these will generally correspond to
> the underflow. From a reservation POV, if the page exists in the cache
> the reservation was already consumed. The call to alloc_huge_page will
> 'consume' another reservation which can lead to the underflow. As you
> noted, this underflow gets cleaned up in the error path. However, we
> should prevent it from happening as we do not want anyone making
> decisions on that underflow value.
>
> hugetlb_mcopy_atomic_pte should check for a page in the cache and if it
> exists use it in MCOPY_ATOMIC_NORMAL. This code is quite tricky and my
> first simple attempt at this did not work. I am happy to continue
> working on this. However, if you or anyone else want to jump in and fix
> feel free.
Thank you! This is the guidance I was looking for. I'll jump in and
try to fix it definitely but I'm learning the code, etc and will not
be terribly quick about this unfortunately. I will start looking into
this nevertheless and take a stab at it.
Thanks so much!
> --
> Mike Kravetz
next prev parent reply other threads:[~2021-05-11 0:52 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-05-07 21:21 Mina Almasry
2021-05-11 0:33 ` Mike Kravetz
2021-05-11 0:52 ` Mina Almasry [this message]
2021-05-11 6:45 ` Axel Rasmussen
2021-05-11 7:08 ` Mina Almasry
2021-05-11 16:38 ` Mike Kravetz
2021-05-11 19:08 ` Mina Almasry
2021-05-12 2:25 ` Mike Kravetz
2021-05-12 2:35 ` Peter Xu
2021-05-12 3:06 ` Mike Kravetz
[not found] ` <20210512065813.89270-1-almasrymina@google.com>
2021-05-12 7:44 ` [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY Mina Almasry
[not found] ` <CAJHvVch0ZMapPVEc0Ge5V4KDgNDNhECbqwDi0y9XxsxFXQZ-gg@mail.gmail.com>
[not found] ` <c455d241-11f6-95a6-eb29-0ddd94eedbd7@oracle.com>
2021-05-12 19:42 ` Mina Almasry
2021-05-12 20:14 ` Peter Xu
2021-05-12 21:31 ` Mike Kravetz
2021-05-12 21:52 ` Mina Almasry
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAHS8izOz_ogEjLDQac8iR5O+w1MBanoiJzJ9XaikUzurp+dPXQ@mail.gmail.com \
--to=almasrymina@google.com \
--cc=aarcange@redhat.com \
--cc=axelrasmussen@google.com \
--cc=linux-mm@kvack.org \
--cc=mike.kravetz@oracle.com \
--cc=peterx@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox