From: Mina Almasry <almasrymina@google.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>,
Peter Xu <peterx@redhat.com>, Linux-MM <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
Date: Thu, 20 May 2021 13:31:48 -0700 [thread overview]
Message-ID: <CAHS8izOHs0ZYR-0FCsDS6nxYbnNfYJFzWKGwB_utBJ_9bswjkA@mail.gmail.com> (raw)
In-Reply-To: <61c78897-27a2-768b-f4fe-04e24b617ab6@oracle.com>
[-- Attachment #1: Type: text/plain, Size: 3317 bytes --]
On Thu, May 20, 2021 at 1:00 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/20/21 12:21 PM, Mina Almasry wrote:
> > On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>>
> >>> How about this approach?
> >>> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
> >>> that you added. That will catch the race where the page was added to
> >>> the cache before entering the routine.
> >>> - With the above check in place, we only need to worry about the case
> >>> where copy_huge_page_from_user fails and we must drop locks. In this
> >>> case we:
> >>> - Free the page previously allocated.
> >>> - Allocate a 'temporary' huge page without consuming reserves. I'm
> >>> thinking of something similar to page migration.
> >>> - Drop the locks and let the copy_huge_page_from_user be done to the
> >>> temporary page.
> >>> - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
> >>> *pagep case) we need to once again check
> >>> hugetlbfs_pagecache_present.
> >>> - We then try to allocate the huge page which will consume the
> >>> reserve. If successful, copy contents of temporary page to newly
> >>> allocated page. Free temporary page.
> >>>
> >>> There may be issues with this, and I have not given it deep thought. It
> >>> does abuse the temporary huge page concept, but perhaps no more than
> >>> page migration. Things do slow down if the extra page allocation and
> >>> copy is required, but that would only be the case if copy_huge_page_from_user
> >>> needs to be done without locks. Not sure, but hoping that is rare.
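For my own reference, my reading of that flow is roughly the sketch
below. It's simplified, with error handling elided; copy_huge_page()
is a stand-in for whatever copy helper we end up using, and gfp_mask/
node/nodemask are assumed to be computed as usual for this vma:

	if (!*pagep) {
		/* Catch the page racing into the cache before entry. */
		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr))
			return -EEXIST;

		/* This allocation consumes the reservation. */
		page = alloc_huge_page(dst_vma, dst_addr, 0);
		ret = copy_huge_page_from_user(page,
				(const void __user *)src_addr,
				pages_per_huge_page(h), false);
		if (ret) {
			/* Copy must be retried without the locks held. */
			put_page(page);
			/* 'Temporary' page; does not touch reserves. */
			*pagep = alloc_migrate_huge_page(h, gfp_mask,
							 node, nodemask);
			return -ENOENT;	/* caller copies, then re-calls */
		}
	} else {
		/* Re-entry after the unlocked copy into the temp page. */
		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr))
			return -EEXIST;

		/* Now consume the reservation and move the contents. */
		page = alloc_huge_page(dst_vma, dst_addr, 0);
		copy_huge_page(page, *pagep);
		put_page(*pagep);	/* free the temporary page */
		*pagep = NULL;
	}
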
> >>
> >> Just following up on this a bit: I've implemented this approach
> >> locally, and with it the test passes as-is. When I hack the code such
> >> that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into
> >> this edge case, which causes resv_huge_pages to underflow again (this
> >> time permanently):
> >>
> >> - hugetlb_no_page() is called on an index and a page is allocated and
> >> inserted into the cache, consuming the reservation.
> >> - remove_huge_page() is called on this index and the page is removed from cache.
> >> - hugetlb_mcopy_atomic_pte() is called on this index, we do not find
> >> the page in the cache and we trigger this code path and the copy
> >> fails.
> >> - The allocations in this code path seem to double-consume the
> >> reservation and resv_huge_pages underflows.
> >>
> >> I'm looking at this edge case to understand why a prior
> >> remove_huge_page() causes my code to underflow resv_huge_pages.
> >>
> >
> > I should also mention, without a prior remove_huge_page() this code
> > path works fine, so it seems the fact that the reservation was
> > consumed earlier is what causes the trouble, but I'm not sure why yet.
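To restate the sequence I'm chasing, with where I think the
reservation count moves (question marks where I'm unsure):

  1. hugetlb_no_page():          alloc_huge_page() consumes the reserve
                                 (resv_huge_pages--); page added to cache
  2. remove_huge_page():         page dropped from the cache; reserve
                                 map / counts adjusted back(?)
  3. hugetlb_mcopy_atomic_pte(): page not found in the cache, so
                                 alloc_huge_page() consumes a reserve again
  4. the copy fails, put_page(), reallocate and retry
  5. somewhere in 3-4 the reserve is decremented twice and
     resv_huge_pages underflows
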
> >
>
> Hi Mina,
>
> How about quickly posting the code? I may be able to provide more
> suggestions if I can see the actual code.
Sure thing, I've attached my patch so far. It's quite messy, with prints
everywhere and VM_BUG_ON() in error paths that I'm not handling yet.
I've also hacked the code so that the hugetlb_mcopy_atomic_pte() copy
always fails, so that I exercise that code path.
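Concretely, the hack is just compiling out the in-kernel copy and
pretending it faulted, so the fallback path always runs:

#if 0
	ret = copy_huge_page_from_user(page,
				       (const void __user *)src_addr,
				       pages_per_huge_page(h), false);
#else
	/* Force the "copy failed" fallback for testing. */
	ret = -ENOENT;
#endif
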
> --
> Mike Kravetz
[-- Attachment #2: 0001-mm-hugetlb-fix-resv_huge_pages-underflow-on-UFFDIO_C.patch --]
[-- Type: application/octet-stream, Size: 8530 bytes --]
From ff86c690f9236d8ba74cb2eb6f529b6354a96223 Mon Sep 17 00:00:00 2001
From: Mina Almasry <almasrymina@google.com>
Date: Tue, 11 May 2021 17:01:50 -0700
Subject: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
The test passes as-is.
With the copy forced to fail, the test fails and resv_huge_pages
underflows permanently. It seems pages are added, removed, then not
found in the cache, double-consuming the reservation.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Change-Id: Ida20f35fdfa7fce598582dcfa199845115eaac18
---
fs/hugetlbfs/inode.c | 6 +++
mm/hugetlb.c | 124 ++++++++++++++++++++++++++++++++++++++-----
2 files changed, 116 insertions(+), 14 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a2a42335e8fd..ba0b3fe88f18 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -525,6 +525,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
* to be adjusted.
*/
VM_BUG_ON(PagePrivate(page));
+ printk("%s:%d:%s removing huge page, idx=%d h->resv_huge_pages=%d\n",
+ __FILE__, __LINE__, __func__, index,
+ h->resv_huge_pages);
remove_huge_page(page);
freed++;
if (!truncate_op) {
@@ -532,6 +535,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
index, index + 1, 1)))
hugetlb_fix_reserve_counts(inode);
}
+ printk("%s:%d:%s removed huge page, idx=%d h->resv_huge_pages=%d\n",
+ __FILE__, __LINE__, __func__, index,
+ h->resv_huge_pages);
unlock_page(page);
if (!truncate_op)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..2d9b557cbafa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,6 +93,28 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
return true;
}
+/*
+ * Gigantic pages are so large that we do not guarantee that page++ pointer
+ * arithmetic will work across the entire page. We need something more
+ * specialized.
+ */
+static void __copy_gigantic_page(struct page *dst, struct page *src,
+ int nr_pages)
+{
+ int i;
+ struct page *dst_base = dst;
+ struct page *src_base = src;
+
+ for (i = 0; i < nr_pages;) {
+ cond_resched();
+ copy_highpage(dst, src);
+
+ i++;
+ dst = mem_map_next(dst, dst_base, i);
+ src = mem_map_next(src, src_base, i);
+ }
+}
+
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
unsigned long irq_flags)
{
@@ -1165,6 +1187,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
SetHPageRestoreReserve(page);
+ WARN_ON_ONCE(!h->resv_huge_pages);
h->resv_huge_pages--;
}
@@ -2405,6 +2428,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
int ret, idx;
struct hugetlb_cgroup *h_cg;
bool deferred_reserve;
+ pgoff_t pageidx = vma_hugecache_offset(h, vma, addr);
idx = hstate_index(h);
/*
@@ -2508,6 +2532,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
pages_per_huge_page(h), page);
}
+ printk("%s:%d:%s allocated page idx=%d, deferred_reserve=%d, avoid_reserve=%d\n",
+ __FILE__, __LINE__, __func__, pageidx, deferred_reserve,
+ avoid_reserve);
return page;
out_uncharge_cgroup:
@@ -4571,6 +4598,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (huge_pte_none(huge_ptep_get(ptep)))
ret = vmf_error(PTR_ERR(page));
spin_unlock(ptl);
+ printk("%s:%d:%s failed \n", __FILE__, __LINE__,
+ __func__);
goto out;
}
clear_huge_page(page, address, pages_per_huge_page(h));
@@ -4578,8 +4607,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
new_page = true;
if (vma->vm_flags & VM_MAYSHARE) {
+ printk("%s:%d:%s adding page to cache idx=%d mapping=%px\n",
+ __FILE__, __LINE__, __func__, idx, mapping);
int err = huge_add_to_page_cache(page, mapping, idx);
if (err) {
+ printk("%s:%d:%s failed adding page to cache idx=%d\n",
+ __FILE__, __LINE__, __func__, idx);
put_page(page);
if (err == -EEXIST)
goto retry;
@@ -4868,44 +4901,102 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
struct page **pagep)
{
bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
- struct address_space *mapping;
- pgoff_t idx;
+ struct hstate *h = hstate_vma(dst_vma);
+ struct address_space *mapping = dst_vma->vm_file->f_mapping;
+ pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
unsigned long size;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
- struct hstate *h = hstate_vma(dst_vma);
pte_t _dst_pte;
spinlock_t *ptl;
- int ret;
+ int ret = -ENOMEM;
struct page *page;
int writable;
-
- mapping = dst_vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+ unsigned long resv_temp = 0;
if (is_continue) {
ret = -EFAULT;
page = find_lock_page(mapping, idx);
- if (!page)
+ if (!page) {
+ ret = -ENOMEM;
goto out;
+ }
} else if (!*pagep) {
- ret = -ENOMEM;
+ /* If a page already exists, then it's UFFDIO_COPY for
+ * a non-missing case. Return -EEXIST.
+ */
+ if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+ ret = -EEXIST;
+ printk("%s:%d:%s found in cache. idx=%d mapping=%px\n",
+ __FILE__, __LINE__, __func__, idx,
+ dst_vma->vm_file->f_mapping);
+ goto out;
+ }
+
+ printk("%s:%d:%s not found in cache. Allocating consuming reservation idx=%d mapping=%px\n",
+ __FILE__, __LINE__, __func__, idx,
+ dst_vma->vm_file->f_mapping);
page = alloc_huge_page(dst_vma, dst_addr, 0);
- if (IS_ERR(page))
+ BUG_ON(IS_ERR(page));
+ if (IS_ERR(page)) {
+ printk("%s:%d:%s page allocation failed\n", __FILE__,
+ __LINE__, __func__);
goto out;
+ }
+#if 0
ret = copy_huge_page_from_user(page,
- (const void __user *) src_addr,
- pages_per_huge_page(h), false);
+ (const void __user *)src_addr,
+ pages_per_huge_page(h), false);
+#else
+ ret = -ENOENT;
+#endif
/* fallback to copy_from_user outside mmap_lock */
if (unlikely(ret)) {
ret = -ENOENT;
+ printk("%s:%d:%s copy failed, freeing page idx=%d\n",
+ __FILE__, __LINE__, __func__, idx);
+ resv_temp = h->resv_huge_pages;
+ put_page(page);
+
+ /* Reallocate the page outside the reserves. */
+ struct mempolicy *mpol;
+ nodemask_t *nodemask;
+ gfp_t gfp_mask = htlb_alloc_mask(h);
+ int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol,
+ &nodemask);
+ resv_temp = h->resv_huge_pages;
+ page = alloc_migrate_huge_page(h, gfp_mask, node,
+ nodemask);
+ VM_BUG_ON(h->resv_huge_pages != resv_temp);
+ if (IS_ERR(page)) {
+ VM_BUG_ON(true);
+ ret = -ENOMEM;
+ printk("%s:%d:%s failed allocating migrate_huge_page\n",
+ __FILE__, __LINE__, __func__);
+ goto out;
+ }
*pagep = page;
/* don't free the page */
goto out;
}
} else {
- page = *pagep;
+ if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+ printk("%s:%d:%s found huge_page in cache idx=%d mapping=%px\n",
+ __FILE__, __LINE__, __func__, idx,
+ dst_vma->vm_file->f_mapping);
+ put_page(*pagep);
+ ret = -EEXIST;
+ goto out;
+ }
+
+ printk("%s:%d:%s not found in cache, allocating consuming reservation idx=%d mapping=%px\n",
+ __FILE__, __LINE__, __func__, idx,
+ dst_vma->vm_file->f_mapping);
+ page = alloc_huge_page(dst_vma, dst_addr, 0);
+ VM_BUG_ON(IS_ERR(page));
+ __copy_gigantic_page(page, *pagep, pages_per_huge_page(h));
+ put_page(*pagep);
*pagep = NULL;
}
@@ -4929,9 +5020,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
* hugetlb_fault_mutex_table that here must be hold by
* the caller.
*/
+ printk("%s:%d:%s adding page to cache idx=%d mapping=%px\n",
+ __FILE__, __LINE__, __func__, idx, mapping);
ret = huge_add_to_page_cache(page, mapping, idx);
- if (ret)
+ if (ret) {
+ printk("%s:%d:%s failed adding to cache\n", __FILE__,
+ __LINE__, __func__);
goto out_release_nounlock;
+ }
}
ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
--
2.31.1.818.g46aad6cb9e-goog