From: Mina Almasry <almasrymina@google.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>,
	Peter Xu <peterx@redhat.com>,  Linux-MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	 open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
Date: Thu, 20 May 2021 13:31:48 -0700
Message-ID: <CAHS8izOHs0ZYR-0FCsDS6nxYbnNfYJFzWKGwB_utBJ_9bswjkA@mail.gmail.com>
In-Reply-To: <61c78897-27a2-768b-f4fe-04e24b617ab6@oracle.com>

[-- Attachment #1: Type: text/plain, Size: 3317 bytes --]

On Thu, May 20, 2021 at 1:00 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/20/21 12:21 PM, Mina Almasry wrote:
> > On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>>
> >>> How about this approach?
> >>> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
> >>>   that you added.  That will catch the race where the page was added to
> >>>   the cache before entering the routine.
> >>> - With the above check in place, we only need to worry about the case
> >>>   where copy_huge_page_from_user fails and we must drop locks.  In this
> >>>   case we:
> >>>   - Free the page previously allocated.
> >>>   - Allocate a 'temporary' huge page without consuming reserves.  I'm
> >>>     thinking of something similar to page migration.
> >>>   - Drop the locks and let the copy_huge_page_from_user be done to the
> >>>     temporary page.
> >>>   - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
> >>>     *pagep case) we need to once again check
> >>>     hugetlbfs_pagecache_present.
> >>>   - We then try to allocate the huge page which will consume the
> >>>     reserve.  If successful, copy contents of temporary page to newly
> >>>     allocated page.  Free temporary page.
> >>>
> >>> There may be issues with this, and I have not given it deep thought.  It
> >>> does abuse the temporary huge page concept, but perhaps no more than
> >>> page migration.  Things do slow down if the extra page allocation and
> >>> copy is required, but that would only be the case if copy_huge_page_from_user
> >>> needs to be done without locks.  Not sure, but hoping that is rare.
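
(For reference, here is how I read the proposed flow -- a rough
pseudocode sketch of my own, not the actual patch; the function names
match mm/hugetlb.c, but locking and most error handling are elided:)

        if (!*pagep) {
                /* Catch the race where the page was added to the cache
                 * before entering the routine.
                 */
                if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr))
                        return -EEXIST;

                page = alloc_huge_page(dst_vma, dst_addr, 0); /* consumes a reserve */
                ret = copy_huge_page_from_user(page,
                                               (const void __user *)src_addr,
                                               pages_per_huge_page(h), false);
                if (ret) {
                        put_page(page); /* gives the reserve back */
                        /* 'temporary' page, allocated outside the reserves */
                        *pagep = alloc_migrate_huge_page(h, gfp_mask, node,
                                                         nodemask);
                        return -ENOENT; /* caller redoes the copy unlocked */
                }
        } else {
                /* Re-entry after dropping locks: *pagep holds the data. */
                if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr))
                        return -EEXIST;
                page = alloc_huge_page(dst_vma, dst_addr, 0); /* consume reserve now */
                /* ... copy *pagep into page, free the temporary page ... */
                put_page(*pagep);
                *pagep = NULL;
        }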
> >>
> >> Just following up on this a bit: I've implemented this approach
> >> locally, and with it the test passes as-is. However, when I hack the
> >> code so that the copy in hugetlb_mcopy_atomic_pte() always fails, I
> >> run into the following edge case, which causes resv_huge_pages to
> >> underflow again (this time permanently):
> >>
> >> - hugetlb_no_page() is called on an index; a page is allocated and
> >> inserted into the cache, consuming the reservation.
> >> - remove_huge_page() is called on this index and the page is removed
> >> from the cache.
> >> - hugetlb_mcopy_atomic_pte() is called on this index; we do not find
> >> the page in the cache, so we trigger this code path, and the copy
> >> fails.
> >> - The allocations in this code path seem to double-consume the
> >> reservation, and resv_huge_pages underflows.
> >>
> >> I'm looking at this edge case to understand why a prior
> >> remove_huge_page() causes my code to underflow resv_huge_pages.
> >>
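
To spell out that sequence as a pseudo-trace (my annotations on the
reserve accounting, which is exactly the part I'm still tracking down):

        hugetlb_no_page(idx)
                alloc_huge_page()               /* resv_huge_pages-- */
                huge_add_to_page_cache()        /* page enters the cache */
        remove_huge_page(idx)                   /* page leaves the cache */
        hugetlb_mcopy_atomic_pte(idx)
                hugetlbfs_pagecache_present()   /* false, page not in cache */
                alloc_huge_page()               /* seems to consume a reserve again */
                /* copy fails -> fallback allocation; resv_huge_pages underflows */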
> >
> > I should also mention that without a prior remove_huge_page() this
> > code path works fine, so it seems the fact that the reservation was
> > consumed earlier is what causes trouble, but I'm not sure why yet.
> >
>
> Hi Mina,
>
> How about quickly posting the code?  I may be able to provide more
> suggestions if I can see the actual code.

Sure thing, I've attached my patch so far. It's quite messy, with
printk()s everywhere and VM_BUG_ON()s in error paths that I'm not
handling yet. I've also hacked the code so that the copy in
hugetlb_mcopy_atomic_pte() always fails, so that I always exercise that
code path.
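
For context on how the -ENOENT fallback is driven: the caller in
mm/userfaultfd.c (__mcopy_atomic_hugetlb) drops mmap_lock, redoes the
copy with page faults allowed, and then retries. Roughly this
(simplified from memory, not part of the attached patch):

        err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
                                       dst_addr, src_addr, mode, &page);
        ...
        if (unlikely(err == -ENOENT)) {
                mmap_read_unlock(dst_mm);
                BUG_ON(!page);

                err = copy_huge_page_from_user(page,
                                               (const void __user *)src_addr,
                                               pages_per_huge_page(h), true);
                if (unlikely(err)) {
                        err = -EFAULT;
                        goto out;
                }
                mmap_read_lock(dst_mm);
                /* revalidate dst_vma, then retry the whole operation */
                goto retry;
        }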

> --
> Mike Kravetz

[-- Attachment #2: 0001-mm-hugetlb-fix-resv_huge_pages-underflow-on-UFFDIO_C.patch --]
[-- Type: application/octet-stream, Size: 8530 bytes --]

From ff86c690f9236d8ba74cb2eb6f529b6354a96223 Mon Sep 17 00:00:00 2001
From: Mina Almasry <almasrymina@google.com>
Date: Tue, 11 May 2021 17:01:50 -0700
Subject: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY

The test passes as-is.

With the copy forced to fail, the test fails: resv_huge_pages
underflows permanently. It seems pages are added to the cache, removed,
and then not found in the cache, double-consuming the reservation.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org

Change-Id: Ida20f35fdfa7fce598582dcfa199845115eaac18
---
 fs/hugetlbfs/inode.c |   6 +++
 mm/hugetlb.c         | 124 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a2a42335e8fd..ba0b3fe88f18 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -525,6 +525,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * to be adjusted.
 			 */
 			VM_BUG_ON(PagePrivate(page));
+			printk("%s:%d:%s removing huge page, idx=%d h->resv_huge_pages=%d\n",
+			       __FILE__, __LINE__, __func__, index,
+			       h->resv_huge_pages);
 			remove_huge_page(page);
 			freed++;
 			if (!truncate_op) {
@@ -532,6 +535,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 							index, index + 1, 1)))
 					hugetlb_fix_reserve_counts(inode);
 			}
+			printk("%s:%d:%s removed huge page, idx=%d h->resv_huge_pages=%d\n",
+			       __FILE__, __LINE__, __func__, index,
+			       h->resv_huge_pages);
 
 			unlock_page(page);
 			if (!truncate_op)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..2d9b557cbafa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,6 +93,28 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
 	return true;
 }
 
+/*
+ * Gigantic pages are so large that we do not guarantee that page++ pointer
+ * arithmetic will work across the entire page.  We need something more
+ * specialized.
+ */
+static void __copy_gigantic_page(struct page *dst, struct page *src,
+				 int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages;) {
+		cond_resched();
+		copy_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
 static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
 						unsigned long irq_flags)
 {
@@ -1165,6 +1187,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
 	if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
 		SetHPageRestoreReserve(page);
+		WARN_ON_ONCE(!h->resv_huge_pages);
 		h->resv_huge_pages--;
 	}
 
@@ -2405,6 +2428,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 	bool deferred_reserve;
+	pgoff_t pageidx = vma_hugecache_offset(h, vma, addr);
 
 	idx = hstate_index(h);
 	/*
@@ -2508,6 +2532,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
 					pages_per_huge_page(h), page);
 	}
+	printk("%s:%d:%s allocated page idx=%d, deferred_reserve=%d, avoid_reserve=%d\n",
+	       __FILE__, __LINE__, __func__, pageidx, deferred_reserve,
+	       avoid_reserve);
 	return page;
 
 out_uncharge_cgroup:
@@ -4571,6 +4598,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			if (huge_pte_none(huge_ptep_get(ptep)))
 				ret = vmf_error(PTR_ERR(page));
 			spin_unlock(ptl);
+			printk("%s:%d:%s failed \n", __FILE__, __LINE__,
+			       __func__);
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
@@ -4578,8 +4607,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		new_page = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
+			printk("%s:%d:%s adding page to cache idx=%d mapping=%px\n",
+			       __FILE__, __LINE__, __func__, idx, mapping);
 			int err = huge_add_to_page_cache(page, mapping, idx);
 			if (err) {
+				printk("%s:%d:%s failed adding page to cache idx=%d\n",
+				       __FILE__, __LINE__, __func__, idx);
 				put_page(page);
 				if (err == -EEXIST)
 					goto retry;
@@ -4868,44 +4901,102 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
-	struct address_space *mapping;
-	pgoff_t idx;
+	struct hstate *h = hstate_vma(dst_vma);
+	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
-	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
-	int ret;
+	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-
-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+	unsigned long resv_temp = 0;
 
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, idx);
-		if (!page)
+		if (!page) {
+			ret = -ENOMEM;
 			goto out;
+		}
 	} else if (!*pagep) {
-		ret = -ENOMEM;
+		/* If a page already exists, then it's UFFDIO_COPY for
+		 * a non-missing case. Return -EEXIST.
+		 */
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			ret = -EEXIST;
+			printk("%s:%d:%s found in cache. idx=%d mapping=%px\n",
+			       __FILE__, __LINE__, __func__, idx,
+			       dst_vma->vm_file->f_mapping);
+			goto out;
+		}
+
+		printk("%s:%d:%s not found in cache. Allocating consuming reservation idx=%d mapping=%px\n",
+		       __FILE__, __LINE__, __func__, idx,
+		       dst_vma->vm_file->f_mapping);
 		page = alloc_huge_page(dst_vma, dst_addr, 0);
-		if (IS_ERR(page))
+		BUG_ON(IS_ERR(page));
+		if (IS_ERR(page)) {
+			printk("%s:%d:%s page allocation failed\n", __FILE__,
+			       __LINE__, __func__);
 			goto out;
+		}
 
+#if 0
 		ret = copy_huge_page_from_user(page,
-						(const void __user *) src_addr,
-						pages_per_huge_page(h), false);
+					       (const void __user *)src_addr,
+					       pages_per_huge_page(h), false);
+#else
+		ret = -ENOENT;
+#endif
 
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
 			ret = -ENOENT;
+			printk("%s:%d:%s copy failed, freeing page idx=%d\n",
+			       __FILE__, __LINE__, __func__, idx);
+			resv_temp = h->resv_huge_pages;
+			put_page(page);
+
+			/* Reallocate the page outside the reserves. */
+			struct mempolicy *mpol;
+			nodemask_t *nodemask;
+			gfp_t gfp_mask = htlb_alloc_mask(h);
+			int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol,
+					     &nodemask);
+			resv_temp = h->resv_huge_pages;
+			page = alloc_migrate_huge_page(h, gfp_mask, node,
+						       nodemask);
+			VM_BUG_ON(h->resv_huge_pages != resv_temp);
+			if (IS_ERR(page)) {
+				VM_BUG_ON(true);
+				ret = -ENOMEM;
+				printk("%s:%d:%s failed allocating migrate_huge_page\n",
+				       __FILE__, __LINE__, __func__);
+				goto out;
+			}
 			*pagep = page;
 			/* don't free the page */
 			goto out;
 		}
 	} else {
-		page = *pagep;
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			printk("%s:%d:%s found huge_page in cache idx=%d mapping=%px\n",
+			       __FILE__, __LINE__, __func__, idx,
+			       dst_vma->vm_file->f_mapping);
+			put_page(*pagep);
+			ret = -EEXIST;
+			goto out;
+		}
+
+		printk("%s:%d:%s not found in cache, allocating consuming reservation idx=%d mapping=%px\n",
+		       __FILE__, __LINE__, __func__, idx,
+		       dst_vma->vm_file->f_mapping);
+		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		VM_BUG_ON(IS_ERR(page));
+		__copy_gigantic_page(page, *pagep, pages_per_huge_page(h));
+		put_page(*pagep);
 		*pagep = NULL;
 	}
 
@@ -4929,9 +5020,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		 * hugetlb_fault_mutex_table that here must be hold by
 		 * the caller.
 		 */
+		printk("%s:%d:%s adding page to cache idx=%d mapping=%px\n",
+		       __FILE__, __LINE__, __func__, idx, mapping);
 		ret = huge_add_to_page_cache(page, mapping, idx);
-		if (ret)
+		if (ret) {
+			printk("%s:%d:%s failed adding to cache\n", __FILE__,
+			       __LINE__, __func__);
 			goto out_release_nounlock;
+		}
 	}
 
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
-- 
2.31.1.818.g46aad6cb9e-goog


