linux-mm.kvack.org archive mirror
* [PATCH V1 0/5] memfd-pin huge page fixes
@ 2024-09-03 14:25 Steve Sistare
  2024-09-03 14:25 ` [PATCH V1 1/5] mm/filemap: fix filemap_get_folios_contig THP panic Steve Sistare
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Steve Sistare @ 2024-09-03 14:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Vivek Kasireddy, Muchun Song, Andrew Morton, Matthew Wilcox,
	Peter Xu, David Hildenbrand, Jason Gunthorpe, Steve Sistare

Fix multiple bugs that occur when using memfd_pin_folios with hugetlb pages
and THP.  The hugetlb bugs only bite when the page is not yet faulted in
when memfd_pin_folios is called.  The THP bug bites when the starting offset
passed to memfd_pin_folios is not huge page aligned.  See the commit messages
for details.

Steve Sistare (5):
  mm/filemap: fix filemap_get_folios_contig THP panic
  mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
  mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
  mm/gup: fix memfd_pin_folios hugetlb page allocation
  mm/gup: fix memfd_pin_folios alloc race panic

 include/linux/hugetlb.h | 10 ++++++++++
 mm/filemap.c            |  4 ++++
 mm/gup.c                |  5 ++++-
 mm/hugetlb.c            | 17 +++++++++++++++++
 mm/memfd.c              | 15 +++++++++------
 5 files changed, 44 insertions(+), 7 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH V1 1/5] mm/filemap: fix filemap_get_folios_contig THP panic
  2024-09-03 14:25 [PATCH V1 0/5] memfd-pin huge page fixes Steve Sistare
@ 2024-09-03 14:25 ` Steve Sistare
  2024-09-03 14:25 ` [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages leak Steve Sistare
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Steve Sistare @ 2024-09-03 14:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Vivek Kasireddy, Muchun Song, Andrew Morton, Matthew Wilcox,
	Peter Xu, David Hildenbrand, Jason Gunthorpe, Steve Sistare

memfd_pin_folios on memory backed by THP panics if the requested start
offset is not huge page aligned:

BUG: kernel NULL pointer dereference, address: 0000000000000036
RIP: 0010:filemap_get_folios_contig+0xdf/0x290
RSP: 0018:ffffc9002092fbe8 EFLAGS: 00010202
RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000000002

The fault occurs here, because xas_load returns a folio with value 2:

    filemap_get_folios_contig()
        for (folio = xas_load(&xas); folio && xas.xa_index <= end;
                        folio = xas_next(&xas)) {
                ...
                if (!folio_try_get(folio))   <-- BOOM

"2" is an xarray sibling entry.  We get it because memfd_pin_folios
does not round the indices passed to filemap_get_folios_contig to huge
page boundaries for THP, so we load from the middle of a huge page range
and see a sibling.  (It does round for hugetlbfs, at the is_file_hugepages
test).

To fix, if the folio is a sibling, then return the next index as
the starting point for the next call to filemap_get_folios_contig.

Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index af99bf9..c385b7a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2183,6 +2183,10 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
 		if (xa_is_value(folio))
 			goto update_start;
 
+		/* If we landed in the middle of a THP, continue at its end. */
+		if (xa_is_sibling(folio))
+			goto update_start;
+
 		if (!folio_try_get(folio))
 			goto retry;
 
-- 
1.8.3.1




* [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
  2024-09-03 14:25 [PATCH V1 0/5] memfd-pin huge page fixes Steve Sistare
  2024-09-03 14:25 ` [PATCH V1 1/5] mm/filemap: fix filemap_get_folios_contig THP panic Steve Sistare
@ 2024-09-03 14:25 ` Steve Sistare
  2024-09-04  0:45   ` Kasireddy, Vivek
  2024-09-03 14:25 ` [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak Steve Sistare
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Steve Sistare @ 2024-09-03 14:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Vivek Kasireddy, Muchun Song, Andrew Morton, Matthew Wilcox,
	Peter Xu, David Hildenbrand, Jason Gunthorpe, Steve Sistare

memfd_pin_folios followed by unpin_folios fails to restore
free_huge_pages if the pages were not already faulted in, because the
folio refcount for pages created by memfd_alloc_folio never goes to 0.
memfd_pin_folios needs another folio_put to undo the folio_try_get
below:

memfd_alloc_folio()
  alloc_hugetlb_folio_nodemask()
    dequeue_hugetlb_folio_nodemask()
      dequeue_hugetlb_folio_node_exact()
        folio_ref_unfreeze(folio, 1);    ; adds 1 refcount
  folio_try_get()                        ; adds 1 refcount
  hugetlb_add_to_page_cache()            ; adds 512 refcount (on x86)

With the fix, after memfd_pin_folios + unpin_folios, the refcount for
the (unfaulted) page is 512, which is correct, as the refcount for a
faulted unpinned page is 513.

Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 mm/gup.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 54d0dc3..5b92f1d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3618,7 +3618,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 	pgoff_t start_idx, end_idx, next_idx;
 	struct folio *folio = NULL;
 	struct folio_batch fbatch;
-	struct hstate *h;
+	struct hstate *h = NULL;
 	long ret = -EINVAL;
 
 	if (start < 0 || start > end || !max_folios)
@@ -3662,6 +3662,8 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 							     &fbatch);
 			if (folio) {
 				folio_put(folio);
+				if (h)
+					folio_put(folio);
 				folio = NULL;
 			}
 
-- 
1.8.3.1




* [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
  2024-09-03 14:25 [PATCH V1 0/5] memfd-pin huge page fixes Steve Sistare
  2024-09-03 14:25 ` [PATCH V1 1/5] mm/filemap: fix filemap_get_folios_contig THP panic Steve Sistare
  2024-09-03 14:25 ` [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages leak Steve Sistare
@ 2024-09-03 14:25 ` Steve Sistare
  2024-09-04  1:04   ` Kasireddy, Vivek
  2024-09-03 14:25 ` [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page allocation Steve Sistare
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Steve Sistare @ 2024-09-03 14:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Vivek Kasireddy, Muchun Song, Andrew Morton, Matthew Wilcox,
	Peter Xu, David Hildenbrand, Jason Gunthorpe, Steve Sistare

memfd_pin_folios followed by unpin_folios leaves resv_huge_pages elevated
if the pages were not already faulted in.  During a normal page fault,
resv_huge_pages is consumed here:

hugetlb_fault()
  alloc_hugetlb_folio()
    dequeue_hugetlb_folio_vma()
      dequeue_hugetlb_folio_nodemask()
        dequeue_hugetlb_folio_node_exact()
          free_huge_pages--
      resv_huge_pages--

During memfd_pin_folios, the page is created by calling
alloc_hugetlb_folio_nodemask instead of alloc_hugetlb_folio, and
resv_huge_pages is not modified:

memfd_alloc_folio()
  alloc_hugetlb_folio_nodemask()
    dequeue_hugetlb_folio_nodemask()
      dequeue_hugetlb_folio_node_exact()
        free_huge_pages--

alloc_hugetlb_folio_nodemask has other callers that must not modify
resv_huge_pages.  Therefore, to fix, define an alternate version of
alloc_hugetlb_folio_nodemask for this call site that adjusts
resv_huge_pages.

Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/linux/hugetlb.h | 10 ++++++++++
 mm/hugetlb.c            | 17 +++++++++++++++++
 mm/memfd.c              |  9 ++++-----
 3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 45bf05a..3ddd69b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -695,6 +695,9 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask,
 				bool allow_alloc_fallback);
+struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
+					  nodemask_t *nmask, gfp_t gfp_mask);
+
 int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
 			pgoff_t idx);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
@@ -1062,6 +1065,13 @@ static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 }
 
 static inline struct folio *
+alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
+			    nodemask_t *nmask, gfp_t gfp_mask)
+{
+	return NULL;
+}
+
+static inline struct folio *
 alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 			nodemask_t *nmask, gfp_t gfp_mask,
 			bool allow_alloc_fallback)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index aaf508b..c2d44a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2564,6 +2564,23 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
 	return folio;
 }
 
+struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
+		nodemask_t *nmask, gfp_t gfp_mask)
+{
+	struct folio *folio;
+
+	spin_lock_irq(&hugetlb_lock);
+	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
+					       nmask);
+	if (folio) {
+		VM_BUG_ON(!h->resv_huge_pages);
+		h->resv_huge_pages--;
+	}
+
+	spin_unlock_irq(&hugetlb_lock);
+	return folio;
+}
+
 /* folio migration callback function */
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 		nodemask_t *nmask, gfp_t gfp_mask, bool allow_alloc_fallback)
diff --git a/mm/memfd.c b/mm/memfd.c
index e7b7c52..bfe0e71 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -82,11 +82,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 		gfp_mask = htlb_alloc_mask(hstate_file(memfd));
 		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
 
-		folio = alloc_hugetlb_folio_nodemask(hstate_file(memfd),
-						     numa_node_id(),
-						     NULL,
-						     gfp_mask,
-						     false);
+		folio = alloc_hugetlb_folio_reserve(hstate_file(memfd),
+						    numa_node_id(),
+						    NULL,
+						    gfp_mask);
 		if (folio && folio_try_get(folio)) {
 			err = hugetlb_add_to_page_cache(folio,
 							memfd->f_mapping,
-- 
1.8.3.1




* [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page allocation
  2024-09-03 14:25 [PATCH V1 0/5] memfd-pin huge page fixes Steve Sistare
                   ` (2 preceding siblings ...)
  2024-09-03 14:25 ` [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak Steve Sistare
@ 2024-09-03 14:25 ` Steve Sistare
  2024-09-04  1:06   ` Kasireddy, Vivek
  2024-09-03 14:25 ` [PATCH V1 5/5] mm/gup: fix memfd_pin_folios alloc race panic Steve Sistare
  2024-09-04  1:12 ` [PATCH V1 0/5] memfd-pin huge page fixes Kasireddy, Vivek
  5 siblings, 1 reply; 16+ messages in thread
From: Steve Sistare @ 2024-09-03 14:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Vivek Kasireddy, Muchun Song, Andrew Morton, Matthew Wilcox,
	Peter Xu, David Hildenbrand, Jason Gunthorpe, Steve Sistare

When memfd_pin_folios -> memfd_alloc_folio creates a hugetlb page, the
index is wrong.  The subsequent call to filemap_get_folios_contig thus
cannot find it, and fails, and memfd_pin_folios loops forever.
To fix, adjust the index for the huge_page_order.

memfd_alloc_folio also forgets to unlock the folio, so the next touch
of the page calls hugetlb_fault which blocks forever trying to take
the lock.  Unlock it.

Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 mm/memfd.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memfd.c b/mm/memfd.c
index bfe0e71..bcb131d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -79,10 +79,13 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 		 * alloc from. Also, the folio will be pinned for an indefinite
 		 * amount of time, so it is not expected to be migrated away.
 		 */
-		gfp_mask = htlb_alloc_mask(hstate_file(memfd));
+		struct hstate *h = hstate_file(memfd);
+
+		gfp_mask = htlb_alloc_mask(h);
 		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
+		idx >>= huge_page_order(h);
 
-		folio = alloc_hugetlb_folio_reserve(hstate_file(memfd),
+		folio = alloc_hugetlb_folio_reserve(h,
 						    numa_node_id(),
 						    NULL,
 						    gfp_mask);
@@ -95,6 +98,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 				free_huge_folio(folio);
 				return ERR_PTR(err);
 			}
+			folio_unlock(folio);
 			return folio;
 		}
 		return ERR_PTR(-ENOMEM);
-- 
1.8.3.1




* [PATCH V1 5/5] mm/gup: fix memfd_pin_folios alloc race panic
  2024-09-03 14:25 [PATCH V1 0/5] memfd-pin huge page fixes Steve Sistare
                   ` (3 preceding siblings ...)
  2024-09-03 14:25 ` [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page allocation Steve Sistare
@ 2024-09-03 14:25 ` Steve Sistare
  2024-09-04  1:07   ` Kasireddy, Vivek
  2024-09-04  1:12 ` [PATCH V1 0/5] memfd-pin huge page fixes Kasireddy, Vivek
  5 siblings, 1 reply; 16+ messages in thread
From: Steve Sistare @ 2024-09-03 14:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Vivek Kasireddy, Muchun Song, Andrew Morton, Matthew Wilcox,
	Peter Xu, David Hildenbrand, Jason Gunthorpe, Steve Sistare

If memfd_pin_folios tries to create a hugetlb page, but someone else
already did, then folio gets the value -EEXIST here:

        folio = memfd_alloc_folio(memfd, start_idx);
        if (IS_ERR(folio)) {
                ret = PTR_ERR(folio);
                if (ret != -EEXIST)
                        goto err;

then on the next trip through the "while start_idx" loop we panic here:

        if (folio) {
                folio_put(folio);

To fix, set the folio to NULL on error.

Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 mm/gup.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/gup.c b/mm/gup.c
index 5b92f1d..bccabaa 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3705,6 +3705,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
 					ret = PTR_ERR(folio);
 					if (ret != -EEXIST)
 						goto err;
+					folio = NULL;
 				}
 			}
 		}
-- 
1.8.3.1




* RE: [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
  2024-09-03 14:25 ` [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages leak Steve Sistare
@ 2024-09-04  0:45   ` Kasireddy, Vivek
  2024-09-04 14:52     ` Steven Sistare
  0 siblings, 1 reply; 16+ messages in thread
From: Kasireddy, Vivek @ 2024-09-04  0:45 UTC (permalink / raw)
  To: Steve Sistare, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

Hi Steve,

> Subject: [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages
> leak
> 
> memfd_pin_folios followed by unpin_folios fails to restore
> free_huge_pages if the pages were not already faulted in, because the
> folio refcount for pages created by memfd_alloc_folio never goes to 0.
> memfd_pin_folios needs another folio_put to undo the folio_try_get
> below:
> 
> memfd_alloc_folio()
>   alloc_hugetlb_folio_nodemask()
>     dequeue_hugetlb_folio_nodemask()
>       dequeue_hugetlb_folio_node_exact()
>         folio_ref_unfreeze(folio, 1);    ; adds 1 refcount
>   folio_try_get()                        ; adds 1 refcount
I wonder if it is more optimal to skip the folio_try_get() above.
I think it (folio_try_get) was probably added because I didn't realize that
alloc_hugetlb_folio_nodemask() already adds a reference.

>   hugetlb_add_to_page_cache()            ; adds 512 refcount (on x86)
> 
> With the fix, after memfd_pin_folios + unpin_folios, the refcount for
> the (unfaulted) page is 512, which is correct, as the refcount for a
> faulted unpinned page is 513.
> 
> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning
> memfd folios")
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  mm/gup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 54d0dc3..5b92f1d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3618,7 +3618,7 @@ long memfd_pin_folios(struct file *memfd, loff_t
> start, loff_t end,
>  	pgoff_t start_idx, end_idx, next_idx;
>  	struct folio *folio = NULL;
>  	struct folio_batch fbatch;
> -	struct hstate *h;
> +	struct hstate *h = NULL;
>  	long ret = -EINVAL;
> 
>  	if (start < 0 || start > end || !max_folios)
> @@ -3662,6 +3662,8 @@ long memfd_pin_folios(struct file *memfd, loff_t
> start, loff_t end,
>  							     &fbatch);
>  			if (folio) {
>  				folio_put(folio);
> +				if (h)
> +					folio_put(folio);
If we stick with this change, I guess there needs to be an additional
folio_put() in memfd_alloc_folio() as well if adding the folio to the
page cache fails.

Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>

Thanks,
Vivek
>  				folio = NULL;
>  			}
> 
> --
> 1.8.3.1




* RE: [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
  2024-09-03 14:25 ` [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak Steve Sistare
@ 2024-09-04  1:04   ` Kasireddy, Vivek
  2024-09-04 14:52     ` Steven Sistare
  0 siblings, 1 reply; 16+ messages in thread
From: Kasireddy, Vivek @ 2024-09-04  1:04 UTC (permalink / raw)
  To: Steve Sistare, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

Hi Steve,

> Subject: [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages
> leak
> 
> memfd_pin_folios followed by unpin_folios leaves resv_huge_pages
> elevated
> if the pages were not already faulted in.  During a normal page fault,
> resv_huge_pages is consumed here:
> 
> hugetlb_fault()
>   alloc_hugetlb_folio()
>     dequeue_hugetlb_folio_vma()
>       dequeue_hugetlb_folio_nodemask()
>         dequeue_hugetlb_folio_node_exact()
>           free_huge_pages--
>       resv_huge_pages--
> 
> During memfd_pin_folios, the page is created by calling
> alloc_hugetlb_folio_nodemask instead of alloc_hugetlb_folio, and
> resv_huge_pages is not modified:
> 
> memfd_alloc_folio()
>   alloc_hugetlb_folio_nodemask()
>     dequeue_hugetlb_folio_nodemask()
>       dequeue_hugetlb_folio_node_exact()
>         free_huge_pages--
> 
> alloc_hugetlb_folio_nodemask has other callers that must not modify
> resv_huge_pages.  Therefore, to fix, define an alternate version of
> alloc_hugetlb_folio_nodemask for this call site that adjusts
> resv_huge_pages.
> 
> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning
> memfd folios")
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/linux/hugetlb.h | 10 ++++++++++
>  mm/hugetlb.c            | 17 +++++++++++++++++
>  mm/memfd.c              |  9 ++++-----
>  3 files changed, 31 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 45bf05a..3ddd69b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -695,6 +695,9 @@ struct folio *alloc_hugetlb_folio(struct
> vm_area_struct *vma,
>  struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int
> preferred_nid,
>  				nodemask_t *nmask, gfp_t gfp_mask,
>  				bool allow_alloc_fallback);
> +struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> +					  nodemask_t *nmask, gfp_t
> gfp_mask);
> +
>  int hugetlb_add_to_page_cache(struct folio *folio, struct address_space
> *mapping,
>  			pgoff_t idx);
>  void restore_reserve_on_error(struct hstate *h, struct vm_area_struct
> *vma,
> @@ -1062,6 +1065,13 @@ static inline struct folio
> *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  }
> 
>  static inline struct folio *
> +alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> +			    nodemask_t *nmask, gfp_t gfp_mask)
> +{
> +	return NULL;
> +}
> +
> +static inline struct folio *
>  alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>  			nodemask_t *nmask, gfp_t gfp_mask,
>  			bool allow_alloc_fallback)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index aaf508b..c2d44a1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2564,6 +2564,23 @@ struct folio
> *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
>  	return folio;
>  }
> 
> +struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> +		nodemask_t *nmask, gfp_t gfp_mask)
> +{
> +	struct folio *folio;
> +
> +	spin_lock_irq(&hugetlb_lock);
> +	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
> preferred_nid,
I am assuming a check for available_huge_pages(h) before calling dequeue
would be redundant as it would simply return NULL if no huge pages are
available?

Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>

Thanks,
Vivek

> +					       nmask);
> +	if (folio) {
> +		VM_BUG_ON(!h->resv_huge_pages);
> +		h->resv_huge_pages--;
> +	}
> +
> +	spin_unlock_irq(&hugetlb_lock);
> +	return folio;
> +}
> +
>  /* folio migration callback function */
>  struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int
> preferred_nid,
>  		nodemask_t *nmask, gfp_t gfp_mask, bool
> allow_alloc_fallback)
> diff --git a/mm/memfd.c b/mm/memfd.c
> index e7b7c52..bfe0e71 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -82,11 +82,10 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
>  		gfp_mask = htlb_alloc_mask(hstate_file(memfd));
>  		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> 
> -		folio = alloc_hugetlb_folio_nodemask(hstate_file(memfd),
> -						     numa_node_id(),
> -						     NULL,
> -						     gfp_mask,
> -						     false);
> +		folio = alloc_hugetlb_folio_reserve(hstate_file(memfd),
> +						    numa_node_id(),
> +						    NULL,
> +						    gfp_mask);
>  		if (folio && folio_try_get(folio)) {
>  			err = hugetlb_add_to_page_cache(folio,
>  							memfd->f_mapping,
> --
> 1.8.3.1




* RE: [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page allocation
  2024-09-03 14:25 ` [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page allocation Steve Sistare
@ 2024-09-04  1:06   ` Kasireddy, Vivek
  2024-09-04 14:51     ` Steven Sistare
  0 siblings, 1 reply; 16+ messages in thread
From: Kasireddy, Vivek @ 2024-09-04  1:06 UTC (permalink / raw)
  To: Steve Sistare, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

Hi Steve,

> Subject: [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page
> allocation
> 
> When memfd_pin_folios -> memfd_alloc_folio creates a hugetlb page, the
> index is wrong.  The subsequent call to filemap_get_folios_contig thus
> cannot find it, and fails, and memfd_pin_folios loops forever.
> To fix, adjust the index for the huge_page_order.
> 
> memfd_alloc_folio also forgets to unlock the folio, so the next touch
> of the page calls hugetlb_fault which blocks forever trying to take
> the lock.  Unlock it.
Where exactly is the lock taken? I did a quick search but couldn't
immediately figure out in which function the lock is taken while
allocating the folio.

> 
> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning
> memfd folios")
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  mm/memfd.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memfd.c b/mm/memfd.c
> index bfe0e71..bcb131d 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -79,10 +79,13 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
>  		 * alloc from. Also, the folio will be pinned for an indefinite
>  		 * amount of time, so it is not expected to be migrated away.
>  		 */
> -		gfp_mask = htlb_alloc_mask(hstate_file(memfd));
> +		struct hstate *h = hstate_file(memfd);
> +
> +		gfp_mask = htlb_alloc_mask(h);
>  		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> +		idx >>= huge_page_order(h);
> 
> -		folio = alloc_hugetlb_folio_reserve(hstate_file(memfd),
> +		folio = alloc_hugetlb_folio_reserve(h,
>  						    numa_node_id(),
>  						    NULL,
>  						    gfp_mask);
> @@ -95,6 +98,7 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
>  				free_huge_folio(folio);
>  				return ERR_PTR(err);
>  			}
> +			folio_unlock(folio);
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>

Thanks,
Vivek

>  			return folio;
>  		}
>  		return ERR_PTR(-ENOMEM);
> --
> 1.8.3.1




* RE: [PATCH V1 5/5] mm/gup: fix memfd_pin_folios alloc race panic
  2024-09-03 14:25 ` [PATCH V1 5/5] mm/gup: fix memfd_pin_folios alloc race panic Steve Sistare
@ 2024-09-04  1:07   ` Kasireddy, Vivek
  0 siblings, 0 replies; 16+ messages in thread
From: Kasireddy, Vivek @ 2024-09-04  1:07 UTC (permalink / raw)
  To: Steve Sistare, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

> Subject: [PATCH V1 5/5] mm/gup: fix memfd_pin_folios alloc race panic
> 
> If memfd_pin_folios tries to create a hugetlb page, but someone else
> already did, then folio gets the value -EEXIST here:
> 
>         folio = memfd_alloc_folio(memfd, start_idx);
>         if (IS_ERR(folio)) {
>                 ret = PTR_ERR(folio);
>                 if (ret != -EEXIST)
>                         goto err;
> 
> then on the next trip through the "while start_idx" loop we panic here:
> 
>         if (folio) {
>                 folio_put(folio);
> 
> To fix, set the folio to NULL on error.
> 
> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning
> memfd folios")
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  mm/gup.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 5b92f1d..bccabaa 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3705,6 +3705,7 @@ long memfd_pin_folios(struct file *memfd, loff_t
> start, loff_t end,
>  					ret = PTR_ERR(folio);
>  					if (ret != -EEXIST)
>  						goto err;
> +					folio = NULL;
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>

Thanks,
Vivek
>  				}
>  			}
>  		}
> --
> 1.8.3.1




* RE: [PATCH V1 0/5] memfd-pin huge page fixes
  2024-09-03 14:25 [PATCH V1 0/5] memfd-pin huge page fixes Steve Sistare
                   ` (4 preceding siblings ...)
  2024-09-03 14:25 ` [PATCH V1 5/5] mm/gup: fix memfd_pin_folios alloc race panic Steve Sistare
@ 2024-09-04  1:12 ` Kasireddy, Vivek
  2024-09-04 14:51   ` Steven Sistare
  5 siblings, 1 reply; 16+ messages in thread
From: Kasireddy, Vivek @ 2024-09-04  1:12 UTC (permalink / raw)
  To: Steve Sistare, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

Hi Steve,

> Subject: [PATCH V1 0/5] memfd-pin huge page fixes
> 
> Fix multiple bugs that occur when using memfd_pin_folios with hugetlb
> pages
> and THP.  The hugetlb bugs only bite when the page is not yet faulted in
> when memfd_pin_folios is called.  The THP bug bites when the starting offset
> passed to memfd_pin_folios is not huge page aligned.  See the commit
> messages
> for details.
Thank you for fixing these bugs. I have Acked all patches except for patch #1,
as my understanding of xarrays is limited at this point.

Also, could you please briefly describe how you have exercised the memfd_alloc_folio()
code path or what tests you have run to uncover these bugs? I'd like to
figure out ways to augment the list of udmabuf tests to validate scenarios
where hugetlb pages are not faulted in and memfd_pin_folios() is called.

Thanks,
Vivek

> 
> Steve Sistare (5):
>   mm/filemap: fix filemap_get_folios_contig THP panic
>   mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
>   mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
>   mm/gup: fix memfd_pin_folios hugetlb page allocation
>   mm/gup: fix memfd_pin_folios alloc race panic
> 
>  include/linux/hugetlb.h | 10 ++++++++++
>  mm/filemap.c            |  4 ++++
>  mm/gup.c                |  5 ++++-
>  mm/hugetlb.c            | 17 +++++++++++++++++
>  mm/memfd.c              | 15 +++++++++------
>  5 files changed, 44 insertions(+), 7 deletions(-)
> 
> --
> 1.8.3.1




* Re: [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page allocation
  2024-09-04  1:06   ` Kasireddy, Vivek
@ 2024-09-04 14:51     ` Steven Sistare
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Sistare @ 2024-09-04 14:51 UTC (permalink / raw)
  To: Kasireddy, Vivek, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

On 9/3/2024 9:06 PM, Kasireddy, Vivek wrote:
> Hi Steve,
> 
>> Subject: [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page
>> allocation
>>
>> When memfd_pin_folios -> memfd_alloc_folio creates a hugetlb page, the
>> index is wrong.  The subsequent call to filemap_get_folios_contig thus
>> cannot find it, and fails, and memfd_pin_folios loops forever.
>> To fix, adjust the index for the huge_page_order.
>>
>> memfd_alloc_folio also forgets to unlock the folio, so the next touch
>> of the page calls hugetlb_fault which blocks forever trying to take
>> the lock.  Unlock it.
> Where exactly is the lock taken? I did a quick search but couldn't
> immediately figure out in which function the lock is taken while
> allocating the folio.

memfd_alloc_folio -> hugetlb_add_to_page_cache -> __folio_set_locked

I forgot to add that detail to the commit message.

See for example hugetlbfs_fallocate which calls hugetlb_add_to_page_cache and
then calls folio_unlock before returning.

- Steve

>> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning
>> memfd folios")
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   mm/memfd.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memfd.c b/mm/memfd.c
>> index bfe0e71..bcb131d 100644
>> --- a/mm/memfd.c
>> +++ b/mm/memfd.c
>> @@ -79,10 +79,13 @@ struct folio *memfd_alloc_folio(struct file *memfd,
>> pgoff_t idx)
>>   		 * alloc from. Also, the folio will be pinned for an indefinite
>>   		 * amount of time, so it is not expected to be migrated away.
>>   		 */
>> -		gfp_mask = htlb_alloc_mask(hstate_file(memfd));
>> +		struct hstate *h = hstate_file(memfd);
>> +
>> +		gfp_mask = htlb_alloc_mask(h);
>>   		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
>> +		idx >>= huge_page_order(h);
>>
>> -		folio = alloc_hugetlb_folio_reserve(hstate_file(memfd),
>> +		folio = alloc_hugetlb_folio_reserve(h,
>>   						    numa_node_id(),
>>   						    NULL,
>>   						    gfp_mask);
>> @@ -95,6 +98,7 @@ struct folio *memfd_alloc_folio(struct file *memfd,
>> pgoff_t idx)
>>   				free_huge_folio(folio);
>>   				return ERR_PTR(err);
>>   			}
>> +			folio_unlock(folio);
> Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> 
> Thanks,
> Vivek
> 
>>   			return folio;
>>   		}
>>   		return ERR_PTR(-ENOMEM);
>> --
>> 1.8.3.1
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 0/5] memfd-pin huge page fixes
  2024-09-04  1:12 ` [PATCH V1 0/5] memfd-pin huge page fixes Kasireddy, Vivek
@ 2024-09-04 14:51   ` Steven Sistare
  2024-09-06  8:09     ` Kasireddy, Vivek
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Sistare @ 2024-09-04 14:51 UTC (permalink / raw)
  To: Kasireddy, Vivek, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

On 9/3/2024 9:12 PM, Kasireddy, Vivek wrote:
> Hi Steve,
> 
>> Subject: [PATCH V1 0/5] memfd-pin huge page fixes
>>
>> Fix multiple bugs that occur when using memfd_pin_folios with hugetlb pages
>> and THP.  The hugetlb bugs only bite when the page is not yet faulted in
>> when memfd_pin_folios is called.  The THP bug bites when the starting offset
>> passed to memfd_pin_folios is not huge page aligned.  See the commit messages
>> for details.
> Thank you for fixing these bugs. I have Acked all patches except for patch #1,
> as my understanding of xarrays is limited at this point.
> 
> Also, could you please briefly describe how you have exercised memfd_alloc_folio()
> code path or what tests you have run to uncover these bugs? I'd like to
> figure out ways to augment the list of udmabuf tests to validate scenarios
> where hugetlb pages are not faulted in and memfd_pin_folios() is called.

I am extending iommufd to support memfd pinning, so I added a new ioctl which
takes an fd, offset, and length to pin.  I am just getting started, so currently
it does nothing more than call memfd_pin_folios immediately followed by unpin_folios.
Then exit the process.  That is sufficient to trigger all the bugs except the alloc
race panic. I tested with these files:
   /dev/hugepages/file
   /dev/shm/file    with: mount -o remount,huge=always /dev/shm
   memfd_create(MFD_HUGETLB)
   memfd_create(0)  with: echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled

- Steve

>> Steve Sistare (5):
>>    mm/filemap: fix filemap_get_folios_contig THP panic
>>    mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
>>    mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
>>    mm/gup: fix memfd_pin_folios hugetlb page allocation
>>    mm/gup: fix memfd_pin_folios alloc race panic
>>
>>   include/linux/hugetlb.h | 10 ++++++++++
>>   mm/filemap.c            |  4 ++++
>>   mm/gup.c                |  5 ++++-
>>   mm/hugetlb.c            | 17 +++++++++++++++++
>>   mm/memfd.c              | 15 +++++++++------
>>   5 files changed, 44 insertions(+), 7 deletions(-)
>>
>> --
>> 1.8.3.1
> 



* Re: [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
  2024-09-04  0:45   ` Kasireddy, Vivek
@ 2024-09-04 14:52     ` Steven Sistare
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Sistare @ 2024-09-04 14:52 UTC (permalink / raw)
  To: Kasireddy, Vivek, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

On 9/3/2024 8:45 PM, Kasireddy, Vivek wrote:
> Hi Steve,
> 
>> Subject: [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
>>
>> memfd_pin_folios followed by unpin_folios fails to restore
>> free_huge_pages if the pages were not already faulted in, because the
>> folio refcount for pages created by memfd_alloc_folio never goes to 0.
>> memfd_pin_folios needs another folio_put to undo the folio_try_get
>> below:
>>
>> memfd_alloc_folio()
>>    alloc_hugetlb_folio_nodemask()
>>      dequeue_hugetlb_folio_nodemask()
>>        dequeue_hugetlb_folio_node_exact()
>>          folio_ref_unfreeze(folio, 1);    ; adds 1 refcount
>>    folio_try_get()                        ; adds 1 refcount
> I wonder if it is more optimal to skip the folio_try_get() above.
> I think it (folio_try_get) was probably added because I didn't realize that
> alloc_hugetlb_folio_nodemask() already adds a reference.

Agreed, that is a simpler fix.  I tried it and my tests pass.

>>    hugetlb_add_to_page_cache()            ; adds 512 refcount (on x86)
>>
>> With the fix, after memfd_pin_folios + unpin_folios, the refcount for
>> the (unfaulted) page is 512, which is correct, as the refcount for a
>> faulted unpinned page is 513.
>>
>> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   mm/gup.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 54d0dc3..5b92f1d 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -3618,7 +3618,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>>   	pgoff_t start_idx, end_idx, next_idx;
>>   	struct folio *folio = NULL;
>>   	struct folio_batch fbatch;
>> -	struct hstate *h;
>> +	struct hstate *h = NULL;
>>   	long ret = -EINVAL;
>>
>>   	if (start < 0 || start > end || !max_folios)
>> @@ -3662,6 +3662,8 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>>   							     &fbatch);
>>   			if (folio) {
>>   				folio_put(folio);
>> +				if (h)
>> +					folio_put(folio);
> If we stick with this change, I guess there needs to be an additional
> folio_put() in memfd_alloc_folio() as well if adding the folio to the
> page cache fails.

Indeed, another reason why just deleting the folio_try_get is better.  Thanks!

I will submit a V2 of this patch.

- Steve



* Re: [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
  2024-09-04  1:04   ` Kasireddy, Vivek
@ 2024-09-04 14:52     ` Steven Sistare
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Sistare @ 2024-09-04 14:52 UTC (permalink / raw)
  To: Kasireddy, Vivek, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

On 9/3/2024 9:04 PM, Kasireddy, Vivek wrote:
> Hi Steve,
> 
>> Subject: [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
>>
>> memfd_pin_folios followed by unpin_folios leaves resv_huge_pages elevated
>> if the pages were not already faulted in.  During a normal page fault,
>> resv_huge_pages is consumed here:
>>
>> hugetlb_fault()
>>    alloc_hugetlb_folio()
>>      dequeue_hugetlb_folio_vma()
>>        dequeue_hugetlb_folio_nodemask()
>>          dequeue_hugetlb_folio_node_exact()
>>            free_huge_pages--
>>        resv_huge_pages--
>>
>> During memfd_pin_folios, the page is created by calling
>> alloc_hugetlb_folio_nodemask instead of alloc_hugetlb_folio, and
>> resv_huge_pages is not modified:
>>
>> memfd_alloc_folio()
>>    alloc_hugetlb_folio_nodemask()
>>      dequeue_hugetlb_folio_nodemask()
>>        dequeue_hugetlb_folio_node_exact()
>>          free_huge_pages--
>>
>> alloc_hugetlb_folio_nodemask has other callers that must not modify
>> resv_huge_pages.  Therefore, to fix, define an alternate version of
>> alloc_hugetlb_folio_nodemask for this call site that adjusts
>> resv_huge_pages.
>>
>> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/linux/hugetlb.h | 10 ++++++++++
>>   mm/hugetlb.c            | 17 +++++++++++++++++
>>   mm/memfd.c              |  9 ++++-----
>>   3 files changed, 31 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 45bf05a..3ddd69b 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -695,6 +695,9 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>>   struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>>   				nodemask_t *nmask, gfp_t gfp_mask,
>>   				bool allow_alloc_fallback);
>> +struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>> +					  nodemask_t *nmask, gfp_t gfp_mask);
>> +
>>   int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
>>   			pgoff_t idx);
>>   void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
>> @@ -1062,6 +1065,13 @@ static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>>   }
>>
>>   static inline struct folio *
>> +alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>> +			    nodemask_t *nmask, gfp_t gfp_mask)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline struct folio *
>>   alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>>   			nodemask_t *nmask, gfp_t gfp_mask,
>>   			bool allow_alloc_fallback)
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index aaf508b..c2d44a1 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2564,6 +2564,23 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
>>   	return folio;
>>   }
>>
>> +struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>> +		nodemask_t *nmask, gfp_t gfp_mask)
>> +{
>> +	struct folio *folio;
>> +
>> +	spin_lock_irq(&hugetlb_lock);
>> +	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
> I am assuming a check for available_huge_pages(h) before calling dequeue
> would be redundant as it would simply return NULL if no huge pages are
> available?

It would be wrong to check available_huge_pages() first, because it would return false
if free_huge_pages == resv_huge_pages, but at this point a page was already reserved
for us in resv_huge_pages, so we should always succeed.  The invariant here is
free_huge_pages >= resv_huge_pages. We simply do resv_huge_pages-- here, and do
free_huge_pages-- in the dequeue_hugetlb_folio_nodemask call stack.

- Steve

> Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> 
> Thanks,
> Vivek
> 
>> +					       nmask);
>> +	if (folio) {
>> +		VM_BUG_ON(!h->resv_huge_pages);
>> +		h->resv_huge_pages--;
>> +	}
>> +
>> +	spin_unlock_irq(&hugetlb_lock);
>> +	return folio;
>> +}
>> +
>>   /* folio migration callback function */
>>   struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>>   		nodemask_t *nmask, gfp_t gfp_mask, bool allow_alloc_fallback)
>> diff --git a/mm/memfd.c b/mm/memfd.c
>> index e7b7c52..bfe0e71 100644
>> --- a/mm/memfd.c
>> +++ b/mm/memfd.c
>> @@ -82,11 +82,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>>   		gfp_mask = htlb_alloc_mask(hstate_file(memfd));
>>   		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
>>
>> -		folio = alloc_hugetlb_folio_nodemask(hstate_file(memfd),
>> -						     numa_node_id(),
>> -						     NULL,
>> -						     gfp_mask,
>> -						     false);
>> +		folio = alloc_hugetlb_folio_reserve(hstate_file(memfd),
>> +						    numa_node_id(),
>> +						    NULL,
>> +						    gfp_mask);
>>   		if (folio && folio_try_get(folio)) {
>>   			err = hugetlb_add_to_page_cache(folio,
>>   							memfd->f_mapping,
>> --
>> 1.8.3.1
> 



* RE: [PATCH V1 0/5] memfd-pin huge page fixes
  2024-09-04 14:51   ` Steven Sistare
@ 2024-09-06  8:09     ` Kasireddy, Vivek
  0 siblings, 0 replies; 16+ messages in thread
From: Kasireddy, Vivek @ 2024-09-06  8:09 UTC (permalink / raw)
  To: Steven Sistare, linux-mm
  Cc: Muchun Song, Andrew Morton, Matthew Wilcox, Peter Xu,
	David Hildenbrand, Jason Gunthorpe

> Subject: Re: [PATCH V1 0/5] memfd-pin huge page fixes
> 
> On 9/3/2024 9:12 PM, Kasireddy, Vivek wrote:
> > Hi Steve,
> >
> >> Subject: [PATCH V1 0/5] memfd-pin huge page fixes
> >>
> >> Fix multiple bugs that occur when using memfd_pin_folios with hugetlb pages
> >> and THP.  The hugetlb bugs only bite when the page is not yet faulted in
> >> when memfd_pin_folios is called.  The THP bug bites when the starting offset
> >> passed to memfd_pin_folios is not huge page aligned.  See the commit
> >> messages
> >> for details.
> > Thank you for fixing these bugs. I have Acked all patches except for patch #1,
> > as my understanding of xarrays is limited at this point.
> >
> > Also, could you please briefly describe how you have exercised memfd_alloc_folio()
> > code path or what tests you have run to uncover these bugs? I'd like to
> > figure out ways to augment the list of udmabuf tests to validate scenarios
> > where hugetlb pages are not faulted in and memfd_pin_folios() is called.
> 
> I am extending iommufd to support memfd pinning, so I added a new ioctl which
> takes an fd, offset, and length to pin.  I am just getting started, so currently
> it does nothing more than call memfd_pin_folios immediately followed by
> unpin_folios.
> Then exit the process.  That is sufficient to trigger all the bugs except the alloc
> race panic. I tested with these files:
>    /dev/hugepages/file
>    /dev/shm/file    with: mount -o remount,huge=always /dev/shm
>    memfd_create(MFD_HUGETLB)
>    memfd_create(0)  with: echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
Thank you for explaining your test-case.

Thanks,
Vivek

> 
> - Steve
> 
> >> Steve Sistare (5):
> >>    mm/filemap: fix filemap_get_folios_contig THP panic
> >>    mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
> >>    mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
> >>    mm/gup: fix memfd_pin_folios hugetlb page allocation
> >>    mm/gup: fix memfd_pin_folios alloc race panic
> >>
> >>   include/linux/hugetlb.h | 10 ++++++++++
> >>   mm/filemap.c            |  4 ++++
> >>   mm/gup.c                |  5 ++++-
> >>   mm/hugetlb.c            | 17 +++++++++++++++++
> >>   mm/memfd.c              | 15 +++++++++------
> >>   5 files changed, 44 insertions(+), 7 deletions(-)
> >>
> >> --
> >> 1.8.3.1
> >



end of thread, newest message: 2024-09-06  8:09 UTC

Thread overview: 16+ messages
2024-09-03 14:25 [PATCH V1 0/5] memfd-pin huge page fixes Steve Sistare
2024-09-03 14:25 ` [PATCH V1 1/5] mm/filemap: fix filemap_get_folios_contig THP panic Steve Sistare
2024-09-03 14:25 ` [PATCH V1 2/5] mm/hugetlb: fix memfd_pin_folios free_huge_pages leak Steve Sistare
2024-09-04  0:45   ` Kasireddy, Vivek
2024-09-04 14:52     ` Steven Sistare
2024-09-03 14:25 ` [PATCH V1 3/5] mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak Steve Sistare
2024-09-04  1:04   ` Kasireddy, Vivek
2024-09-04 14:52     ` Steven Sistare
2024-09-03 14:25 ` [PATCH V1 4/5] mm/gup: fix memfd_pin_folios hugetlb page allocation Steve Sistare
2024-09-04  1:06   ` Kasireddy, Vivek
2024-09-04 14:51     ` Steven Sistare
2024-09-03 14:25 ` [PATCH V1 5/5] mm/gup: fix memfd_pin_folios alloc race panic Steve Sistare
2024-09-04  1:07   ` Kasireddy, Vivek
2024-09-04  1:12 ` [PATCH V1 0/5] memfd-pin huge page fixes Kasireddy, Vivek
2024-09-04 14:51   ` Steven Sistare
2024-09-06  8:09     ` Kasireddy, Vivek
