* [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
2024-12-01 21:22 [PATCH 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
@ 2024-12-01 21:22 ` Peter Xu
2024-12-18 14:33 ` Ackerley Tng
2024-12-01 21:22 ` [PATCH 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork() Peter Xu
` (5 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2024-12-01 21:22 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx, Muchun Song,
Oscar Salvador, Roman Gushchin, Naoya Horiguchi, Ackerley Tng,
linux-stable
Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a
process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
avoid_reserve was introduced for a special case of CoW on hugetlb private
mappings, and only if the owner VMA is trying to allocate yet another
hugetlb folio that is not reserved within the private vma reserved map.
Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas
hole punched by fallocate"), alloc_huge_page() was made to enforce that no
global reservation is consumed as long as avoid_reserve=true. This doesn't
look correct, because even though it forces the allocation not to use any
global reservation, it will still try to take one reservation from the spool
(if the subpool exists). And since the spool's reserved pages are taken from
the global reservation, one reservation is still consumed globally.
Logically this can cause the global reservation count to go wrong.
I wrote a reproducer below to trigger this special path; every run of the
program causes the global reservation count to increment by one, until it
reaches the number of free pages:
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#define MSIZE (2UL << 20)
int main(int argc, char *argv[])
{
        const char *path;
        int *buf;
        int fd, ret;
        pid_t child;

        if (argc < 2) {
                printf("usage: %s <hugetlb_file>\n", argv[0]);
                return -1;
        }

        path = argv[1];
        fd = open(path, O_RDWR | O_CREAT, 0666);
        if (fd < 0) {
                perror("open failed");
                return -1;
        }

        ret = fallocate(fd, 0, 0, MSIZE);
        if (ret != 0) {
                perror("fallocate");
                return -1;
        }

        buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) {
                perror("mmap() failed");
                return -1;
        }

        /* Allocate a page */
        *buf = 1;

        child = fork();
        if (child == 0) {
                /* child doesn't need to do anything */
                exit(0);
        }

        /* Trigger CoW from owner */
        *buf = 2;

        munmap(buf, MSIZE);
        close(fd);
        unlink(path);

        return 0;
}
It only reproduces with a sub-mount that has reserved pages in the spool,
for example:
# sysctl vm.nr_hugepages=128
# mkdir ./hugetlb-pool
# mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool
Then run the reproducer on the mountpoint:
# ./reproducer ./hugetlb-pool/test
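The leak can be observed via the global reservation counter in sysfs (the
path below assumes 2M huge pages); after each run it stays one page higher,
even though the reproducer has exited:
# cat /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages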
Fix it by taking the reservation from the spool if available. In general,
avoid_reserve is IMHO more about "avoid the vma resv map", not the spool's
reservation.
I copied stable, however I have no intention of backporting if it's not a
clean cherry-pick, because a private hugetlb mapping with a fork() on top of
it is too rare a case to hit.
Cc: linux-stable <stable@vger.kernel.org>
Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 22 +++-------------------
1 file changed, 3 insertions(+), 19 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cec4b121193f..9ce69fd22a01 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1394,8 +1394,7 @@ static unsigned long available_huge_pages(struct hstate *h)
static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address, int avoid_reserve,
- long chg)
+ unsigned long address, long chg)
{
struct folio *folio = NULL;
struct mempolicy *mpol;
@@ -1411,10 +1410,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
if (!vma_has_reserves(vma, chg) && !available_huge_pages(h))
goto err;
- /* If reserves cannot be used, ensure enough pages are in the pool */
- if (avoid_reserve && !available_huge_pages(h))
- goto err;
-
gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
@@ -1430,7 +1425,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
nid, nodemask);
- if (folio && !avoid_reserve && vma_has_reserves(vma, chg)) {
+ if (folio && vma_has_reserves(vma, chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3007,17 +3002,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
gbl_chg = hugepage_subpool_get_pages(spool, 1);
if (gbl_chg < 0)
goto out_end_reservation;
-
- /*
- * Even though there was no reservation in the region/reserve
- * map, there could be reservations associated with the
- * subpool that can be used. This would be indicated if the
- * return value of hugepage_subpool_get_pages() is zero.
- * However, if avoid_reserve is specified we still avoid even
- * the subpool reservations.
- */
- if (avoid_reserve)
- gbl_chg = 1;
}
/* If this allocation is not consuming a reservation, charge it now.
@@ -3040,7 +3024,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
* from the global free pool (global change). gbl_chg == 0 indicates
* a reservation exists for the allocation.
*/
- folio = dequeue_hugetlb_folio_vma(h, vma, addr, avoid_reserve, gbl_chg);
+ folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
if (!folio) {
spin_unlock_irq(&hugetlb_lock);
folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
--
2.47.0
* Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
2024-12-01 21:22 ` [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool Peter Xu
@ 2024-12-18 14:33 ` Ackerley Tng
2024-12-27 23:15 ` Ackerley Tng
0 siblings, 1 reply; 15+ messages in thread
From: Ackerley Tng @ 2024-12-18 14:33 UTC (permalink / raw)
To: Peter Xu
Cc: linux-kernel, linux-mm, riel, leitao, akpm, peterx, muchun.song,
osalvador, roman.gushchin, nao.horiguchi, stable
Peter Xu <peterx@redhat.com> writes:
> Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a
> process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
> avoid_reserve was introduced for a special case of CoW on hugetlb private
> mappings, and only if the owner VMA is trying to allocate yet another
> hugetlb folio that is not reserved within the private vma reserved map.
>
> Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas
> hole punched by fallocate"), alloc_huge_page() was made to enforce that no
> global reservation is consumed as long as avoid_reserve=true. This doesn't
> look correct, because even though it forces the allocation not to use any
> global reservation, it will still try to take one reservation from the spool
> (if the subpool exists). And since the spool's reserved pages are taken from
> the global reservation, one reservation is still consumed globally.
>
> Logically this can cause the global reservation count to go wrong.
>
> I wrote a reproducer below
Thank you so much for looking into this!
> <snip>
I was able to reproduce this using your
reproducer. /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages
is not decremented even after the reproducer exits.
# sysctl vm.nr_hugepages=16
vm.nr_hugepages = 16
# mkdir ./hugetlb-pool
# mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool
# for i in $(seq 16); do ./a.out hugetlb-pool/test; cat /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages; done
5
6
7
8
9
10
11
12
13
14
15
16
16
16
16
16
#
I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.
* Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
2024-12-18 14:33 ` Ackerley Tng
@ 2024-12-27 23:15 ` Ackerley Tng
2025-01-03 16:26 ` Peter Xu
0 siblings, 1 reply; 15+ messages in thread
From: Ackerley Tng @ 2024-12-27 23:15 UTC (permalink / raw)
To: Ackerley Tng
Cc: peterx, linux-kernel, linux-mm, riel, leitao, akpm, muchun.song,
osalvador, roman.gushchin, nao.horiguchi, stable
Ackerley Tng <ackerleytng@google.com> writes:
> <snip>
>
> I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.
Yes, after looking into this more deeply, I agree that avoid_reserve
means avoiding the reservations in the resv_map rather than reservations
in the subpool or hstate.
Here's more detail of what's going on in the reproducer; I wrote this up as
I reviewed Peter's patch:
1. On fallocate(), allocate page A
2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
3. On faulting *buf = 1, allocate a new page B, copy A to B because the mmap request was MAP_PRIVATE
4. On fork, prep for COW by marking page as read only. Both parent and child share B.
5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
+ B belongs to the child, C belongs to the parent
+ C is owned by the parent
6. Child exits, B is freed
7. On munmap(), C is freed
8. On unlink(), A is freed
When C was allocated in the parent (owns MAP_PRIVATE page, doing a copy on
write), spool->rsv_hpages was decreased but h->resv_huge_pages was not. This is
the root of the bug.
h->resv_huge_pages should be decremented when a reserved page from the
subpool was used, rather than based on whether avoid_reserve or
vma_has_reserves() is set. If avoid_reserve is set, the subpool shouldn't be
checked for a reservation, so we wouldn't be decrementing h->resv_huge_pages
anyway.
I agree with Peter's fix as a whole (the entire patch series).
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
---
Some definitions which might be helpful:
+ h->resv_huge_pages indicates number of reserved pages globally.
+ This number increases when pages are reserved
+ This number decreases when reserved pages are allocated, or when pages are unreserved
+ spool->rsv_hpages indicates number of reserved pages in this subpool.
+ This number increases when pages are reserved
+ This number decreases when reserved pages are allocated, or when pages are unreserved
+ h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.
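As a sketch of how the two counters are meant to move in lockstep (this is
illustrative only, not the actual kernel code; locking and limit checks are
omitted):

        /* Reserving N pages against a subpool with min_size set: */
        spool->rsv_hpages += N;
        h->resv_huge_pages += N;

        /* Allocating a page that consumes one of those reservations: */
        spool->rsv_hpages--;    /* done inside hugepage_subpool_get_pages() */
        h->resv_huge_pages--;   /* must follow, or the global count leaks --
                                 * which is what the avoid_reserve path missed */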
More details on the flow in alloc_hugetlb_folio() which might be helpful:
hugepage_subpool_get_pages() returns "the number of pages by which the global
pools must be adjusted (upward)". This return value is never negative except
on errors (hugepage_subpool_get_pages() is always called with a positive delta).
Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1 (other
than errors).
If the return value is 0, the subpool had enough reservations and so we should
decrement h->resv_huge_pages.
If the return value is 1, it means that this subpool did not have any more
reserved hugepages, and we need to get a page from the global
hstate. dequeue_hugetlb_folio_vma() will get us a page that was already
allocated.
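For reference, the minimum-size branch of hugepage_subpool_get_pages()
roughly behaves like the sketch below (simplified; the max-size accounting,
locking and error handling are omitted):

        long ret = delta;       /* pages the global pool must be charged */

        if (spool->min_hpages != -1 && spool->rsv_hpages) {
                if (delta > spool->rsv_hpages) {
                        /* take all remaining subpool reserves; the rest
                         * must come from the global pool */
                        ret = delta - spool->rsv_hpages;
                        spool->rsv_hpages = 0;
                } else {
                        /* fully satisfied from subpool reserves */
                        ret = 0;
                        spool->rsv_hpages -= delta;
                }
        }
        return ret;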
In dequeue_hugetlb_folio_vma(), if the vma doesn't have a reserve for the
page and there are no available_huge_pages() left, we quit dequeueing since we
will need to allocate a new page. If avoid_reserve is set, meaning we don't
want to use the vma's reserves in the resv_map, we likewise check
available_huge_pages(). If there are available_huge_pages(), we go on to
dequeue a page.
Then, we determine whether to decrement h->resv_huge_pages. The decision
should be based on whether a reserved page from the subpool was used, not on
whether avoid_reserve or vma_has_reserves() is set.
In the case where a surplus page needs to be allocated, the surplus page isn't
and doesn't need to be associated with a subpool, so no subpool hugepage number
tracking updates are required. h->resv_huge_pages still has to be updated... is
this where h->resv_huge_pages can go negative?
* Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
2024-12-27 23:15 ` Ackerley Tng
@ 2025-01-03 16:26 ` Peter Xu
0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2025-01-03 16:26 UTC (permalink / raw)
To: Ackerley Tng
Cc: linux-kernel, linux-mm, riel, leitao, akpm, muchun.song,
osalvador, roman.gushchin, nao.horiguchi, stable
On Fri, Dec 27, 2024 at 11:15:44PM +0000, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
>
> > <snip>
> >
> > I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.
>
> Yes, after looking into this more deeply, I agree that avoid_reserve
> means avoiding the reservations in the resv_map rather than reservations
> in the subpool or hstate.
>
> Here's more detail of what's going on in the reproducer that I wrote as I
> reviewed Peter's patch:
>
> 1. On fallocate(), allocate page A
> 2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
> 3. On faulting *buf = 1, allocate a new page B, copy A to B because the mmap request was MAP_PRIVATE
> 4. On fork, prep for COW by marking page as read only. Both parent and child share B.
> 5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
> + B belongs to the child, C belongs to the parent
> + C is owned by the parent
> 6. Child exits, B is freed
> 7. On munmap(), C is freed
> 8. On unlink(), A is freed
>
> When C was allocated in the parent (owns MAP_PRIVATE page, doing a copy on
> write), spool->rsv_hpages was decreased but h->resv_huge_pages was not. This is
> the root of the bug.
>
> We should decrement h->resv_huge_pages if a reserved page from the subpool was
> used, instead of whether avoid_reserve or vma_has_reserves() is set. If
> avoid_reserve is set, the subpool shouldn't be checked for a reservation, so we
> won't be decrementing h->resv_huge_pages anyway.
>
> I agree with Peter's fix as a whole (the entire patch series).
>
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
> Tested-by: Ackerley Tng <ackerleytng@google.com>
>
> ---
>
> Some definitions which might be helpful:
>
> + h->resv_huge_pages indicates number of reserved pages globally.
> + This number increases when pages are reserved
> + This number decreases when reserved pages are allocated, or when pages are unreserved
> + spool->rsv_hpages indicates number of reserved pages in this subpool.
> + This number increases when pages are reserved
> + This number decreases when reserved pages are allocated, or when pages are unreserved
> + h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.
I think you're correct. One add-on comment: I think when taking the vma
reservation into account, the global reservation should be a sum of all
spools' and all vmas' reservations.
>
> More details on the flow in alloc_hugetlb_folio() which might be helpful:
>
> hugepage_subpool_get_pages() returns "the number of pages by which the global
> pools must be adjusted (upward)". This return value is never negative other than
> errors. (hugepage_subpool_get_pages() always gets called with a positive delta).
>
> Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1 (other
> than errors).
>
> If the return value is 0, the subpool had enough reservations and so we should
> decrement h->resv_huge_pages.
>
> If the return value is 1, it means that this subpool did not have any more
> reserved hugepages, and we need to get a page from the global
> hstate. dequeue_hugetlb_folio_vma() will get us a page that was already
> allocated.
>
> In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves for 1
> page, and there are no available_huge_pages() left, we quit dequeueing since we
> will need to allocate a new page. If we want to avoid_reserve, that means we
> don't want to use the vma's reserves in resv_map, we also check
> available_huge_pages(). If there are available_huge_pages(), we go on to dequeue
> a page.
>
> Then, we determine whether to decrement h->resv_huge_pages. We should decrement
> if a reserved page from the subpool was used, instead of whether avoid_reserve
> or vma_has_reserves() is set.
>
> In the case where a surplus page needs to be allocated, the surplus page isn't
> and doesn't need to be associated with a subpool, so no subpool hugepage number
> tracking updates are required. h->resv_huge_pages still has to be updated... is
> this where h->resv_huge_pages can go negative?
This question doesn't sound relevant to the specific scenario this patch (or
the reproducer attached to it) is about. In this patch's reproducer, no
surplus page needs to be involved.
Going back to the question you're asking - I don't think resv_huge_pages
will go negative for the surplus case?
IIUC updating resv_huge_pages is the correct behavior even for surplus
pages, as long as gbl_chg==0.
The initial change was done by Naoya in commit a88c76954804 ("mm: hugetlb:
fix hugepage memory leak caused by wrong reserve count"). There's some more
information in the commit log. In general, when gbl_chg==0 it means
we consumed a global reservation either in vma or spool, so it must be
accounted globally after the folio is successfully allocated. Here "being
accounted" should mean the global resv count will be properly decremented.
Thanks for taking a look, Ackerley!
--
Peter Xu
* [PATCH 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork()
2024-12-01 21:22 [PATCH 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
2024-12-01 21:22 ` [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool Peter Xu
@ 2024-12-01 21:22 ` Peter Xu
2024-12-01 21:22 ` [PATCH 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner Peter Xu
` (4 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-12-01 21:22 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx, Muchun Song,
Oscar Salvador, Roman Gushchin, Naoya Horiguchi, Ackerley Tng
When fork() stumbles on a dma-pinned hugetlb private page, CoW must happen
during fork() to guarantee dma coherency.
In this specific path, hugetlb pages need to be allocated for the child
process. Stop using the avoid_reserve=1 flag here: it's not required, as
dst_vma (which is destined to be a MAP_PRIVATE hugetlb vma) will have no
private vma resv map, and that already makes sure it won't be able to use a
vma reservation later.
No functional change intended with this change. That said, it's still worth
doing, so as to reduce the usage of avoid_reserve to its only remaining user,
which is also why this flag was introduced initially in commit 04f2cbe35699
("hugetlb: guarantee that COW faults for a process that called
mmap(MAP_PRIVATE) on hugetlbfs will succeed"). I don't see any other place
that should set it at all.
A further patch will clean up resv accounting based on this.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9ce69fd22a01..8d4b4197d11b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5317,7 +5317,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
/* Do not use reserve as it's private owned */
- new_folio = alloc_hugetlb_folio(dst_vma, addr, 1);
+ new_folio = alloc_hugetlb_folio(dst_vma, addr, 0);
if (IS_ERR(new_folio)) {
folio_put(pte_folio);
ret = PTR_ERR(new_folio);
--
2.47.0
* [PATCH 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner
2024-12-01 21:22 [PATCH 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
2024-12-01 21:22 ` [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool Peter Xu
2024-12-01 21:22 ` [PATCH 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork() Peter Xu
@ 2024-12-01 21:22 ` Peter Xu
2024-12-01 21:22 ` [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate Peter Xu
` (3 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-12-01 21:22 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx, Muchun Song,
Oscar Salvador, Roman Gushchin, Naoya Horiguchi, Ackerley Tng
The old name "avoid_reserve" can be too generic and can be used wrongly in
the new call sites that want to allocate a hugetlb folio.
It's confusing on two things: (1) whether one can opt-in to avoid global
reservation, and (2) whether it should take more than one count.
In reality, this flag is only used in an extremely hacky path, in an
extremely hacky way in hugetlb CoW path only, and always use with 1 saying
"skip global reservation". Rename the flag to avoid future abuse of this
flag, making it a boolean so as to reflect its true representation that
it's not a counter. To make it even harder to abuse, add a comment above
the function to explain it.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
fs/hugetlbfs/inode.c | 2 +-
include/linux/hugetlb.h | 4 ++--
mm/hugetlb.c | 33 ++++++++++++++++++++-------------
3 files changed, 23 insertions(+), 16 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a5ea006f403e..665c736bdb30 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -819,7 +819,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
* folios in these areas, we need to consume the reserves
* to keep reservation accounting consistent.
*/
- folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
+ folio = alloc_hugetlb_folio(&pseudo_vma, addr, false);
if (IS_ERR(folio)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
error = PTR_ERR(folio);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ae4fe8615bb6..6189d0383c7f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -682,7 +682,7 @@ struct huge_bootmem_page {
int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve);
+ unsigned long addr, bool cow_from_owner);
struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask, gfp_t gfp_mask,
bool allow_alloc_fallback);
@@ -1061,7 +1061,7 @@ static inline int isolate_or_dissolve_huge_page(struct page *page,
static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
unsigned long addr,
- int avoid_reserve)
+ bool cow_from_owner)
{
return NULL;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8d4b4197d11b..dfd479a857b6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2956,8 +2956,15 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
return ret;
}
+/*
+ * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
+ * faults of hugetlb private mappings on top of a non-page-cache folio (in
+ * which case even if there's a private vma resv map it won't cover such
+ * allocation). New call sites should (probably) never set it to true!!
+ * When it's set, the allocation will bypass all vma level reservations.
+ */
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve)
+ unsigned long addr, bool cow_from_owner)
{
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
@@ -2998,7 +3005,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
* Allocations for MAP_NORESERVE mappings also need to be
* checked against any subpool limit.
*/
- if (map_chg || avoid_reserve) {
+ if (map_chg || cow_from_owner) {
gbl_chg = hugepage_subpool_get_pages(spool, 1);
if (gbl_chg < 0)
goto out_end_reservation;
@@ -3006,7 +3013,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
/* If this allocation is not consuming a reservation, charge it now.
*/
- deferred_reserve = map_chg || avoid_reserve;
+ deferred_reserve = map_chg || cow_from_owner;
if (deferred_reserve) {
ret = hugetlb_cgroup_charge_cgroup_rsvd(
idx, pages_per_huge_page(h), &h_cg);
@@ -3031,7 +3038,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
+ if (!cow_from_owner && vma_has_reserves(vma, gbl_chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3090,7 +3097,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
h_cg);
out_subpool_put:
- if (map_chg || avoid_reserve)
+ if (map_chg || cow_from_owner)
hugepage_subpool_put_pages(spool, 1);
out_end_reservation:
vma_end_reservation(h, vma, addr);
@@ -5317,7 +5324,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
/* Do not use reserve as it's private owned */
- new_folio = alloc_hugetlb_folio(dst_vma, addr, 0);
+ new_folio = alloc_hugetlb_folio(dst_vma, addr, false);
if (IS_ERR(new_folio)) {
folio_put(pte_folio);
ret = PTR_ERR(new_folio);
@@ -5771,7 +5778,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
struct hstate *h = hstate_vma(vma);
struct folio *old_folio;
struct folio *new_folio;
- int outside_reserve = 0;
+ bool cow_from_owner = 0;
vm_fault_t ret = 0;
struct mmu_notifier_range range;
@@ -5840,7 +5847,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
*/
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
old_folio != pagecache_folio)
- outside_reserve = 1;
+ cow_from_owner = true;
folio_get(old_folio);
@@ -5849,7 +5856,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
* be acquired again before returning to the caller, as expected.
*/
spin_unlock(vmf->ptl);
- new_folio = alloc_hugetlb_folio(vma, vmf->address, outside_reserve);
+ new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
if (IS_ERR(new_folio)) {
/*
@@ -5859,7 +5866,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
* reliability, unmap the page from child processes. The child
* may get SIGKILLed if it later faults.
*/
- if (outside_reserve) {
+ if (cow_from_owner) {
struct address_space *mapping = vma->vm_file->f_mapping;
pgoff_t idx;
u32 hash;
@@ -6110,7 +6117,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
goto out;
}
- folio = alloc_hugetlb_folio(vma, vmf->address, 0);
+ folio = alloc_hugetlb_folio(vma, vmf->address, false);
if (IS_ERR(folio)) {
/*
* Returning error will result in faulting task being
@@ -6578,7 +6585,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
goto out;
}
- folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
+ folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
if (IS_ERR(folio)) {
ret = -ENOMEM;
goto out;
@@ -6620,7 +6627,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
goto out;
}
- folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
+ folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
if (IS_ERR(folio)) {
folio_put(*foliop);
ret = -ENOMEM;
--
2.47.0
* [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
2024-12-01 21:22 [PATCH 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
` (2 preceding siblings ...)
2024-12-01 21:22 ` [PATCH 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner Peter Xu
@ 2024-12-01 21:22 ` Peter Xu
2024-12-28 0:06 ` Ackerley Tng
2024-12-01 21:22 ` [PATCH 5/7] mm/hugetlb: Simplify vma_has_reserves() Peter Xu
` (2 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2024-12-01 21:22 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx, Muchun Song,
Oscar Salvador, Roman Gushchin, Naoya Horiguchi, Ackerley Tng
alloc_hugetlb_folio() isn't an easy function to read, especially regarding
the reservation accounting, both per-VMA and global (mostly, spool only).
The 1st complexity lies in the special private CoW path, aka, the
cow_from_owner=true case.
The 2nd complexity is the confusing updates of gbl_chg after it's set once,
which look like they can change anytime on the fly.
Logically, cow_from_owner is only about the vma reservation. We can decouple
the flag and consolidate it into the map charge flag very early, so we don't
need to keep checking the CoW special flag every time.
This patch does that by making map_chg a tri-state flag. Needing a tri-state
is unfortunate; it's because vma_needs_reservation() currently has an internal
side effect, requiring it to be followed by either an end() or a commit().
We keep one semantic the same as before: "if (map_chg)" means we need a
separate per-vma resv count. That keeps most of the old code untouched with
the new enum.
After this patch, we take these steps to decide these variables, hopefully
slightly easier to follow:
- First, decide map_chg. This will take cow_from_owner into account,
once and for all. It's about whether we could take a resv count from
the vma, no matter it's shared, private, etc.
- Then, decide gbl_chg. The only difference here from map_chg is the spool.
Now only update each flag once and for all, instead of keep any of them
flipping which can be very hard to follow.
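As a sketch of the resulting decision order (condensed from the diff below;
error handling omitted, not verbatim):

        if (cow_from_owner)
                map_chg = MAP_CHG_ENFORCED;
        else
                map_chg = vma_needs_reservation(h, vma, addr) ?
                          MAP_CHG_NEEDED : MAP_CHG_REUSE;

        if (map_chg)            /* NEEDED or ENFORCED: no vma resv to reuse */
                gbl_chg = hugepage_subpool_get_pages(spool, 1);
        else                    /* REUSE: the vma resv also covers the global one */
                gbl_chg = 0;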
With cow_from_owner merged into map_chg, we could remove quite a few such
checks all over. Side benefit of such is that we can get rid of one more
confusing flag, which is deferred_reserve.
Clean up the comments a bit too. E.g., MAP_NORESERVE may not need to be
checked against the spool limit, AFAIU, if it's on a shared mapping and the
page cache folio has the inode's resv map available (in which case map_chg
would have been set to zero, hence the code should be correct, it's the
comment that isn't).
There's one trivial detail this patch touches that needs attention, which is
this check right after vma_commit_reservation():
if (map_chg > map_commit)
It changes to:
if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0))
It should behave the same as before, because previously the only way to make
"map_chg > map_commit" happen was map_chg=1 && map_commit=0, which is exactly
what the rewritten line checks. Meanwhile, commit() or end() needs to be
skipped in the ENFORCED case, to keep the old behavior.
Even though a lot looks changed, no functional change is expected.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 116 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 80 insertions(+), 36 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfd479a857b6..14cfe0bb01e4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2956,6 +2956,25 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
return ret;
}
+typedef enum {
+ /*
+ * For either 0/1: we checked the per-vma resv map, and one resv
+ * count either can be reused (0), or an extra needed (1).
+ */
+ MAP_CHG_REUSE = 0,
+ MAP_CHG_NEEDED = 1,
+ /*
+ * The per-vma resv count cannot be used, hence a new resv
+ * count is enforced.
+ *
+ * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
+ * that currently vma_needs_reservation() has an unwanted side
+ * effect to either use end() or commit() to complete the
+ * transaction. Hence it needs to differentiate from NEEDED.
+ */
+ MAP_CHG_ENFORCED = 2,
+} map_chg_state;
+
/*
* NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
* faults of hugetlb private mappings on top of a non-page-cache folio (in
@@ -2969,12 +2988,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct folio *folio;
- long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
- long gbl_chg;
+ long retval, gbl_chg, nr_pages = pages_per_huge_page(h);
+ map_chg_state map_chg;
int memcg_charge_ret, ret, idx;
struct hugetlb_cgroup *h_cg = NULL;
struct mem_cgroup *memcg;
- bool deferred_reserve;
gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
memcg = get_mem_cgroup_from_current();
@@ -2985,36 +3003,56 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
}
idx = hstate_index(h);
- /*
- * Examine the region/reserve map to determine if the process
- * has a reservation for the page to be allocated. A return
- * code of zero indicates a reservation exists (no change).
- */
- map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
- if (map_chg < 0) {
- if (!memcg_charge_ret)
- mem_cgroup_cancel_charge(memcg, nr_pages);
- mem_cgroup_put(memcg);
- return ERR_PTR(-ENOMEM);
+
+ /* Whether we need a separate per-vma reservation? */
+ if (cow_from_owner) {
+ /*
+ * Special case! Since it's a CoW on top of a reserved
+ * page, the private resv map doesn't count. So it cannot
+ * consume the per-vma resv map even if it's reserved.
+ */
+ map_chg = MAP_CHG_ENFORCED;
+ } else {
+ /*
+ * Examine the region/reserve map to determine if the process
+ * has a reservation for the page to be allocated. A return
+ * code of zero indicates a reservation exists (no change).
+ */
+ retval = vma_needs_reservation(h, vma, addr);
+ if (retval < 0) {
+ if (!memcg_charge_ret)
+ mem_cgroup_cancel_charge(memcg, nr_pages);
+ mem_cgroup_put(memcg);
+ return ERR_PTR(-ENOMEM);
+ }
+ map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
}
/*
+ * Whether we need a separate global reservation?
+ *
* Processes that did not create the mapping will have no
* reserves as indicated by the region/reserve map. Check
* that the allocation will not exceed the subpool limit.
- * Allocations for MAP_NORESERVE mappings also need to be
- * checked against any subpool limit.
+ * Or if it can get one from the pool reservation directly.
*/
- if (map_chg || cow_from_owner) {
+ if (map_chg) {
gbl_chg = hugepage_subpool_get_pages(spool, 1);
if (gbl_chg < 0)
goto out_end_reservation;
+ } else {
+ /*
+ * If we have the vma reservation ready, no need for extra
+ * global reservation.
+ */
+ gbl_chg = 0;
}
- /* If this allocation is not consuming a reservation, charge it now.
+ /*
+ * If this allocation is not consuming a per-vma reservation,
+ * charge the hugetlb cgroup now.
*/
- deferred_reserve = map_chg || cow_from_owner;
- if (deferred_reserve) {
+ if (map_chg) {
ret = hugetlb_cgroup_charge_cgroup_rsvd(
idx, pages_per_huge_page(h), &h_cg);
if (ret)
@@ -3038,7 +3076,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (!cow_from_owner && vma_has_reserves(vma, gbl_chg)) {
+ if (vma_has_reserves(vma, gbl_chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3051,7 +3089,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
/* If allocation is not consuming a reservation, also store the
* hugetlb_cgroup pointer on the page.
*/
- if (deferred_reserve) {
+ if (map_chg) {
hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
h_cg, folio);
}
@@ -3060,26 +3098,31 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_set_folio_subpool(folio, spool);
- map_commit = vma_commit_reservation(h, vma, addr);
- if (unlikely(map_chg > map_commit)) {
+ if (map_chg != MAP_CHG_ENFORCED) {
+ /* commit() is only needed if the map_chg is not enforced */
+ retval = vma_commit_reservation(h, vma, addr);
/*
+ * Check for possible race conditions. When it happens..
* The page was added to the reservation map between
* vma_needs_reservation and vma_commit_reservation.
* This indicates a race with hugetlb_reserve_pages.
* Adjust for the subpool count incremented above AND
- * in hugetlb_reserve_pages for the same page. Also,
+ * in hugetlb_reserve_pages for the same page. Also,
* the reservation count added in hugetlb_reserve_pages
* no longer applies.
*/
- long rsv_adjust;
+ if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+ long rsv_adjust;
- rsv_adjust = hugepage_subpool_put_pages(spool, 1);
- hugetlb_acct_memory(h, -rsv_adjust);
- if (deferred_reserve) {
- spin_lock_irq(&hugetlb_lock);
- hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
- pages_per_huge_page(h), folio);
- spin_unlock_irq(&hugetlb_lock);
+ rsv_adjust = hugepage_subpool_put_pages(spool, 1);
+ hugetlb_acct_memory(h, -rsv_adjust);
+ if (map_chg) {
+ spin_lock_irq(&hugetlb_lock);
+ hugetlb_cgroup_uncharge_folio_rsvd(
+ hstate_index(h), pages_per_huge_page(h),
+ folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
}
}
@@ -3093,14 +3136,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
out_uncharge_cgroup_reservation:
- if (deferred_reserve)
+ if (map_chg)
hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
h_cg);
out_subpool_put:
- if (map_chg || cow_from_owner)
+ if (map_chg)
hugepage_subpool_put_pages(spool, 1);
out_end_reservation:
- vma_end_reservation(h, vma, addr);
+ if (map_chg != MAP_CHG_ENFORCED)
+ vma_end_reservation(h, vma, addr);
if (!memcg_charge_ret)
mem_cgroup_cancel_charge(memcg, nr_pages);
mem_cgroup_put(memcg);
--
2.47.0
* Re: [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
2024-12-01 21:22 ` [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate Peter Xu
@ 2024-12-28 0:06 ` Ackerley Tng
2025-01-03 16:37 ` Peter Xu
0 siblings, 1 reply; 15+ messages in thread
From: Ackerley Tng @ 2024-12-28 0:06 UTC (permalink / raw)
To: Peter Xu
Cc: linux-kernel, linux-mm, riel, leitao, akpm, peterx, muchun.song,
osalvador, roman.gushchin, nao.horiguchi
Peter Xu <peterx@redhat.com> writes:
> <snip>
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> mm/hugetlb.c | 116 +++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 80 insertions(+), 36 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dfd479a857b6..14cfe0bb01e4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2956,6 +2956,25 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> return ret;
> }
>
> +typedef enum {
> + /*
> + * For either 0/1: we checked the per-vma resv map, and one resv
> + * count either can be reused (0), or an extra needed (1).
> + */
> + MAP_CHG_REUSE = 0,
> + MAP_CHG_NEEDED = 1,
> + /*
> + * The per-vma resv count cannot be used, hence a new resv
> + * count is enforced.
> + *
> + * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
> + * that currently vma_needs_reservation() has an unwanted side
> + * effect to either use end() or commit() to complete the
> + * transaction. Hence it needs to differentiate from NEEDED.
> + */
> + MAP_CHG_ENFORCED = 2,
> +} map_chg_state;
> +
> /*
> * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
> * faults of hugetlb private mappings on top of a non-page-cache folio (in
> @@ -2969,12 +2988,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct folio *folio;
> - long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> - long gbl_chg;
> + long retval, gbl_chg, nr_pages = pages_per_huge_page(h);
> + map_chg_state map_chg;
> int memcg_charge_ret, ret, idx;
> struct hugetlb_cgroup *h_cg = NULL;
> struct mem_cgroup *memcg;
> - bool deferred_reserve;
> gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
>
> memcg = get_mem_cgroup_from_current();
> @@ -2985,36 +3003,56 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> }
>
> idx = hstate_index(h);
> - /*
> - * Examine the region/reserve map to determine if the process
> - * has a reservation for the page to be allocated. A return
> - * code of zero indicates a reservation exists (no change).
> - */
> - map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> - if (map_chg < 0) {
> - if (!memcg_charge_ret)
> - mem_cgroup_cancel_charge(memcg, nr_pages);
> - mem_cgroup_put(memcg);
> - return ERR_PTR(-ENOMEM);
> +
> + /* Whether we need a separate per-vma reservation? */
> + if (cow_from_owner) {
> + /*
> + * Special case! Since it's a CoW on top of a reserved
> + * page, the private resv map doesn't count. So it cannot
> + * consume the per-vma resv map even if it's reserved.
> + */
> + map_chg = MAP_CHG_ENFORCED;
> + } else {
> + /*
> + * Examine the region/reserve map to determine if the process
> + * has a reservation for the page to be allocated. A return
> + * code of zero indicates a reservation exists (no change).
> + */
> + retval = vma_needs_reservation(h, vma, addr);
> + if (retval < 0) {
> + if (!memcg_charge_ret)
> + mem_cgroup_cancel_charge(memcg, nr_pages);
> + mem_cgroup_put(memcg);
> + return ERR_PTR(-ENOMEM);
> + }
> + map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
> }
>
> /*
> + * Whether we need a separate global reservation?
> + *
> * Processes that did not create the mapping will have no
> * reserves as indicated by the region/reserve map. Check
> * that the allocation will not exceed the subpool limit.
> - * Allocations for MAP_NORESERVE mappings also need to be
> - * checked against any subpool limit.
> + * Or if it can get one from the pool reservation directly.
> */
> - if (map_chg || cow_from_owner) {
> + if (map_chg) {
> gbl_chg = hugepage_subpool_get_pages(spool, 1);
> if (gbl_chg < 0)
> goto out_end_reservation;
> + } else {
> + /*
> + * If we have the vma reservation ready, no need for extra
> + * global reservation.
> + */
> + gbl_chg = 0;
> }
>
> - /* If this allocation is not consuming a reservation, charge it now.
> + /*
> + * If this allocation is not consuming a per-vma reservation,
> + * charge the hugetlb cgroup now.
> */
> - deferred_reserve = map_chg || cow_from_owner;
> - if (deferred_reserve) {
> + if (map_chg) {
> ret = hugetlb_cgroup_charge_cgroup_rsvd(
> idx, pages_per_huge_page(h), &h_cg);
Should hugetlb_cgroup_charge_cgroup_rsvd() be called when map_chg == MAP_CHG_ENFORCED?
> if (ret)
> @@ -3038,7 +3076,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> if (!folio)
> goto out_uncharge_cgroup;
> spin_lock_irq(&hugetlb_lock);
> - if (!cow_from_owner && vma_has_reserves(vma, gbl_chg)) {
> + if (vma_has_reserves(vma, gbl_chg)) {
> folio_set_hugetlb_restore_reserve(folio);
> h->resv_huge_pages--;
> }
> @@ -3051,7 +3089,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> /* If allocation is not consuming a reservation, also store the
> * hugetlb_cgroup pointer on the page.
> */
> - if (deferred_reserve) {
> + if (map_chg) {
> hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
> h_cg, folio);
> }
same for this,
> @@ -3060,26 +3098,31 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>
> hugetlb_set_folio_subpool(folio, spool);
>
> - map_commit = vma_commit_reservation(h, vma, addr);
> - if (unlikely(map_chg > map_commit)) {
> + if (map_chg != MAP_CHG_ENFORCED) {
> + /* commit() is only needed if the map_chg is not enforced */
> + retval = vma_commit_reservation(h, vma, addr);
> /*
> + * Check for possible race conditions. When it happens..
> * The page was added to the reservation map between
> * vma_needs_reservation and vma_commit_reservation.
> * This indicates a race with hugetlb_reserve_pages.
> * Adjust for the subpool count incremented above AND
> - * in hugetlb_reserve_pages for the same page. Also,
> + * in hugetlb_reserve_pages for the same page. Also,
> * the reservation count added in hugetlb_reserve_pages
> * no longer applies.
> */
> - long rsv_adjust;
> + if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
> + long rsv_adjust;
>
> - rsv_adjust = hugepage_subpool_put_pages(spool, 1);
> - hugetlb_acct_memory(h, -rsv_adjust);
> - if (deferred_reserve) {
> - spin_lock_irq(&hugetlb_lock);
> - hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> - pages_per_huge_page(h), folio);
> - spin_unlock_irq(&hugetlb_lock);
> + rsv_adjust = hugepage_subpool_put_pages(spool, 1);
> + hugetlb_acct_memory(h, -rsv_adjust);
> + if (map_chg) {
> + spin_lock_irq(&hugetlb_lock);
> + hugetlb_cgroup_uncharge_folio_rsvd(
> + hstate_index(h), pages_per_huge_page(h),
> + folio);
> + spin_unlock_irq(&hugetlb_lock);
> + }
> }
> }
>
> @@ -3093,14 +3136,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> out_uncharge_cgroup:
> hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
> out_uncharge_cgroup_reservation:
> - if (deferred_reserve)
> + if (map_chg)
> hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
> h_cg);
and same for this.
> out_subpool_put:
> - if (map_chg || cow_from_owner)
> + if (map_chg)
> hugepage_subpool_put_pages(spool, 1);
> out_end_reservation:
> - vma_end_reservation(h, vma, addr);
> + if (map_chg != MAP_CHG_ENFORCED)
> + vma_end_reservation(h, vma, addr);
> if (!memcg_charge_ret)
> mem_cgroup_cancel_charge(memcg, nr_pages);
> mem_cgroup_put(memcg);
* Re: [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
2024-12-28 0:06 ` Ackerley Tng
@ 2025-01-03 16:37 ` Peter Xu
2025-01-06 14:48 ` Ackerley Tng
0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2025-01-03 16:37 UTC (permalink / raw)
To: Ackerley Tng
Cc: linux-kernel, linux-mm, riel, leitao, akpm, muchun.song,
osalvador, roman.gushchin, nao.horiguchi
On Sat, Dec 28, 2024 at 12:06:34AM +0000, Ackerley Tng wrote:
> >
> > - /* If this allocation is not consuming a reservation, charge it now.
> > + /*
> > + * If this allocation is not consuming a per-vma reservation,
> > + * charge the hugetlb cgroup now.
> > */
> > - deferred_reserve = map_chg || cow_from_owner;
> > - if (deferred_reserve) {
> > + if (map_chg) {
> > ret = hugetlb_cgroup_charge_cgroup_rsvd(
> > idx, pages_per_huge_page(h), &h_cg);
>
> Should hugetlb_cgroup_charge_cgroup_rsvd() be called when map_chg == MAP_CHG_ENFORCED?
This looks like a pretty niche use case, though I would say yes.
I didn't take that into much consideration when drafting the patch, as the
change here should have kept the old behavior: map_chg grows into the
tri-state so that we can drop deferred_reserve, while OTOH nothing should
change in the cgroup charging behavior.
When it happens, it means the owner process CoWed a private hugetlb folio
which will enforce bypassing the vma reservation. Here bypassing the vma
check makes sense to me, because the new to-be-cowed folio X will replace
another folio Y, which should have consumed the private vma resv at this
specific index. So there's no way the to-be-cowed folio X can have anything
to do with the vma reservation..
Besides the vma reservation, I don't see why this folio allocation needs to
be any more special. IOW, it should still go through all the remaining checks
and fail the process properly if any check fails, and that should include any
form of cgroup (either hugetlb or memcg), IMHO.
Do you have any specific thought on this path?
Thanks,
--
Peter Xu
* Re: [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
2025-01-03 16:37 ` Peter Xu
@ 2025-01-06 14:48 ` Ackerley Tng
2025-01-06 20:55 ` Peter Xu
0 siblings, 1 reply; 15+ messages in thread
From: Ackerley Tng @ 2025-01-06 14:48 UTC (permalink / raw)
To: Peter Xu
Cc: linux-kernel, linux-mm, riel, leitao, akpm, muchun.song,
osalvador, roman.gushchin, nao.horiguchi
Peter Xu <peterx@redhat.com> writes:
> On Sat, Dec 28, 2024 at 12:06:34AM +0000, Ackerley Tng wrote:
>> >
>> > - /* If this allocation is not consuming a reservation, charge it now.
>> > + /*
>> > + * If this allocation is not consuming a per-vma reservation,
>> > + * charge the hugetlb cgroup now.
>> > */
>> > - deferred_reserve = map_chg || cow_from_owner;
>> > - if (deferred_reserve) {
>> > + if (map_chg) {
>> > ret = hugetlb_cgroup_charge_cgroup_rsvd(
>> > idx, pages_per_huge_page(h), &h_cg);
>>
>> Should hugetlb_cgroup_charge_cgroup_rsvd() be called when map_chg == MAP_CHG_ENFORCED?
>
> This looks like a pretty niche use case, though I would say yes.
>
> I don't think I take a lot of consideration here when drafting the patch,
> as the change here should have kept the old behavior: map_chg grows into
> the tristate so that we can drop deferred_reserve, OTOH nothing should
> change from such behavior of cgroup charging.
>
> When it happens, it means the owner process CoWed a private hugetlb folio
> which will enforce bypassing the vma reservation. Here bypassing the vma
> check makes sense to me, because the new to-be-cowed folio X will replace
> another folio Y, which should have consumed the private vma resv at this
> specific index. So there's no way the to-be-cowed folio X can have anything
> to do with the vma reservation..
>
> Besides the vma reservation, I don't see why this folio allocation needs to
> be any more special. IOW, it should still go through all rest checks and
> fail the process properly if the check fails, that should include any form
> of cgroups (either hugetlb or memcg), IMHO.
>
> Do you have any specific thought on this path?
I re-read the code, and I hope this understanding is right:
When a user sets "rsvd.max_usage_in_bytes" to X, the user is saying that
within this cgroup, the maximum memory that can be reserved in the vma
reservation is X.
Hence even when this CoW is performed, this should count towards the
cgroup's "rsvd.max_usage_in_bytes" and so yes, it should be charged.
I think I misunderstood the context on cgroup charging earlier and hence
I thought it shouldn't be charged, but I agree with you after
re-reading.
>
> Thanks,
* Re: [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
2025-01-06 14:48 ` Ackerley Tng
@ 2025-01-06 20:55 ` Peter Xu
0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2025-01-06 20:55 UTC (permalink / raw)
To: Ackerley Tng
Cc: linux-kernel, linux-mm, riel, leitao, akpm, muchun.song,
osalvador, roman.gushchin, nao.horiguchi
On Mon, Jan 06, 2025 at 02:48:12PM +0000, Ackerley Tng wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Sat, Dec 28, 2024 at 12:06:34AM +0000, Ackerley Tng wrote:
> >> >
> >> > - /* If this allocation is not consuming a reservation, charge it now.
> >> > + /*
> >> > + * If this allocation is not consuming a per-vma reservation,
> >> > + * charge the hugetlb cgroup now.
> >> > */
> >> > - deferred_reserve = map_chg || cow_from_owner;
> >> > - if (deferred_reserve) {
> >> > + if (map_chg) {
> >> > ret = hugetlb_cgroup_charge_cgroup_rsvd(
> >> > idx, pages_per_huge_page(h), &h_cg);
> >>
> >> Should hugetlb_cgroup_charge_cgroup_rsvd() be called when map_chg == MAP_CHG_ENFORCED?
> >
> > This looks like a pretty niche use case, though I would say yes.
> >
> > I don't think I take a lot of consideration here when drafting the patch,
> > as the change here should have kept the old behavior: map_chg grows into
> > the tristate so that we can drop deferred_reserve, OTOH nothing should
> > change from such behavior of cgroup charging.
> >
> > When it happens, it means the owner process CoWed a private hugetlb folio
> > which will enforce bypassing the vma reservation. Here bypassing the vma
> > check makes sense to me, because the new to-be-cowed folio X will replace
> > another folio Y, which should have consumed the private vma resv at this
> > specific index. So there's no way the to-be-cowed folio X can have anything
> > to do with the vma reservation..
> >
> > Besides the vma reservation, I don't see why this folio allocation needs to
> > be any more special. IOW, it should still go through all rest checks and
> > fail the process properly if the check fails, that should include any form
> > of cgroups (either hugetlb or memcg), IMHO.
> >
> > Do you have any specific thought on this path?
>
> I re-read the code, and I hope this understanding is right:
>
> When a user sets "rsvd.max_usage_in_bytes" to X, the user is saying that
> within this cgroup, the maximum memory that can be reserved in the vma
> reservation is X.
Right, and the allocation may or may not attach to a vma reservation at
all. In this case it skips the vma reservation but still needs to be
accounted; there are other similar cases where the vma resv doesn't count,
e.g. MAP_NORESERVE. For those, the reservation accounting is only done at
allocation time.
>
> Hence even when this CoW is performed, this should count towards the
> cgroup's "rsvd.max_usage_in_bytes" and so yes, it should be charged.
>
> I think I misunderstood the context on cgroup charging earlier and hence
> I thought it shouldn't be charged, but I agree with you after
> re-reading.
Thanks. I'll hold another 1-2 days then I'll respin.
--
Peter Xu
* [PATCH 5/7] mm/hugetlb: Simplify vma_has_reserves()
2024-12-01 21:22 [PATCH 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
` (3 preceding siblings ...)
2024-12-01 21:22 ` [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate Peter Xu
@ 2024-12-01 21:22 ` Peter Xu
2024-12-01 21:22 ` [PATCH 6/7] mm/hugetlb: Drop vma_has_reserves() Peter Xu
2024-12-01 21:22 ` [PATCH 7/7] mm/hugetlb: Unify restore reserve accounting for new allocations Peter Xu
6 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-12-01 21:22 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx, Muchun Song,
Oscar Salvador, Roman Gushchin, Naoya Horiguchi, Ackerley Tng
vma_has_reserves() is a helper "trying" to know whether the vma should
consume one reservation when allocating the hugetlb folio.
However it's not clear why we need such complexity, as that information is
already represented in the "chg" variable.
From alloc_hugetlb_folio() context, "chg" (or in the function's context,
"gbl_chg") is defined as:
- If gbl_chg=1, the allocation cannot reuse an existing reservation
- If gbl_chg=0, the allocation should reuse an existing reservation
Firstly, map_chg is defined as follows, to cover all hugetlb reservation
scenarios (mostly via vma_needs_reservation(), but cow_from_owner is an
outlier):
CONDITION                                             HAS RESERVATION?
=========                                             ================
- SHARED: always check against per-inode resv_map
  (ignore NONRESERVE)
    - If resv exists                              ==> YES [1]
    - If not                                      ==> NO  [2]
- PRIVATE: complicated...
    - Request came from a CoW from owner resv map ==> NO  [3]
      (when cow_from_owner==true)
    - If does not own a resv_map at all..         ==> NO  [4]
      (examples: VM_NORESERVE, private fork())
    - If owns a resv_map, but resv doesn't exist  ==> NO  [5]
    - If owns a resv_map, and resv exists         ==> YES [6]
Further on, gbl_chg also takes the spool setup into account, so it is a
decision based on all the context.
If we look at vma_has_reserves(), it mostly does checks that have already
been processed by the map_chg accounting (I marked each return value with the
case above):
static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
{
        if (vma->vm_flags & VM_NORESERVE) {
                if (vma->vm_flags & VM_MAYSHARE && chg == 0)
                        return true;            ==> [1]
                else
                        return false;           ==> [2] or [4]
        }

        if (vma->vm_flags & VM_MAYSHARE) {
                if (chg)
                        return false;           ==> [2]
                else
                        return true;            ==> [1]
        }

        if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
                if (chg)
                        return false;           ==> [5]
                else
                        return true;            ==> [6]
        }

        return false;                           ==> [4]
}
It doesn't check [3], but case [3] is now already covered by the "chg" /
"gbl_chg" / "map_chg" calculations.
In short, vma_has_reserves() doesn't provide anything more than returning
"!chg", so just simplify all of it.
There are a lot of comments describing truncation races; IIUC there should be
no race as long as map_chg is properly done.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 67 ++++++----------------------------------------------
1 file changed, 7 insertions(+), 60 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 14cfe0bb01e4..b7e16b3c4e67 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1247,66 +1247,13 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
}
/* Returns true if the VMA has associated reserve pages */
-static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
+static bool vma_has_reserves(long chg)
{
- if (vma->vm_flags & VM_NORESERVE) {
- /*
- * This address is already reserved by other process(chg == 0),
- * so, we should decrement reserved count. Without decrementing,
- * reserve count remains after releasing inode, because this
- * allocated page will go into page cache and is regarded as
- * coming from reserved pool in releasing step. Currently, we
- * don't have any other solution to deal with this situation
- * properly, so add work-around here.
- */
- if (vma->vm_flags & VM_MAYSHARE && chg == 0)
- return true;
- else
- return false;
- }
-
- /* Shared mappings always use reserves */
- if (vma->vm_flags & VM_MAYSHARE) {
- /*
- * We know VM_NORESERVE is not set. Therefore, there SHOULD
- * be a region map for all pages. The only situation where
- * there is no region map is if a hole was punched via
- * fallocate. In this case, there really are no reserves to
- * use. This situation is indicated if chg != 0.
- */
- if (chg)
- return false;
- else
- return true;
- }
-
/*
- * Only the process that called mmap() has reserves for
- * private mappings.
+ * Now "chg" has all the conditions considered for whether we
+ * should use an existing reservation.
*/
- if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- /*
- * Like the shared case above, a hole punch or truncate
- * could have been performed on the private mapping.
- * Examine the value of chg to determine if reserves
- * actually exist or were previously consumed.
- * Very Subtle - The value of chg comes from a previous
- * call to vma_needs_reserves(). The reserve map for
- * private mappings has different (opposite) semantics
- * than that of shared mappings. vma_needs_reserves()
- * has already taken this difference in semantics into
- * account. Therefore, the meaning of chg is the same
- * as in the shared case above. Code could easily be
- * combined, but keeping it separate draws attention to
- * subtle differences.
- */
- if (chg)
- return false;
- else
- return true;
- }
-
- return false;
+ return chg == 0;
}
static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
@@ -1407,7 +1354,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
* have no page reserves. This check ensures that reservations are
* not "stolen". The child may still get SIGKILLed
*/
- if (!vma_has_reserves(vma, chg) && !available_huge_pages(h))
+ if (!vma_has_reserves(chg) && !available_huge_pages(h))
goto err;
gfp_mask = htlb_alloc_mask(h);
@@ -1425,7 +1372,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
nid, nodemask);
- if (folio && vma_has_reserves(vma, chg)) {
+ if (folio && vma_has_reserves(chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3076,7 +3023,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (vma_has_reserves(vma, gbl_chg)) {
+ if (vma_has_reserves(gbl_chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
--
2.47.0
* [PATCH 6/7] mm/hugetlb: Drop vma_has_reserves()
2024-12-01 21:22 [PATCH 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
` (4 preceding siblings ...)
2024-12-01 21:22 ` [PATCH 5/7] mm/hugetlb: Simplify vma_has_reserves() Peter Xu
@ 2024-12-01 21:22 ` Peter Xu
2024-12-01 21:22 ` [PATCH 7/7] mm/hugetlb: Unify restore reserve accounting for new allocations Peter Xu
6 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-12-01 21:22 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx, Muchun Song,
Oscar Salvador, Roman Gushchin, Naoya Horiguchi, Ackerley Tng
After the previous cleanup, vma_has_reserves() is mostly an empty helper,
except that it expresses "use a reserve" as the inverse of "needs a global
reserve count", which is still true.
To avoid the confusion of having two inverted ways to ask the same question,
use gbl_chg everywhere, and drop the function.
While at it, rename "chg" to "gbl_chg" in dequeue_hugetlb_folio_vma(). It
might help readers to see that the "chg" here is the global reserve count,
not the vma resv count.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 23 ++++++-----------------
1 file changed, 6 insertions(+), 17 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b7e16b3c4e67..10251ef3289a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1246,16 +1246,6 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
hugetlb_dup_vma_private(vma);
}
-/* Returns true if the VMA has associated reserve pages */
-static bool vma_has_reserves(long chg)
-{
- /*
- * Now "chg" has all the conditions considered for whether we
- * should use an existing reservation.
- */
- return chg == 0;
-}
-
static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
{
int nid = folio_nid(folio);
@@ -1341,7 +1331,7 @@ static unsigned long available_huge_pages(struct hstate *h)
static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address, long chg)
+ unsigned long address, long gbl_chg)
{
struct folio *folio = NULL;
struct mempolicy *mpol;
@@ -1350,11 +1340,10 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
int nid;
/*
- * A child process with MAP_PRIVATE mappings created by their parent
- * have no page reserves. This check ensures that reservations are
- * not "stolen". The child may still get SIGKILLed
+ * gbl_chg==1 means the allocation requires a new page that was not
+ * reserved before. Make sure there's at least one free page.
*/
- if (!vma_has_reserves(chg) && !available_huge_pages(h))
+ if (gbl_chg && !available_huge_pages(h))
goto err;
gfp_mask = htlb_alloc_mask(h);
@@ -1372,7 +1361,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
nid, nodemask);
- if (folio && vma_has_reserves(chg)) {
+ if (folio && !gbl_chg) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3023,7 +3012,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (vma_has_reserves(gbl_chg)) {
+ if (!gbl_chg) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
--
2.47.0
* [PATCH 7/7] mm/hugetlb: Unify restore reserve accounting for new allocations
2024-12-01 21:22 [PATCH 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
` (5 preceding siblings ...)
2024-12-01 21:22 ` [PATCH 6/7] mm/hugetlb: Drop vma_has_reserves() Peter Xu
@ 2024-12-01 21:22 ` Peter Xu
6 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-12-01 21:22 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx, Muchun Song,
Oscar Salvador, Roman Gushchin, Naoya Horiguchi, Ackerley Tng
Hugetlb folios, whether dequeued from the hstate free lists or newly
allocated from the buddy allocator, require restore-reserve accounting to be
managed properly. Merge the two paths. Add a small comment to make it
slightly nicer.
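For reference, below is a heavily trimmed sketch of the resulting flow in
alloc_hugetlb_folio() (error handling, cgroup charging and the map_chg/spool
bookkeeping are omitted; function names follow my reading of the code at this
point of the series):

	spin_lock_irq(&hugetlb_lock);
	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
	if (!folio) {
		spin_unlock_irq(&hugetlb_lock);
		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
		spin_lock_irq(&hugetlb_lock);
		list_add(&folio->lru, &h->hugepage_activelist);
		folio_ref_unfreeze(folio, 1);
	}
	/*
	 * Either dequeued or buddy-allocated, the folio gets the
	 * restore-reserve mark when it consumes a global reservation.
	 */
	if (!gbl_chg) {
		folio_set_hugetlb_restore_reserve(folio);
		h->resv_huge_pages--;
	}
	/* ... cgroup commit etc. omitted ... */
	spin_unlock_irq(&hugetlb_lock);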
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 10251ef3289a..64e690fe52bf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1361,11 +1361,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
nid, nodemask);
- if (folio && !gbl_chg) {
- folio_set_hugetlb_restore_reserve(folio);
- h->resv_huge_pages--;
- }
-
mpol_cond_put(mpol);
return folio;
@@ -3012,15 +3007,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (!gbl_chg) {
- folio_set_hugetlb_restore_reserve(folio);
- h->resv_huge_pages--;
- }
list_add(&folio->lru, &h->hugepage_activelist);
folio_ref_unfreeze(folio, 1);
/* Fall through */
}
+ /*
+ * Either dequeued or buddy-allocated folio needs to add special
+ * mark to the folio when it consumes a global reservation.
+ */
+ if (!gbl_chg) {
+ folio_set_hugetlb_restore_reserve(folio);
+ h->resv_huge_pages--;
+ }
+
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
/* If allocation is not consuming a reservation, also store the
* hugetlb_cgroup pointer on the page.
--
2.47.0