From: Aristeu Rozanski <aris@ruivo.org>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Vishal Moola <vishal.moola@gmail.com>,
Aristeu Rozanski <aris@redhat.com>,
Muchun Song <muchun.song@linux.dev>,
stable@vger.kernel.org
Subject: [PATCH v2] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
Date: Mon, 1 Jul 2024 17:23:43 -0400
Message-ID: <20240701212343.GG844599@cathedrallabs.org>
In-Reply-To: <6683024a.050a0220.45e6c.7312@mx.google.com>

When trying to allocate a hugepage with no reserved ones free, the
allocation may still be allowed if a number of overcommit hugepages was
configured (using /proc/sys/vm/nr_overcommit_hugepages) and that number
hasn't been reached yet. This allows extra hugepages to be allocated
dynamically when there are resources for it. Some sysadmins even prefer
not to reserve any hugepages and instead set a big number of overcommit
hugepages.

But when attempting to allocate overcommit hugepages on a multi-node
system (NUMA, possibly further restricted by mempolicy/cpuset), said
allocations may randomly fail even when there are resources available.

This happens because allowed_mems_nr() only accounts for the number of
free hugepages in the nodes the current process is allowed to use, while
surplus hugepages are allocated with no node restriction and can end up
on any node. If one or more of the requested surplus hugepages is
allocated on a different node, the whole allocation fails because
allowed_mems_nr() returns a lower value.
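
For reference, the accounting in question looks roughly like this
(paraphrased from current upstream mm/hugetlb.c, lightly abridged): only
nodes allowed by the cpuset, further narrowed by an MPOL_BIND mempolicy
if one is set, contribute their free hugepages to the total:

static unsigned int allowed_mems_nr(struct hstate *h)
{
        int node;
        unsigned int nr = 0;
        unsigned int *array = h->free_huge_pages_node;
        gfp_t gfp_mask = htlb_alloc_mask(h);
        nodemask_t *mbind_nodemask = policy_mbind_nodemask(gfp_mask);

        /* count free hugepages only on nodes this task may use */
        for_each_node_mask(node, cpuset_current_mems_allowed) {
                if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
                        nr += array[node];
        }

        return nr;
}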

Fix this by allocating surplus hugepages on one of the nodes the current
process is allowed to use.

An easy way to reproduce this issue is on a system with 2+ NUMA nodes:

# echo 0 >/proc/sys/vm/nr_hugepages
# echo 1 >/proc/sys/vm/nr_overcommit_hugepages
# numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2

Repeated runs of the map_hugetlb selftest will eventually fail when the
hugepage ends up allocated on a different node.
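
If numactl(8) or the selftest binary aren't at hand, a minimal standalone
reproducer along the same lines might look like this (a sketch, not the
actual selftest; it assumes the default 2MB hugepage size, and hardcodes
MPOL_BIND from the uapi mempolicy header):

/* Rough stand-in for "numactl -m0 map_hugetlb": bind to node 0, then
 * try to fault in one 2MB surplus hugepage.  Build: gcc -o repro repro.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MPOL_BIND       2                       /* from linux/mempolicy.h */
#define LENGTH          (2UL * 1024 * 1024)     /* one 2MB hugepage */

int main(void)
{
        unsigned long nodemask = 1UL << 0;      /* allow node 0 only */
        void *p;

        /* same effect as "numactl -m0" */
        if (syscall(SYS_set_mempolicy, MPOL_BIND, &nodemask,
                    8 * sizeof(nodemask)) < 0) {
                perror("set_mempolicy");
                return 1;
        }

        /* the reservation happens at mmap() time; this is what fails when
         * the surplus hugepage gets allocated on a node other than node 0 */
        p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        memset(p, 0, LENGTH);                   /* touch the page */
        munmap(p, LENGTH);
        return 0;
}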

v2: - attempt to make the description clearer
    - prevent uninitialized use of folio in case the current process isn't
      part of any nodes with memory

Cc: Vishal Moola <vishal.moola@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski <aris@ruivo.org>
---
mm/hugetlb.c | 47 ++++++++++++++++++++++++++++-------------------
1 file changed, 28 insertions(+), 19 deletions(-)
--- upstream.orig/mm/hugetlb.c 2024-06-20 13:42:25.699568114 -0400
+++ upstream/mm/hugetlb.c 2024-07-01 16:48:53.693298053 -0400
@@ -2618,6 +2618,23 @@ struct folio *alloc_hugetlb_folio_nodema
 	return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask);
 }
 
+static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
+{
+#ifdef CONFIG_NUMA
+	struct mempolicy *mpol = get_task_policy(current);
+
+	/*
+	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
+	 * (from policy_nodemask) specifically for hugetlb case
+	 */
+	if (mpol->mode == MPOL_BIND &&
+		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
+		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+		return &mpol->nodes;
+#endif
+	return NULL;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2631,6 +2648,8 @@ static int gather_surplus_pages(struct h
 	long i;
 	long needed, allocated;
 	bool alloc_ok = true;
+	int node;
+	nodemask_t *mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
 
 	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
@@ -2645,8 +2664,15 @@ allocated = 0;
 retry:
 	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
-				NUMA_NO_NODE, NULL);
+		folio = NULL;
+		for_each_node_mask(node, cpuset_current_mems_allowed) {
+			if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) {
+				folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
+						node, NULL);
+				if (folio)
+					break;
+			}
+		}
 		if (!folio) {
 			alloc_ok = false;
 			break;
@@ -4876,23 +4902,6 @@ default_hstate_max_huge_pages = 0;
 }
 __setup("default_hugepagesz=", default_hugepagesz_setup);
 
-static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
-{
-#ifdef CONFIG_NUMA
-	struct mempolicy *mpol = get_task_policy(current);
-
-	/*
-	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
-	 * (from policy_nodemask) specifically for hugetlb case
-	 */
-	if (mpol->mode == MPOL_BIND &&
-		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
-		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
-		return &mpol->nodes;
-#endif
-	return NULL;
-}
-
 static unsigned int allowed_mems_nr(struct hstate *h)
 {
 	int node;