From: Ackerley Tng
To: Peter Xu
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, leitao@debian.org,
    riel@surriel.com, muchun.song@linux.dev, nao.horiguchi@gmail.com,
    roman.gushchin@linux.dev, akpm@linux-foundation.org, peterx@redhat.com,
    osalvador@suse.de
Subject: Re: [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
Date: Mon, 13 Jan 2025 22:57:55 +0000
In-Reply-To: <20250107204002.2683356-5-peterx@redhat.com> (message from Peter Xu on Tue, 7 Jan 2025 15:39:59 -0500)

Hi Peter,

I have an alternative that is based off your patches. I like the overall
refactoring, but I was a little uncomfortable with having to add a custom enum
map_chg_state. The custom enum centralizes the possible states, but the states
still need to be interpreted throughout the function to take action (e.g. with
checks like if (map_chg != MAP_CHG_ENFORCED) and if (map_chg == MAP_CHG_NEEDED)),
and I feel that just shifts the problem to understanding how those states are
interpreted.

I switched back to something close to avoid_reserve, but improved on that name
based on your comments, calling it bypass_vma_reservations. I feel that calling
it cow_from_owner lets an implementation detail of the CoW use case bleed into
this function. "bypass_vma_reservations" is named so that it requests that this
function, alloc_hugetlb_folio(), bypass the resv_map: all parts of
alloc_hugetlb_folio() that update the resv_map are guarded by the
bypass_vma_reservations flag.

This alternative proposes the following booleans, local to alloc_hugetlb_folio()
(a condensed sketch of how they relate follows the list):

1. vma_reservation_exists

   vma_reservation_exists represents whether a reservation exists in the
   resv_map and is used. It defaults to false; when bypass_vma_reservations is
   not requested, the resv_map is consulted to see if a vma reservation exists.
   vma_reservation_exists also stores the result of the initial resv_map check,
   to correctly fix up a race later on.

2. debit_subpool

   If vma_reservation_exists, this allocation has already been reserved in both
   the resv_map and the subpool, so debiting the subpool is skipped. If
   alloc_hugetlb_folio() was requested to bypass_vma_reservations, the subpool
   must still be charged, so debit_subpool is set. If debit_subpool is set, we
   proceed to call hugepage_subpool_get_pages() and set up
   subpool_reservation_exists. Later on, debit_subpool selects the matching
   cleanup in the error path.

3. subpool_reservation_exists

   subpool_reservation_exists represents whether a reservation exists in the
   subpool and is used, analogous to vma_reservation_exists; the subpool is
   only checked if debit_subpool is set. If debit_subpool is not set,
   vma_reservation_exists determines whether a subpool reservation exists.
   subpool_reservation_exists then guards decrementing h->resv_huge_pages,
   which fixes the bug you found.

4. charge_cgroup_rsvd

   This has the same condition as debit_subpool, but uses a separate variable
   for readability. Later on, charge_cgroup_rsvd determines whether to commit
   the charge, or whether to undo the charge in error cases.
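Purely as a reading aid (not part of the patch), here is a condensed userspace
model of how the four booleans are derived. derive_flags(), vma_resv_ret and
subpool_ret are made-up stand-ins for the return values of
vma_needs_reservation() and hugepage_subpool_get_pages(); the real wiring is in
the diff below.

#include <stdbool.h>
#include <stdio.h>

struct alloc_flags {
	bool vma_reservation_exists;
	bool debit_subpool;
	bool subpool_reservation_exists;
	bool charge_cgroup_rsvd;
};

/*
 * vma_resv_ret models vma_needs_reservation(): 0 means the resv_map already
 * holds a reservation, >0 means a new one is needed. subpool_ret models
 * hugepage_subpool_get_pages(): 0 means the subpool already holds a
 * reservation, >0 means one must come from the global pool.
 */
static struct alloc_flags derive_flags(bool bypass_vma_reservations,
				       int vma_resv_ret, int subpool_ret)
{
	struct alloc_flags f = {0};

	/* Only consult the resv_map when not bypassing vma reservations. */
	if (!bypass_vma_reservations)
		f.vma_reservation_exists = (vma_resv_ret == 0);

	/* Debit the subpool unless an existing vma reservation covers us. */
	f.debit_subpool = !f.vma_reservation_exists || bypass_vma_reservations;

	if (f.debit_subpool)
		f.subpool_reservation_exists = (subpool_ret == 0);
	else
		/* A vma reservation implies a subpool reservation. */
		f.subpool_reservation_exists = f.vma_reservation_exists;

	/* Same condition as debit_subpool, kept separate for readability. */
	f.charge_cgroup_rsvd = !f.vma_reservation_exists || bypass_vma_reservations;

	return f;
}

int main(void)
{
	/* CoW-from-owner case: bypass vma reservations entirely. */
	struct alloc_flags f = derive_flags(true, 0, 1);

	printf("debit_subpool=%d subpool_resv=%d charge_cgroup_rsvd=%d\n",
	       f.debit_subpool, f.subpool_reservation_exists,
	       f.charge_cgroup_rsvd);
	return 0;
}

In this model, bypass_vma_reservations == true always debits the subpool and
charges the cgroup's reserved quota while never touching the resv_map, which is
the intended behaviour for the CoW-from-owner case.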
I also refactored dequeue_hugetlb_folio_vma() into
dequeue_hugetlb_folio_with_mpol(), to align with
alloc_buddy_hugetlb_folio_with_mpol() and to avoid passing gbl_chg into the
function; gbl_chg is interpreted in alloc_hugetlb_folio() instead. If
subpool_reservation_exists, try to get a folio by dequeueing it. If a subpool
reservation does not exist, make sure there are available pages before
dequeueing. If there was no folio from dequeueing for whatever reason, allocate
a new folio.

This could probably be a separate patch, but I'd like to hear your thoughts
before doing integration/cleaning up.

These changes have been tested with your reproducer, and here's the test output
from the libhugetlbfs test cases:

********** TEST SUMMARY
*                      2M
*                      32-bit 64-bit
*     Total testcases:    82     85
*             Skipped:     9      9
*                PASS:    72     69
*                FAIL:     0      0
*    Killed by signal:     1      7
*   Bad configuration:     0      0
*       Expected FAIL:     0      0
*     Unexpected PASS:     0      0
*    Test not present:     0      0
* Strange test result:     0      0
**********

Ackerley

---
 mm/hugetlb.c | 186 ++++++++++++++++++++++++++-------------------
 1 file changed, 94 insertions(+), 92 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a0ea28f5bac..2cd588d35984 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1333,9 +1333,9 @@ static unsigned long available_huge_pages(struct hstate *h)
 	return h->free_huge_pages - h->resv_huge_pages;
 }
 
-static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
-				struct vm_area_struct *vma,
-				unsigned long address, long gbl_chg)
+static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
+				struct vm_area_struct *vma,
+				unsigned long address)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1343,13 +1343,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	nodemask_t *nodemask;
 	int nid;
 
-	/*
-	 * gbl_chg==1 means the allocation requires a new page that was not
-	 * reserved before. Making sure there's at least one free page.
-	 */
-	if (gbl_chg && !available_huge_pages(h))
-		goto err;
-
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 
@@ -1367,9 +1360,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 
 	mpol_cond_put(mpol);
 	return folio;
-
-err:
-	return NULL;
 }
 
 /*
@@ -2943,91 +2933,83 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
 	return ret;
 }
 
-typedef enum {
-	/*
-	 * For either 0/1: we checked the per-vma resv map, and one resv
-	 * count either can be reused (0), or an extra needed (1).
-	 */
-	MAP_CHG_REUSE = 0,
-	MAP_CHG_NEEDED = 1,
-	/*
-	 * Cannot use per-vma resv count can be used, hence a new resv
-	 * count is enforced.
-	 *
-	 * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
-	 * that currently vma_needs_reservation() has an unwanted side
-	 * effect to either use end() or commit() to complete the
-	 * transaction. Hence it needs to differenciate from NEEDED.
-	 */
-	MAP_CHG_ENFORCED = 2,
-} map_chg_state;
-
 /*
- * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
- * faults of hugetlb private mappings on top of a non-page-cache folio (in
- * which case even if there's a private vma resv map it won't cover such
- * allocation). New call sites should (probably) never set it to true!!
- * When it's set, the allocation will bypass all vma level reservations.
+ * NOTE! "bypass_vma_reservations" represents a very niche usage, when CoW
+ * faults of hugetlb private mappings need to allocate a new page on top of a
+ * non-page-cache folio. In this situation, even if there's a private vma resv
+ * map, the resv map must be bypassed. New call sites should (probably) never
+ * set bypass_vma_reservations to true!!
  */
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
-				  unsigned long addr, bool cow_from_owner)
+				  unsigned long addr, bool bypass_vma_reservations)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *folio;
-	long retval, gbl_chg;
-	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
 	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	bool vma_reservation_exists = false;
+	bool subpool_reservation_exists;
+	bool debit_subpool;
+	bool charge_cgroup_rsvd;
 
 	idx = hstate_index(h);
 
-	/* Whether we need a separate per-vma reservation? */
-	if (cow_from_owner) {
-		/*
-		 * Special case! Since it's a CoW on top of a reserved
-		 * page, the private resv map doesn't count. So it cannot
-		 * consume the per-vma resv map even if it's reserved.
-		 */
-		map_chg = MAP_CHG_ENFORCED;
-	} else {
+	if (!bypass_vma_reservations) {
 		/*
 		 * Examine the region/reserve map to determine if the process
-		 * has a reservation for the page to be allocated. A return
-		 * code of zero indicates a reservation exists (no change).
-		 */
-		retval = vma_needs_reservation(h, vma, addr);
-		if (retval < 0)
+		 * has a reservation for the page to be allocated and debit the
+		 * reservation. If npages_req == 0, a reservation exists and is
+		 * used. If npages_req > 0, a reservation has to be taken either
+		 * from the subpool or global pool.
+		 */
+		int npages_req = vma_needs_reservation(h, vma, addr);
+		if (npages_req < 0)
 			return ERR_PTR(-ENOMEM);
-		map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
+
+		vma_reservation_exists = npages_req == 0;
 	}
 
 	/*
-	 * Whether we need a separate global reservation?
+	 * If no vma reservation exists, debit the subpool.
 	 *
+	 * Even if we were requested to bypass_vma_reservations, debit the
+	 * subpool - the subpool still has to be charged for this allocation.
+	 */
+	debit_subpool = !vma_reservation_exists || bypass_vma_reservations;
+
+	/*
 	 * Processes that did not create the mapping will have no
 	 * reserves as indicated by the region/reserve map. Check
 	 * that the allocation will not exceed the subpool limit.
 	 * Or if it can get one from the pool reservation directly.
 	 */
-	if (map_chg) {
-		gbl_chg = hugepage_subpool_get_pages(spool, 1);
-		if (gbl_chg < 0)
+	if (debit_subpool) {
+		int npages_req = hugepage_subpool_get_pages(spool, 1);
+		if (npages_req < 0)
 			goto out_end_reservation;
-	} else {
+
 		/*
-		 * If we have the vma reservation ready, no need for extra
-		 * global reservation.
-		 */
-		gbl_chg = 0;
+		 * npages_req == 0 indicates a reservation exists for the
+		 * allocation in the subpool and can be used. npages_req > 0
+		 * indicates that a reservation must be taken from the global
+		 * pool.
+		 */
+		subpool_reservation_exists = npages_req == 0;
+	} else {
+		/* A vma reservation implies having a subpool reservation. */
+		subpool_reservation_exists = vma_reservation_exists;
 	}
 
 	/*
-	 * If this allocation is not consuming a per-vma reservation,
-	 * charge the hugetlb cgroup now.
+	 * If no vma reservation exists, charge the cgroup's reserved quota.
+	 *
+	 * Even if we were requested to bypass_vma_reservations, the cgroup
+	 * still has to be charged for this allocation.
 	 */
-	if (map_chg) {
+	charge_cgroup_rsvd = !vma_reservation_exists || bypass_vma_reservations;
+	if (charge_cgroup_rsvd) {
 		ret = hugetlb_cgroup_charge_cgroup_rsvd(
 			idx, pages_per_huge_page(h), &h_cg);
 		if (ret)
@@ -3039,12 +3021,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		goto out_uncharge_cgroup_reservation;
 
 	spin_lock_irq(&hugetlb_lock);
-	/*
-	 * glb_chg is passed to indicate whether or not a page must be taken
-	 * from the global free pool (global change). gbl_chg == 0 indicates
-	 * a reservation exists for the allocation.
-	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
+
+	if (subpool_reservation_exists) {
+		folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+	} else {
+		/*
+		 * Since no subpool_reservation_exists, the allocation requires
+		 * a new page that was not reserved before. Only dequeue if
+		 * there are available pages.
+		 */
+		if (available_huge_pages(h)) {
+			folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+		} else {
+			folio = NULL;
+			/* Fallthrough to allocate a new page. */
+		}
+	}
+
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
@@ -3057,19 +3050,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	}
 
 	/*
-	 * Either dequeued or buddy-allocated folio needs to add special
-	 * mark to the folio when it consumes a global reservation.
+	 * If subpool_reservation_exists (and is used for this allocation),
+	 * decrement resv_huge_pages to indicate that a reservation was used.
 	 */
-	if (!gbl_chg) {
+	if (subpool_reservation_exists) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
 
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
-	/* If allocation is not consuming a reservation, also store the
-	 * hugetlb_cgroup pointer on the page.
-	 */
-	if (map_chg) {
+
+	if (charge_cgroup_rsvd) {
 		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
 						  h_cg, folio);
 	}
@@ -3078,25 +3069,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	hugetlb_set_folio_subpool(folio, spool);
 
-	if (map_chg != MAP_CHG_ENFORCED) {
-		/* commit() is only needed if the map_chg is not enforced */
-		retval = vma_commit_reservation(h, vma, addr);
+	if (!bypass_vma_reservations) {
+		/*
+		 * As long as vma reservations were not bypassed, we need to
+		 * commit() to clear up any adds_in_progress in resv_map.
+		 */
+		int ret = vma_commit_reservation(h, vma, addr);
 		/*
-		 * Check for possible race conditions. When it happens..
-		 * The page was added to the reservation map between
-		 * vma_needs_reservation and vma_commit_reservation.
-		 * This indicates a race with hugetlb_reserve_pages.
+		 * If there is a discrepancy in reservation status between the
+		 * time of vma_needs_reservation() and vma_commit_reservation(),
+		 * then the page must have been added to the reservation
+		 * map between vma_needs_reservation() and
+		 * vma_commit_reservation().
+		 *
 		 * Adjust for the subpool count incremented above AND
 		 * in hugetlb_reserve_pages for the same page. Also,
 		 * the reservation count added in hugetlb_reserve_pages
 		 * no longer applies.
 		 */
-		if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+		if (unlikely(!vma_reservation_exists && ret == 0)) {
 			long rsv_adjust;
 
 			rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 			hugetlb_acct_memory(h, -rsv_adjust);
-			if (map_chg) {
+			if (charge_cgroup_rsvd) {
 				spin_lock_irq(&hugetlb_lock);
 				hugetlb_cgroup_uncharge_folio_rsvd(
 						hstate_index(h), pages_per_huge_page(h),
@@ -3124,14 +3120,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_uncharge_cgroup_reservation:
-	if (map_chg)
+	if (charge_cgroup_rsvd)
 		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
 						    h_cg);
 out_subpool_put:
-	if (map_chg)
+	if (debit_subpool)
 		hugepage_subpool_put_pages(spool, 1);
 out_end_reservation:
-	if (map_chg != MAP_CHG_ENFORCED)
+	if (!bypass_vma_reservations)
 		vma_end_reservation(h, vma, addr);
 	return ERR_PTR(-ENOSPC);
 }
@@ -5900,6 +5896,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	 * be acquired again before returning to the caller, as expected.
 	 */
 	spin_unlock(vmf->ptl);
+
+	/*
+	 * If this is a CoW from the owner of this page, we
+	 * bypass_vma_reservations, since the reservation was already consumed
+	 * when the hugetlb folio was first allocated before the fork happened.
+	 */
 	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
 
 	if (IS_ERR(new_folio)) {
-- 
2.47.1.688.g23fc6f90ad-goog