From: Ackerley Tng
To: Peter Xu
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, leitao@debian.org,
    riel@surriel.com, muchun.song@linux.dev, nao.horiguchi@gmail.com,
    roman.gushchin@linux.dev, akpm@linux-foundation.org, peterx@redhat.com,
    osalvador@suse.de
Subject: Re: [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
Date: Mon, 13 Jan 2025 22:57:55 +0000
In-Reply-To: <20250107204002.2683356-5-peterx@redhat.com> (message from Peter Xu on Tue, 7 Jan 2025 15:39:59 -0500)

Hi Peter,

I have an alternative that is based off your patches. I like the overall
refactoring, but I was a little uncomfortable with having to add a custom enum
map_chg_state. The custom enum centralizes the possible states, but the states
still need to be interpreted throughout the function to take action (e.g. with
checks like if (map_chg != MAP_CHG_ENFORCED) and if (map_chg == MAP_CHG_NEEDED)),
and I feel that just shifts the problem to understanding how those states are
interpreted.

I switched back to something close to avoid_reserve, but improved on that name
based on your comments, calling it bypass_vma_reservations. I feel that calling
it cow_from_owner lets an implementation detail of the CoW use case bleed into
this function. "bypass_vma_reservations" is named so that it requests that this
function, alloc_hugetlb_folio(), bypass the resv_map: all parts of
alloc_hugetlb_folio() that update the resv_map are guarded by the
bypass_vma_reservations flag.

This alternative proposes the following booleans, local to alloc_hugetlb_folio()
(a condensed sketch of how they relate follows the list):

1. vma_reservation_exists

   vma_reservation_exists represents whether a reservation exists in the
   resv_map and is used. It defaults to false; when bypass_vma_reservations is
   not requested, the resv_map is consulted to see if a vma reservation exists.
   vma_reservation_exists also stores the result of the initial resv_map check,
   to correctly fix up a race later on.

2. debit_subpool

   If vma_reservation_exists, this allocation has already been reserved in both
   the resv_map and the subpool, so debiting the subpool is skipped. If
   alloc_hugetlb_folio() was requested to bypass_vma_reservations, the subpool
   must still be charged, so debit_subpool is set. If debit_subpool is set, we
   proceed to call hugepage_subpool_get_pages() and set up
   subpool_reservation_exists. Later on, debit_subpool selects the matching
   cleanup in the error path.

3. subpool_reservation_exists

   subpool_reservation_exists represents whether a reservation exists in the
   subpool and is used, analogous to vma_reservation_exists; the subpool is
   only checked if debit_subpool is set. If debit_subpool is not set,
   vma_reservation_exists determines whether a subpool reservation exists.
   subpool_reservation_exists then guards decrementing h->resv_huge_pages,
   which fixes the bug you found.

4. charge_cgroup_rsvd

   This has the same condition as debit_subpool, but uses a separate variable
   for readability. Later on, charge_cgroup_rsvd determines whether to commit
   the charge, or whether to undo the charge in error cases.
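Purely as a reading aid (not part of the patch), here is a condensed userspace
model of how the four booleans are derived. derive_flags(), vma_resv_ret and
subpool_ret are made-up stand-ins for the return values of
vma_needs_reservation() and hugepage_subpool_get_pages(); the real wiring is in
the diff below.

#include <stdbool.h>
#include <stdio.h>

struct alloc_flags {
	bool vma_reservation_exists;
	bool debit_subpool;
	bool subpool_reservation_exists;
	bool charge_cgroup_rsvd;
};

/*
 * vma_resv_ret models vma_needs_reservation(): 0 means the resv_map already
 * holds a reservation, >0 means a new one is needed. subpool_ret models
 * hugepage_subpool_get_pages(): 0 means the subpool already holds a
 * reservation, >0 means one must come from the global pool.
 */
static struct alloc_flags derive_flags(bool bypass_vma_reservations,
				       int vma_resv_ret, int subpool_ret)
{
	struct alloc_flags f = {0};

	/* Only consult the resv_map when not bypassing vma reservations. */
	if (!bypass_vma_reservations)
		f.vma_reservation_exists = (vma_resv_ret == 0);

	/* Debit the subpool unless an existing vma reservation covers us. */
	f.debit_subpool = !f.vma_reservation_exists || bypass_vma_reservations;

	if (f.debit_subpool)
		f.subpool_reservation_exists = (subpool_ret == 0);
	else
		/* A vma reservation implies a subpool reservation. */
		f.subpool_reservation_exists = f.vma_reservation_exists;

	/* Same condition as debit_subpool, kept separate for readability. */
	f.charge_cgroup_rsvd = !f.vma_reservation_exists || bypass_vma_reservations;

	return f;
}

int main(void)
{
	/* CoW-from-owner case: bypass vma reservations entirely. */
	struct alloc_flags f = derive_flags(true, 0, 1);

	printf("debit_subpool=%d subpool_resv=%d charge_cgroup_rsvd=%d\n",
	       f.debit_subpool, f.subpool_reservation_exists,
	       f.charge_cgroup_rsvd);
	return 0;
}

In this model, bypass_vma_reservations == true always debits the subpool and
charges the cgroup's reserved quota while never touching the resv_map, which is
the intended behaviour for the CoW-from-owner case.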
I also refactored dequeue_hugetlb_folio_vma() into
dequeue_hugetlb_folio_with_mpol(), to align with
alloc_buddy_hugetlb_folio_with_mpol() and to avoid passing gbl_chg into the
function; gbl_chg is interpreted in alloc_hugetlb_folio() instead. If
subpool_reservation_exists, try to get a folio by dequeueing it. If a subpool
reservation does not exist, make sure there are available pages before
dequeueing. If there was no folio from dequeueing for whatever reason, allocate
a new folio.

This could probably be a separate patch, but I'd like to hear your thoughts
before doing integration/cleaning up.

These changes have been tested with your reproducer, and here's the test output
from the libhugetlbfs test cases:

********** TEST SUMMARY
*                      2M
*                      32-bit 64-bit
*     Total testcases:    82     85
*             Skipped:     9      9
*                PASS:    72     69
*                FAIL:     0      0
*    Killed by signal:     1      7
*   Bad configuration:     0      0
*       Expected FAIL:     0      0
*     Unexpected PASS:     0      0
*    Test not present:     0      0
* Strange test result:     0      0
**********

Ackerley

---
 mm/hugetlb.c | 186 ++++++++++++++++++++++++++-------------------
 1 file changed, 94 insertions(+), 92 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a0ea28f5bac..2cd588d35984 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1333,9 +1333,9 @@ static unsigned long available_huge_pages(struct hstate *h)
 	return h->free_huge_pages - h->resv_huge_pages;
 }
 
-static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
-				struct vm_area_struct *vma,
-				unsigned long address, long gbl_chg)
+static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
+				struct vm_area_struct *vma,
+				unsigned long address)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1343,13 +1343,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	nodemask_t *nodemask;
 	int nid;
 
-	/*
-	 * gbl_chg==1 means the allocation requires a new page that was not
-	 * reserved before. Making sure there's at least one free page.
-	 */
-	if (gbl_chg && !available_huge_pages(h))
-		goto err;
-
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 
@@ -1367,9 +1360,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 
 	mpol_cond_put(mpol);
 	return folio;
-
-err:
-	return NULL;
 }
 
 /*
@@ -2943,91 +2933,83 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
 	return ret;
 }
 
-typedef enum {
-	/*
-	 * For either 0/1: we checked the per-vma resv map, and one resv
-	 * count either can be reused (0), or an extra needed (1).
-	 */
-	MAP_CHG_REUSE = 0,
-	MAP_CHG_NEEDED = 1,
-	/*
-	 * Cannot use per-vma resv count can be used, hence a new resv
-	 * count is enforced.
-	 *
-	 * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
-	 * that currently vma_needs_reservation() has an unwanted side
-	 * effect to either use end() or commit() to complete the
-	 * transaction. Hence it needs to differenciate from NEEDED.
-	 */
-	MAP_CHG_ENFORCED = 2,
-} map_chg_state;
-
 /*
- * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
- * faults of hugetlb private mappings on top of a non-page-cache folio (in
- * which case even if there's a private vma resv map it won't cover such
- * allocation). New call sites should (probably) never set it to true!!
- * When it's set, the allocation will bypass all vma level reservations.
+ * NOTE! "bypass_vma_reservations" represents a very niche usage, when CoW
+ * faults of hugetlb private mappings need to allocate a new page on top of a
+ * non-page-cache folio. In this situation, even if there's a private vma resv
+ * map, the resv map must be bypassed. New call sites should (probably) never
+ * set bypass_vma_reservations to true!!
  */
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
-				  unsigned long addr, bool cow_from_owner)
+				  unsigned long addr, bool bypass_vma_reservations)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *folio;
-	long retval, gbl_chg;
-	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
 	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	bool vma_reservation_exists = false;
+	bool subpool_reservation_exists;
+	bool debit_subpool;
+	bool charge_cgroup_rsvd;
 
 	idx = hstate_index(h);
 
-	/* Whether we need a separate per-vma reservation? */
-	if (cow_from_owner) {
-		/*
-		 * Special case! Since it's a CoW on top of a reserved
-		 * page, the private resv map doesn't count. So it cannot
-		 * consume the per-vma resv map even if it's reserved.
-		 */
-		map_chg = MAP_CHG_ENFORCED;
-	} else {
+	if (!bypass_vma_reservations) {
 		/*
 		 * Examine the region/reserve map to determine if the process
-		 * has a reservation for the page to be allocated. A return
-		 * code of zero indicates a reservation exists (no change).
-		 */
-		retval = vma_needs_reservation(h, vma, addr);
-		if (retval < 0)
+		 * has a reservation for the page to be allocated and debit the
+		 * reservation. If npages_req == 0, a reservation exists and is
+		 * used. If npages_req > 0, a reservation has to be taken either
+		 * from the subpool or global pool.
+		 */
+		int npages_req = vma_needs_reservation(h, vma, addr);
+		if (npages_req < 0)
 			return ERR_PTR(-ENOMEM);
-		map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
+
+		vma_reservation_exists = npages_req == 0;
 	}
 
 	/*
-	 * Whether we need a separate global reservation?
+	 * If no vma reservation exists, debit the subpool.
 	 *
+	 * Even if we were requested to bypass_vma_reservations, debit the
+	 * subpool - the subpool still has to be charged for this allocation.
+	 */
+	debit_subpool = !vma_reservation_exists || bypass_vma_reservations;
+
+	/*
 	 * Processes that did not create the mapping will have no
 	 * reserves as indicated by the region/reserve map. Check
 	 * that the allocation will not exceed the subpool limit.
 	 * Or if it can get one from the pool reservation directly.
 	 */
-	if (map_chg) {
-		gbl_chg = hugepage_subpool_get_pages(spool, 1);
-		if (gbl_chg < 0)
+	if (debit_subpool) {
+		int npages_req = hugepage_subpool_get_pages(spool, 1);
+		if (npages_req < 0)
 			goto out_end_reservation;
-	} else {
+
 		/*
-		 * If we have the vma reservation ready, no need for extra
-		 * global reservation.
-		 */
-		gbl_chg = 0;
+		 * npages_req == 0 indicates a reservation exists for the
+		 * allocation in the subpool and can be used. npages_req > 0
+		 * indicates that a reservation must be taken from the global
+		 * pool.
+		 */
+		subpool_reservation_exists = npages_req == 0;
+	} else {
+		/* A vma reservation implies having a subpool reservation. */
+		subpool_reservation_exists = vma_reservation_exists;
 	}
 
 	/*
-	 * If this allocation is not consuming a per-vma reservation,
-	 * charge the hugetlb cgroup now.
+	 * If no vma reservation exists, charge the cgroup's reserved quota.
+	 *
+	 * Even if we were requested to bypass_vma_reservations, the cgroup
+	 * still has to be charged for this allocation.
 	 */
-	if (map_chg) {
+	charge_cgroup_rsvd = !vma_reservation_exists || bypass_vma_reservations;
+	if (charge_cgroup_rsvd) {
 		ret = hugetlb_cgroup_charge_cgroup_rsvd(
 			idx, pages_per_huge_page(h), &h_cg);
 		if (ret)
@@ -3039,12 +3021,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		goto out_uncharge_cgroup_reservation;
 
 	spin_lock_irq(&hugetlb_lock);
-	/*
-	 * glb_chg is passed to indicate whether or not a page must be taken
-	 * from the global free pool (global change). gbl_chg == 0 indicates
-	 * a reservation exists for the allocation.
-	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
+
+	if (subpool_reservation_exists) {
+		folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+	} else {
+		/*
+		 * Since no subpool_reservation_exists, the allocation requires
+		 * a new page that was not reserved before. Only dequeue if
+		 * there are available pages.
+		 */
+		if (available_huge_pages(h)) {
+			folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+		} else {
+			folio = NULL;
+			/* Fallthrough to allocate a new page. */
+		}
+	}
+
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
@@ -3057,19 +3050,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	}
 
 	/*
-	 * Either dequeued or buddy-allocated folio needs to add special
-	 * mark to the folio when it consumes a global reservation.
+	 * If subpool_reservation_exists (and is used for this allocation),
+	 * decrement resv_huge_pages to indicate that a reservation was used.
 	 */
-	if (!gbl_chg) {
+	if (subpool_reservation_exists) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
 
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
-	/* If allocation is not consuming a reservation, also store the
-	 * hugetlb_cgroup pointer on the page.
-	 */
-	if (map_chg) {
+
+	if (charge_cgroup_rsvd) {
 		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
 						  h_cg, folio);
 	}
@@ -3078,25 +3069,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	hugetlb_set_folio_subpool(folio, spool);
 
-	if (map_chg != MAP_CHG_ENFORCED) {
-		/* commit() is only needed if the map_chg is not enforced */
-		retval = vma_commit_reservation(h, vma, addr);
+	if (!bypass_vma_reservations) {
+		/*
+		 * As long as vma reservations were not bypassed, we need to
+		 * commit() to clear up any adds_in_progress in resv_map.
+		 */
+		int ret = vma_commit_reservation(h, vma, addr);
 		/*
-		 * Check for possible race conditions. When it happens..
-		 * The page was added to the reservation map between
-		 * vma_needs_reservation and vma_commit_reservation.
-		 * This indicates a race with hugetlb_reserve_pages.
+		 * If there is a discrepancy in reservation status between the
+		 * time of vma_needs_reservation() and vma_commit_reservation(),
+		 * then the page must have been added to the reservation
+		 * map between vma_needs_reservation() and
+		 * vma_commit_reservation().
+		 *
 		 * Adjust for the subpool count incremented above AND
 		 * in hugetlb_reserve_pages for the same page. Also,
 		 * the reservation count added in hugetlb_reserve_pages
 		 * no longer applies.
 		 */
-		if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+		if (unlikely(!vma_reservation_exists && ret == 0)) {
 			long rsv_adjust;
 
 			rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 			hugetlb_acct_memory(h, -rsv_adjust);
-			if (map_chg) {
+			if (charge_cgroup_rsvd) {
 				spin_lock_irq(&hugetlb_lock);
 				hugetlb_cgroup_uncharge_folio_rsvd(
 						hstate_index(h), pages_per_huge_page(h),
@@ -3124,14 +3120,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_uncharge_cgroup_reservation:
-	if (map_chg)
+	if (charge_cgroup_rsvd)
 		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
 						    h_cg);
 out_subpool_put:
-	if (map_chg)
+	if (debit_subpool)
 		hugepage_subpool_put_pages(spool, 1);
 out_end_reservation:
-	if (map_chg != MAP_CHG_ENFORCED)
+	if (!bypass_vma_reservations)
 		vma_end_reservation(h, vma, addr);
 	return ERR_PTR(-ENOSPC);
 }
@@ -5900,6 +5896,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	 * be acquired again before returning to the caller, as expected.
 	 */
 	spin_unlock(vmf->ptl);
+
+	/*
+	 * If this is a CoW from the owner of this page, we
+	 * bypass_vma_reservations, since the reservation was already consumed
+	 * when the hugetlb folio was first allocated before the fork happened.
+	 */
 	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
 
 	if (IS_ERR(new_folio)) {
-- 
2.47.1.688.g23fc6f90ad-goog