From: Andrey Alekhin
To: muchun.song@linux.dev
Cc: osalvador@suse.de, linux-mm@kvack.org, Andrey Alekhin
Subject: [PATCH] mm: free surplus huge pages properly on NUMA systems
Date: Thu, 15 May 2025 22:13:27 +0300
Message-ID: <20250515191327.41089-1-andrei.aleohin@gmail.com>

== History ==

Wrong values of the huge page counters were detected on Red Hat 9.0
(Linux 5.14) when running the LTP test hugemmap10. Inspection of the Linux
source code showed that the problem is still present in Linux 6.14.

== Problem ==

free_huge_folio() does not properly free surplus huge pages on NUMA systems.
It checks the surplus huge page counter only on the current node (the node
where the folio is allocated), but gather_surplus_pages() can allocate
surplus huge pages on any node.

The following sequence is possible on a NUMA system:

n - overall number of huge pages
f - number of free huge pages
s - number of surplus huge pages
r - number of reserved huge pages (system-wide)

The huge page counters are shown before and after each step, for node0
first and node1 second. The process runs on node #1.

1) addr1 = mmap(MAP_SHARED, ...) // 1 huge page is mmapped (cur_nid=1)
   before: [n=2 f=2 s=0] [n=1 f=1 s=0] r=0
   after:  [n=2 f=2 s=0] [n=1 f=1 s=0] r=1

2) echo 1 > /proc/sys/vm/nr_hugepages (cur_nid=1)
   before: [n=2 f=2 s=0] [n=1 f=1 s=0] r=1
   after:  [n=0 f=0 s=0] [n=1 f=1 s=0] r=1

3) addr2 = mmap(MAP_SHARED, ...) // 1 huge page is mmapped (cur_nid=1)
   before: [n=0 f=0 s=0] [n=1 f=1 s=0] r=1
   after:  [n=1 f=1 s=1] [n=1 f=1 s=0] r=2

   The new surplus huge page is reserved on node0, not on node1. In
   Linux 6.14 this is unlikely, but it is possible and legal.

4) write to the second page (touch)
   before: [n=1 f=1 s=1] [n=1 f=1 s=0] r=2
   after:  [n=1 f=1 s=1] [n=1 f=0 s=0] r=1

   The reserved page is mapped on node1.

5) munmap(addr2) // 1 huge page is unmapped
   before: [n=1 f=1 s=1] [n=1 f=0 s=0] r=1
   after:  [n=1 f=1 s=1] [n=1 f=1 s=0] r=1

   The huge page is freed, but it is not freed as a surplus page.

The huge page counters in the system are now:
  [nr_hugepages=2 free_huge_pages=2 surplus_hugepages=1]
But they must be:
  [nr_hugepages=1 free_huge_pages=1 surplus_hugepages=0]

== Solution ==

Check the surplus huge page counters on all available nodes when a page is
freed in free_huge_folio(). This check ensures that surplus huge pages are
freed correctly whenever any are present in the system.

Signed-off-by: Andrey Alekhin

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6ea1be71aa42..2d38d12f4943 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1822,6 +1822,23 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }

+static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
+{
+#ifdef CONFIG_NUMA
+	struct mempolicy *mpol = get_task_policy(current);
+
+	/*
+	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
+	 * (from policy_nodemask) specifically for hugetlb case
+	 */
+	if (mpol->mode == MPOL_BIND &&
+		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
+		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+		return &mpol->nodes;
+#endif
+	return NULL;
+}
+
 void free_huge_folio(struct folio *folio)
 {
 	/*
@@ -1833,6 +1850,8 @@ void free_huge_folio(struct folio *folio)
 	struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
 	bool restore_reserve;
 	unsigned long flags;
+	int node;
+	nodemask_t *mbind_nodemask, alloc_nodemask;

 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
 	VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
@@ -1883,6 +1902,25 @@ void free_huge_folio(struct folio *folio)
 		remove_hugetlb_folio(h, folio, true);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_hugetlb_folio(h, folio, true);
+	} else if (h->surplus_huge_pages) {
+		mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
+		if (mbind_nodemask)
+			nodes_and(alloc_nodemask, *mbind_nodemask,
+					cpuset_current_mems_allowed);
+		else
+			alloc_nodemask = cpuset_current_mems_allowed;
+
+		for_each_node_mask(node, alloc_nodemask) {
+			if (h->surplus_huge_pages_node[node]) {
+				h->surplus_huge_pages_node[node]--;
+				h->surplus_huge_pages--;
+				break;
+			}
+		}
+
+		remove_hugetlb_folio(h, folio, false);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
+		update_and_free_hugetlb_folio(h, folio, true);
 	} else {
 		arch_clear_hugetlb_flags(folio);
 		enqueue_hugetlb_folio(h, folio);
@@ -2389,23 +2427,6 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 	return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask);
 }

-static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
-{
-#ifdef CONFIG_NUMA
-	struct mempolicy *mpol = get_task_policy(current);
-
-	/*
-	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
-	 * (from policy_nodemask) specifically for hugetlb case
-	 */
-	if (mpol->mode == MPOL_BIND &&
-		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
-		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
-		return &mpol->nodes;
-#endif
-	return NULL;
-}
-
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
-- 
2.43.0
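
For illustration only, the accounting in step 5 of the example can be reproduced with a small userspace model. This is a sketch, not kernel code: struct model_hstate, model_free_page() and model_step5() are hypothetical names invented here, and the model assumes two nodes and scans every node, whereas the patch restricts the scan to the nodes allowed by the mempolicy and cpuset.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_NODES 2

/* Hypothetical userspace model of the hugetlb counters; not kernel code. */
struct model_hstate {
	unsigned long nr[MODEL_NR_NODES];      /* n: huge pages per node */
	unsigned long nr_free[MODEL_NR_NODES]; /* f: free huge pages per node */
	unsigned long surplus[MODEL_NR_NODES]; /* s: surplus huge pages per node */
};

struct model_totals {
	unsigned long nr, nr_free, surplus;
};

/* Sum the per-node counters into the system-wide values. */
static struct model_totals model_sum(const struct model_hstate *h)
{
	struct model_totals t = { 0, 0, 0 };

	for (int i = 0; i < MODEL_NR_NODES; i++) {
		t.nr += h->nr[i];
		t.nr_free += h->nr_free[i];
		t.surplus += h->surplus[i];
	}
	return t;
}

/*
 * Free an in-use huge page that lives on node nid.
 *
 * scan_all_nodes == false: look for surplus on nid only (the behaviour
 * described under "Problem").
 * scan_all_nodes == true: take the surplus from the first node that has
 * any (a simplified version of the proposed fix).
 */
static void model_free_page(struct model_hstate *h, int nid, bool scan_all_nodes)
{
	int victim = -1;

	assert(nid >= 0 && nid < MODEL_NR_NODES);

	if (scan_all_nodes) {
		for (int i = 0; i < MODEL_NR_NODES; i++) {
			if (h->surplus[i]) {
				victim = i;
				break;
			}
		}
	} else if (h->surplus[nid]) {
		victim = nid;
	}

	if (victim >= 0) {
		/* Surplus found: drop it and give the page back to the system. */
		h->surplus[victim]--;
		h->nr[nid]--;
	} else {
		/* No surplus seen: the page goes back to the free pool. */
		h->nr_free[nid]++;
	}
}

/* Start from the state before step 5 and free the page mapped on node1. */
static struct model_totals model_step5(bool scan_all_nodes)
{
	struct model_hstate h = {
		.nr      = { 1, 1 },
		.nr_free = { 1, 0 },
		.surplus = { 1, 0 },
	};

	model_free_page(&h, 1, scan_all_nodes);
	return model_sum(&h);
}
```

In this model, model_step5(false) yields the totals reported as wrong in the example (2 pages, 2 free, 1 surplus), while model_step5(true) yields the expected totals (1 page, 1 free, 0 surplus).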