From: Andrey Alekhin <andrei.aleohin@gmail.com>
To: osalvador@suse.de
Cc: muchun.song@linux.dev, linux-mm@kvack.org
Subject: Re: Re: [PATCH] mm: free surplus huge pages properly on NUMA systems
Date: Wed, 21 May 2025 19:28:38 +0300
Message-ID: <20250521162838.103981-1-andrei.aleohin@gmail.com>

On Tue, 20 May 2025 03:26:18 -0700, Oscar Salvador wrote:
> On Thu, May 15, 2025 at 10:13:27PM +0300, Andrey Alekhin wrote:
>> The following sequence is possible on a NUMA system:
>>
>> n - overall number of huge pages
>> f - number of free huge pages
>> s - number of surplus huge pages
>> r - number of reserved huge pages
>> huge page counters: [before]
>>                        |
>>                     [after]
>>
>> Process runs on node #1
>>                       |
>>      node0           node1
>> 1) addr1 = mmap(MAP_SHARED, ...) // 1 huge page is mmapped (cur_nid=1)
>>    [n=2 f=2 s=0]  [n=1 f=1 s=0]  r=0
>>                 |
>>    [n=2 f=2 s=0]  [n=1 f=1 s=0]  r=1
>>
>> 2) echo 1 > /proc/sys/vm/nr_hugepages (cur_nid=1)
>>    [n=2 f=2 s=0]  [n=1 f=1 s=0]  r=1
>>                 |
>>    [n=0 f=0 s=0]  [n=1 f=1 s=0]  r=1
>>
>> 3) addr2 = mmap(MAP_SHARED, ...) // 1 huge page is mmapped (cur_nid=1)
>>    [n=0 f=0 s=0]  [n=1 f=1 s=0]  r=1
>>                 |
>>    [n=1 f=1 s=1]  [n=1 f=1 s=0]  r=2
>>    The new surplus huge page is allocated on node0, not on node1. In
>>    Linux 6.14 this is unlikely but possible and legal.
>>
>> 4) write to the second page (touch)
>>    [n=1 f=1 s=1]  [n=1 f=1 s=0]  r=2
>>                 |
>>    [n=1 f=1 s=1]  [n=1 f=0 s=0]  r=1
>>    The reserved page is mapped on node1.
>>
>> 5) munmap(addr2) // 1 huge page is unmapped
>>    [n=1 f=1 s=1]  [n=1 f=0 s=0]  r=1
>>                 |
>>    [n=1 f=1 s=1]  [n=1 f=1 s=0]  r=1
>>    The huge page is freed, but it is not freed as a surplus page. The
>>    huge page counters in the system are now [nr_hugepages=2
>>    free_huge_pages=2 surplus_hugepages=1], but they should be
>>    [nr_hugepages=1 free_huge_pages=1 surplus_hugepages=0].
>
> But sure once you do the munmap for addr1, stats will be corrected
> again, right?

Yes, after munmap of addr1 the stats are corrected again. But the
counters after munmap of addr2 are wrong. In this example there is only
one excess allocated huge page in the system between unmapping addr2
and addr1, but the problem scales: that number can be much bigger. A
more detailed explanation of the sequence leading to the problem is
given below; the hugemmap10 test from the LTP suite performs this
sequence.

1) mmap of 1 SHARED huge page -> addr1 is returned. 1 free huge page is
   reserved for the vma.

2) The number of huge pages in the system is forcefully decreased to 1.
   There is now only 1 huge page in the system, and it is reserved for
   the vma.

3) mmap of 1 additional SHARED huge page -> addr2 is returned. The
   mappings for addr1 and addr2 use similar flags (both shared), so the
   kernel uses the same vma for them, and 1 additional huge page is
   reserved for it. Because there are no free huge pages in the system,
   this page is allocated and added to the h->hugepage_freelists[nid]
   list by the gather_surplus_pages() function. The surplus huge page
   counters are adjusted too.

There is a key difference in the huge page allocation process between
5.14 and the current kernel. In the 5.14 kernel,
gather_surplus_pages() starts trying to allocate the new huge page on
node 0, whereas in the current kernel it starts on the node of the
current CPU. This means that on 5.14 the huge page is often allocated
on another node, while on the current kernel it is almost always
allocated on the node of the current CPU.
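Which node the surplus page actually lands on can be checked from
userspace via the per-node sysfs counters. Here is a minimal helper
sketch (my own illustration, not from the kernel tree or LTP; it
assumes a 2 MB hugepage size and a two-node machine):

#include <stdio.h>

/*
 * Print the per-node hugepage counters. The sysfs paths assume a 2 MB
 * hugepage size and nodes numbered 0..nr_nodes-1 being online.
 */
static void print_node_counters(int nr_nodes)
{
	static const char * const names[] = {
		"nr_hugepages", "free_hugepages", "surplus_hugepages"
	};
	char path[160];
	int node, i;

	for (node = 0; node < nr_nodes; node++) {
		printf("node%d:", node);
		for (i = 0; i < 3; i++) {
			long val = -1;
			FILE *f;

			snprintf(path, sizeof(path),
				 "/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/%s",
				 node, names[i]);
			f = fopen(path, "r");
			if (f) {
				if (fscanf(f, "%ld", &val) != 1)
					val = -1;	/* unreadable counter */
				fclose(f);
			}
			printf(" %s=%ld", names[i], val);
		}
		printf("\n");
	}
}

int main(void)
{
	print_node_counters(2);	/* assumption: a two-node system */
	return 0;
}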
As a result, 2 pages are reserved for our vma, and 2 cases are
possible:

- 2 reserved pages in h->hugepage_freelists[1]. This is almost always
  the case on the current kernel.

       n0              n1
  [n=0 f=0 s=0]  [n=2 f=2 s=1]  r=2

- 1 reserved page in h->hugepage_freelists[0] and 1 reserved page in
  h->hugepage_freelists[1]. This case often occurs on the 5.14 kernel,
  but it is unlikely on the current kernel.

       n0              n1
  [n=1 f=1 s=1]  [n=1 f=1 s=0]  r=2

4) Write data to addr2 -> a page fault occurs and a huge page is
   mapped. In both cases described above the reserved huge page is
   taken from h->hugepage_freelists[1] (the current node).

5) munmap is called for addr2. Now a huge page needs to be freed in
   accordance with the h->nr_huge_pages and h->surplus_huge_pages
   counters. In our case h->nr_huge_pages = 2 and
   h->surplus_huge_pages = 1; after munmap the counters should be
   h->nr_huge_pages = 1 and h->surplus_huge_pages = 0.

   - In the case of 2 reserved pages in h->hugepage_freelists[1] this
     indeed happens, because h->surplus_huge_pages_node[1] = 1 and the
     surplus huge page is freed in free_huge_folio(). This is why the
     problem is not easily reproducible on the current kernel. But it
     is not guaranteed that the huge page is always allocated on the
     current node: when allocation on the current node fails, the new
     huge page can be allocated on another node, which is the second
     case.

   - In the case of 1 reserved page in h->hugepage_freelists[0] and 1
     in h->hugepage_freelists[1], h->surplus_huge_pages_node[1] = 0 and
     the surplus huge page is not freed properly. 1 excess allocated
     huge page remains in the system, and the huge page counters have
     wrong values: h->nr_huge_pages = 2, h->surplus_huge_pages = 1.

The potential number of excess allocated huge pages depends on the
order of unmapping and the amount of free memory on each node.

> And I am not convinced about this one.
> Apart from the fact that free_huge_folio() can be called from a workqueue,
> why would we need to do this dance?

In my opinion it does not matter where free_huge_folio() is called
from. We need to obey the limit on the number of huge pages in the
system: all huge pages allocated on top of this limit should be
considered surplus and freed when doing munmap. A reproducer sketch is
given below.
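For completeness, here is a minimal userspace sketch of the whole
sequence. It is an illustration, not the LTP test itself: hugemmap10
maps a hugetlbfs file, while this sketch uses anonymous MAP_HUGETLB
mappings; it also assumes a 2 MB default hugepage size and enough
privileges to write /proc/sys/vm/nr_hugepages. The per-node helper
above can be called between the steps to watch the counters.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumed default hugepage size */

static void set_nr_hugepages(int n)
{
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

	if (!f) {
		perror("nr_hugepages");
		exit(1);
	}
	fprintf(f, "%d\n", n);
	fclose(f);
}

int main(void)
{
	/* 1) map one shared huge page: one free huge page gets reserved */
	char *addr1 = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (addr1 == MAP_FAILED) {
		perror("mmap addr1");
		return 1;
	}

	/* 2) shrink the pool so the only page left is the reserved one */
	set_nr_hugepages(1);

	/* 3) map a second shared huge page: forces a surplus allocation
	 *    via gather_surplus_pages()
	 */
	char *addr2 = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (addr2 == MAP_FAILED) {
		perror("mmap addr2");
		return 1;
	}

	/* 4) touch the second mapping: the reserved page is faulted in */
	memset(addr2, 0, HPAGE_SIZE);

	/* 5) unmap it: the freed page should be accounted as surplus;
	 *    compare HugePages_Total/Free/Surp in /proc/meminfo before
	 *    and after this call
	 */
	munmap(addr2, HPAGE_SIZE);

	munmap(addr1, HPAGE_SIZE);
	return 0;
}

On an affected kernel, in the second case described above the global
surplus counter stays at 1 after step 5, matching the wrong
[nr_hugepages=2 free_huge_pages=2 surplus_hugepages=1] state from the
patch description.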
>> void free_huge_folio(struct folio *folio)
>> {
>> 	/*
>> @@ -1833,6 +1850,8 @@ void free_huge_folio(struct folio *folio)
>>  	struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
>>  	bool restore_reserve;
>>  	unsigned long flags;
>> +	int node;
>> +	nodemask_t *mbind_nodemask, alloc_nodemask;
>>
>>  	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
>>  	VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
>> @@ -1883,6 +1902,25 @@ void free_huge_folio(struct folio *folio)
>>  		remove_hugetlb_folio(h, folio, true);
>>  		spin_unlock_irqrestore(&hugetlb_lock, flags);
>>  		update_and_free_hugetlb_folio(h, folio, true);
>> +	} else if (h->surplus_huge_pages) {
>> +		mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
>> +		if (mbind_nodemask)
>> +			nodes_and(alloc_nodemask, *mbind_nodemask,
>> +				  cpuset_current_mems_allowed);
>> +		else
>> +			alloc_nodemask = cpuset_current_mems_allowed;
>> +
>> +		for_each_node_mask(node, alloc_nodemask) {
>> +			if (h->surplus_huge_pages_node[node]) {
>> +				h->surplus_huge_pages_node[node]--;
>> +				h->surplus_huge_pages--;
>> +				break;
>> +			}
>> +		}
>
> What if the node is not in the policy anymore? What happens to its
> counters?

There is a problem here in my patch. I followed the mempolicy of the
current task when calculating the node mask, similarly to the
allocation path in gather_surplus_pages(). But the mempolicy, and
therefore the node mask, can change between the allocation of a huge
page and its freeing, at least via the set_mempolicy() system call.
Such a change does not make NUMA nodes unavailable. I suppose that all
available nodes should be scanned instead:

void free_huge_folio(struct folio *folio)
{
	/*
@@ -1833,6 +1850,8 @@ void free_huge_folio(struct folio *folio)
 	struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
 	bool restore_reserve;
 	unsigned long flags;
+	int node;

 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
 	VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
@@ -1883,6 +1902,25 @@ void free_huge_folio(struct folio *folio)
 		remove_hugetlb_folio(h, folio, true);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_hugetlb_folio(h, folio, true);
+	} else if (h->surplus_huge_pages) {
+		for_each_node_mask(node, cpuset_current_mems_allowed) {
+			if (h->surplus_huge_pages_node[node]) {
+				h->surplus_huge_pages_node[node]--;
+				h->surplus_huge_pages--;
+				break;
+			}
+		}

> I have to think about this some more, but I am not really convinced we
> need this.

OK, that is why I have described the problem in detail above.

--
Andrey Alekhin