Date: Tue, 20 May 2025 12:26:00 +0200
From: Oscar Salvador
To: Andrey Alekhin
Cc: muchun.song@linux.dev, linux-mm@kvack.org
Subject: Re: [PATCH] mm: free surplus huge pages properly on NUMA systems
References: <20250515191327.41089-1-andrei.aleohin@gmail.com>
In-Reply-To: <20250515191327.41089-1-andrei.aleohin@gmail.com>

On Thu, May 15, 2025 at 10:13:27PM +0300, Andrey
Alekhin wrote:
> The following sequence is possible on NUMA system:
>
> n - overall number of huge pages
> f - number of free huge pages
> s - number of surplus huge pages
> huge page counters: [before]
>                        |
>                     [after]
>
> Process runs on node #1
>                        |
>        node0           node1
> 1) addr1 = mmap(MAP_SHARED, ...) // 1 huge page is mmapped (cur_nid=1)
>    [n=2 f=2 s=0]   [n=1 f=1 s=0] r=0
>                  |
>    [n=2 f=2 s=0]   [n=1 f=1 s=0] r=1
>
> 2) echo 1 > /proc/sys/vm/nr_hugepages (cur_nid=1)
>    [n=2 f=2 s=0]   [n=1 f=1 s=0] r=1
>                  |
>    [n=0 f=0 s=0]   [n=1 f=1 s=0] r=1
> 3) addr2 = mmap(MAP_SHARED, ...) // 1 huge page is mmapped (cur_nid=1)
>    [n=0 f=0 s=0]   [n=1 f=1 s=0] r=1
>                  |
>    [n=1 f=1 s=1]   [n=1 f=1 s=0] r=2
>    New surplus huge page is reserved on node0, not on node1. In linux 6.14
>    it is unlikely but possible and legal.
>
> 4) write to second page (touch)
>    [n=1 f=1 s=1]   [n=1 f=1 s=0] r=2
>                  |
>    [n=1 f=1 s=1]   [n=1 f=0 s=0] r=1
>    Reserved page is mapped on node1
>
> 5) munmap(addr2) // 1 huge page is unmapped
>    [n=1 f=1 s=1]   [n=1 f=0 s=0] r=1
>                  |
>    [n=1 f=1 s=1]   [n=1 f=1 s=0] r=1
>    Huge page is freed, but it is not freed as surplus page. Huge page
>    counters in system are now: [nr_hugepages=2 free_huge_pages=2
>    surplus_hugepages=1]. But they must be: [nr_hugepages=1 free_huge_pages=1
>    surplus_hugepages=0].

But surely once you do the munmap for addr1, the stats will be corrected
again, right?
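One way to check that would be to dump the global hugetlb counters after
each step of the sequence and once more after munmap(addr1). Below is a
minimal userspace sketch, assuming the usual HugePages_* fields in
/proc/meminfo for the default huge page size. The struct and helper names
(hp_counters, parse_hp_counters, dump_hp_counters) are made up for
illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Global hugetlb counters as exported in /proc/meminfo (default hstate). */
struct hp_counters {
	long total;	/* HugePages_Total */
	long free;	/* HugePages_Free */
	long rsvd;	/* HugePages_Rsvd */
	long surp;	/* HugePages_Surp */
};

/*
 * Hypothetical helper: parse the HugePages_* lines from a meminfo-style
 * stream. Returns the number of counters found (4 when all are present).
 */
static int parse_hp_counters(FILE *fp, struct hp_counters *c)
{
	char line[256];
	int found = 0;

	assert(fp && c);
	memset(c, 0, sizeof(*c));

	while (fgets(line, sizeof(line), fp)) {
		/* Each sscanf() only matches the line carrying its own prefix. */
		if (sscanf(line, "HugePages_Total: %ld", &c->total) == 1 ||
		    sscanf(line, "HugePages_Free: %ld", &c->free) == 1 ||
		    sscanf(line, "HugePages_Rsvd: %ld", &c->rsvd) == 1 ||
		    sscanf(line, "HugePages_Surp: %ld", &c->surp) == 1)
			found++;
	}
	return found;
}

/* Hypothetical helper: print the current counters, prefixed with a step tag. */
static int dump_hp_counters(const char *tag)
{
	struct hp_counters c;
	FILE *fp = fopen("/proc/meminfo", "r");
	int found;

	if (!fp)
		return -1;
	found = parse_hp_counters(fp, &c);
	fclose(fp);
	if (found != 4)
		return -1;

	printf("%s: nr_hugepages=%ld free_huge_pages=%ld resv=%ld surplus=%ld\n",
	       tag, c.total, c.free, c.rsvd, c.surp);
	return 0;
}
```

This only reports the system-wide totals. The per-node values from the
diagram would have to be read from
/sys/devices/system/node/node*/hugepages/ instead.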
>  void free_huge_folio(struct folio *folio)
>  {
>  	/*
> @@ -1833,6 +1850,8 @@ void free_huge_folio(struct folio *folio)
>  	struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
>  	bool restore_reserve;
>  	unsigned long flags;
> +	int node;
> +	nodemask_t *mbind_nodemask, alloc_nodemask;
>
>  	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
> @@ -1883,6 +1902,25 @@ void free_huge_folio(struct folio *folio)
>  		remove_hugetlb_folio(h, folio, true);
>  		spin_unlock_irqrestore(&hugetlb_lock, flags);
>  		update_and_free_hugetlb_folio(h, folio, true);
> +	} else if (h->surplus_huge_pages) {
> +		mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
> +		if (mbind_nodemask)
> +			nodes_and(alloc_nodemask, *mbind_nodemask,
> +				  cpuset_current_mems_allowed);
> +		else
> +			alloc_nodemask = cpuset_current_mems_allowed;
> +
> +		for_each_node_mask(node, alloc_nodemask) {
> +			if (h->surplus_huge_pages_node[node]) {
> +				h->surplus_huge_pages_node[node]--;
> +				h->surplus_huge_pages--;
> +				break;
> +			}
> +		}

And I am not convinced about this one.
Apart from the fact that free_huge_folio() can be called from a workqueue,
why would we need to do this dance?
What if the node is not in the policy anymore? What happens to its
counters?

I have to think about this some more, but I am not really convinced we
need this.

-- 
Oscar Salvador
SUSE Labs