Date: Tue, 15 Apr 2025 21:07:53 -0400
Subject: Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
To: Frank van der Linden, akpm@linux-foundation.org, muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: david@redhat.com, osalvador@suse.de
References: <20250402205613.3086864-1-fvdl@google.com>
From: Luiz Capitulino
In-Reply-To: <20250402205613.3086864-1-fvdl@google.com>

On 2025-04-02 16:56, Frank van der Linden wrote:
> Hugetlb boot allocation has used online nodes for allocation since
> commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
> allocation"). This was needed to be able to do the allocations
> earlier in boot, before N_MEMORY was set.

Honest question: I imagine there's a reason why we can't move x86's
hugetlb_cma_reserve() and hugetlb_bootmem_alloc() calls in setup_arch()
to after x86_init.paging.pagetable_init() (which seems to be where we
call zone_sizes_init())? This way we could go back to using N_MEMORY
and avoid this dance.

I'm not familiar with vmemmap, if that's the reason...

- Luiz

>
> This might lead to a different distribution of gigantic hugepages
> across NUMA nodes if there are memoryless nodes in the system.
>
> What happens is that the memoryless nodes are tried, but then
> the memblock allocation fails and falls back, which usually means
> that the node that has the highest physical address available
> will be used (top-down allocation). While this will end up
> getting the same number of hugetlb pages, they might not be
> distributed the same way. The fallback for each memoryless
> node might not end up coming from the same node as the
> successful round-robin allocation from N_MEMORY nodes.
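
To make the fallback effect described in the quoted paragraph above concrete, here is a minimal userspace sketch (not kernel code). It assumes a hypothetical 4-node machine where nodes 1 and 3 are online but memoryless, and it models the memblock fallback as "always the highest-numbered node with memory", which simplifies the top-down behaviour the commit message calls the usual case. The names has_memory, fallback_node() and distribute() are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4

/* Hypothetical topology: nodes 1 and 3 are online but have no memory. */
static const bool has_memory[NR_NODES] = { true, false, true, false };

/* Simplified stand-in for the memblock fallback: highest node with memory. */
static int fallback_node(void)
{
	for (int nid = NR_NODES - 1; nid >= 0; nid--)
		if (has_memory[nid])
			return nid;
	return -1;
}

/*
 * Round-robin nr_pages over the nodes selected by 'mask'. A request
 * aimed at a memoryless node is satisfied from fallback_node() instead.
 * The resulting per-node page counts are written to 'pages'.
 */
static void distribute(const bool mask[NR_NODES], int nr_pages,
		       int pages[NR_NODES])
{
	bool any = false;
	int nid = 0;

	for (int i = 0; i < NR_NODES; i++) {
		pages[i] = 0;
		if (mask[i])
			any = true;
	}
	/* An empty mask or a machine without memory would never terminate. */
	assert(any && fallback_node() >= 0);

	while (nr_pages > 0) {
		if (mask[nid]) {
			int target = has_memory[nid] ? nid : fallback_node();

			pages[target]++;
			nr_pages--;
		}
		nid = (nid + 1) % NR_NODES;
	}
}
```

In this model, distributing 8 pages over all four online nodes puts 2 pages on node 0 and 6 on node 2, while distributing over only the nodes with memory gives 4 and 4. The total is the same in both cases; only the placement differs.
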
>
> While administrators that rely on having a specific number of
> hugepages per node should use the hugepages=N:X syntax, it's
> better not to change the old behavior for the plain hugepages=N
> case.
>
> To do this, construct a nodemask for hugetlb bootmem purposes
> only, containing nodes that have memory. Then use that
> for round-robin bootmem allocations.
>
> This saves some cycles, and the added advantage here is that
> hugetlb_cma can use it too, avoiding the older issue of
> pointless attempts to create a CMA area for memoryless nodes
> (which will also cause the per-node CMA area size to be too
> small).
>
> Fixes: de55996d7188 ("mm/hugetlb: use online nodes for bootmem allocation")
> Signed-off-by: Frank van der Linden
> ---
>  include/linux/hugetlb.h |  3 +++
>  mm/hugetlb.c            | 30 ++++++++++++++++++++++++++++--
>  mm/hugetlb_cma.c        | 11 +++++++----
>  3 files changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8f3ac832ee7f..fc9166f7f679 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -14,6 +14,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  struct ctl_table;
>  struct user_struct;
> @@ -176,6 +177,8 @@ extern struct list_head huge_boot_pages[MAX_NUMNODES];
>
>  void hugetlb_bootmem_alloc(void);
>  bool hugetlb_bootmem_allocated(void);
> +extern nodemask_t hugetlb_bootmem_nodes;
> +void hugetlb_bootmem_set_nodes(void);
>
>  /* arch callbacks */
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6fccfe6d046c..e69f6f31e082 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -58,6 +58,7 @@ int hugetlb_max_hstate __read_mostly;
>  unsigned int default_hstate_idx;
>  struct hstate hstates[HUGE_MAX_HSTATE];
>
> +__initdata nodemask_t hugetlb_bootmem_nodes;
>  __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
>  static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata;
>
> @@ -3237,7 +3238,8 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>  	}
>
>  	/* allocate from next node when distributing huge pages */
> -	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_ONLINE]) {
> +	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
> +				    &hugetlb_bootmem_nodes) {
>  		m = alloc_bootmem(h, node, false);
>  		if (!m)
>  			return 0;
> @@ -3701,6 +3703,15 @@ static void __init hugetlb_init_hstates(void)
>  	struct hstate *h, *h2;
>
>  	for_each_hstate(h) {
> +		/*
> +		 * Always reset to first_memory_node here, even if
> +		 * next_nid_to_alloc was set before - we can't
> +		 * reference hugetlb_bootmem_nodes after init, and
> +		 * first_memory_node is right for all further allocations.
> +		 */
> +		h->next_nid_to_alloc = first_memory_node;
> +		h->next_nid_to_free = first_memory_node;
> +
>  		/* oversize hugepages were init'ed in early boot */
>  		if (!hstate_is_gigantic(h))
>  			hugetlb_hstate_alloc_pages(h);
> @@ -4990,6 +5001,20 @@ static int __init default_hugepagesz_setup(char *s)
>  }
>  hugetlb_early_param("default_hugepagesz", default_hugepagesz_setup);
>
> +void __init hugetlb_bootmem_set_nodes(void)
> +{
> +	int i, nid;
> +	unsigned long start_pfn, end_pfn;
> +
> +	if (!nodes_empty(hugetlb_bootmem_nodes))
> +		return;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> +		if (end_pfn > start_pfn)
> +			node_set(nid, hugetlb_bootmem_nodes);
> +	}
> +}
> +
>  static bool __hugetlb_bootmem_allocated __initdata;
>
>  bool __init hugetlb_bootmem_allocated(void)
> @@ -5005,6 +5030,8 @@ void __init hugetlb_bootmem_alloc(void)
>  	if (__hugetlb_bootmem_allocated)
>  		return;
>
> +	hugetlb_bootmem_set_nodes();
> +
>  	for (i = 0; i < MAX_NUMNODES; i++)
>  		INIT_LIST_HEAD(&huge_boot_pages[i]);
>
> @@ -5012,7 +5039,6 @@ void __init hugetlb_bootmem_alloc(void)
>
>  	for_each_hstate(h) {
>  		h->next_nid_to_alloc = first_online_node;
> -		h->next_nid_to_free = first_online_node;
>
>  		if (hstate_is_gigantic(h))
>  			hugetlb_hstate_alloc_pages(h);
> diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
> index e0f2d5c3a84c..f58ef4969e7a 100644
> --- a/mm/hugetlb_cma.c
> +++ b/mm/hugetlb_cma.c
> @@ -66,7 +66,7 @@ hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid, bool node_exact)
>  			if (node_exact)
>  				return NULL;
>
> -			for_each_online_node(node) {
> +			for_each_node_mask(node, hugetlb_bootmem_nodes) {
>  				cma = hugetlb_cma[node];
>  				if (!cma || node == *nid)
>  					continue;
> @@ -153,11 +153,13 @@ void __init hugetlb_cma_reserve(int order)
>  	if (!hugetlb_cma_size)
>  		return;
>
> +	hugetlb_bootmem_set_nodes();
> +
>  	for (nid = 0; nid < MAX_NUMNODES; nid++) {
>  		if (hugetlb_cma_size_in_node[nid] == 0)
>  			continue;
>
> -		if (!node_online(nid)) {
> +		if (!node_isset(nid, hugetlb_bootmem_nodes)) {
>  			pr_warn("hugetlb_cma: invalid node %d specified\n", nid);
>  			hugetlb_cma_size -= hugetlb_cma_size_in_node[nid];
>  			hugetlb_cma_size_in_node[nid] = 0;
> @@ -190,13 +192,14 @@ void __init hugetlb_cma_reserve(int order)
>  		 * If 3 GB area is requested on a machine with 4 numa nodes,
>  		 * let's allocate 1 GB on first three nodes and ignore the last one.
>  		 */
> -		per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
> +		per_node = DIV_ROUND_UP(hugetlb_cma_size,
> +					nodes_weight(hugetlb_bootmem_nodes));
>  		pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
>  			hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
>  	}
>
>  	reserved = 0;
> -	for_each_online_node(nid) {
> +	for_each_node_mask(nid, hugetlb_bootmem_nodes) {
>  		int res;
>  		char name[CMA_MAX_NAME];
>
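
The per-node CMA sizing change in the last hunk can be checked with a bit of arithmetic. The userspace sketch below is only loosely modelled on the reservation loop: it assumes each node with memory reserves min(per_node, remaining), that memoryless nodes reserve nothing, and it ignores any size rounding or alignment. total_reserved_mib() and its parameters are hypothetical names; DIV_ROUND_UP is redefined locally with the usual round-up-division meaning.

```c
#include <assert.h>

/* Local definition with the usual round-up-division meaning. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: how much ends up reserved (in MiB) when the
 * request is divided by 'nodes_counted', but only 'nodes_with_memory'
 * nodes can actually satisfy a reservation.
 */
static unsigned long total_reserved_mib(unsigned long request_mib,
					unsigned int nodes_counted,
					unsigned int nodes_with_memory)
{
	unsigned long per_node, reserved = 0;

	assert(nodes_counted > 0);
	per_node = DIV_ROUND_UP(request_mib, nodes_counted);

	for (unsigned int i = 0; i < nodes_with_memory; i++) {
		/* Each node takes its share, capped by what is still needed. */
		unsigned long size = request_mib - reserved;

		if (size > per_node)
			size = per_node;
		reserved += size;
	}
	return reserved;
}
```

For a 3072 MiB request on 4 online nodes of which only 2 have memory, dividing by 4 yields 768 MiB per node and 1536 MiB in total under these assumptions. Dividing by the 2 nodes with memory yields 1536 MiB per node and the full 3072 MiB.
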