Subject: Re: [PATCH] hugetlb: Support node specified when using cma for gigantic hugepages
To: Mike Kravetz, akpm@linux-foundation.org
Cc: mhocko@kernel.org, guro@fb.com, corbet@lwn.net, yaozhenguo1@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
References: <1633843448-966-1-git-send-email-baolin.wang@linux.alibaba.com> <6bd3789f-4dee-a184-d415-4ad77f0f98b7@oracle.com> <326ece39-a6f5-26ce-827b-68272525e947@linux.alibaba.com> <3858dbb6-3353-749e-6867-a5d6046e2f1a@oracle.com>
From: Baolin Wang <baolin.wang@linux.alibaba.com>
Message-ID: <67a6a5bc-f83f-15bb-f728-f32497c7cf9f@linux.alibaba.com>
Date: Thu, 14 Oct 2021 10:39:25 +0800
In-Reply-To: <3858dbb6-3353-749e-6867-a5d6046e2f1a@oracle.com>

On 2021/10/14 10:30, Mike Kravetz wrote:
> On 10/13/21 7:23 PM, Baolin Wang wrote:
>>
>>
>> On 2021/10/14 6:06, Mike Kravetz wrote:
>>> On 10/9/21 10:24 PM, Baolin Wang wrote:
>>>> Now the size of CMA area for gigantic hugepages runtime allocation is
>>>> balanced for all online nodes, but we also want to specify the size of
>>>> CMA per-node, or only one node in some cases, which are similar with
>>>> commit 86acc55c3d32 ("hugetlbfs: extend the definition of hugepages
>>>> parameter to support node allocation")[1].
>>>>
>>>> Thus this patch adds node format for 'hugetlb_cma' parameter to support
>>>> specifying the size of CMA per-node. An example is as follows:
>>>>
>>>> hugetlb_cma=0:5G,2:5G
>>>>
>>>> which means allocating 5G size of CMA area on node 0 and node 2
>>>> respectively.
>>>>
>>>> [1]
>>>> https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>   Documentation/admin-guide/kernel-parameters.txt |  6 +-
>>>>   mm/hugetlb.c                                    | 79 +++++++++++++++++++++----
>>>>   2 files changed, 73 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index 3ad8e9d0..a147faa5 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -1587,8 +1587,10 @@
>>>>   			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>>>>   	hugetlb_cma=	[HW,CMA] The size of a CMA area used for allocation
>>>> -			of gigantic hugepages.
>>>> -			Format: nn[KMGTPE]
>>>> +			of gigantic hugepages. Or using node format, the size
>>>> +			of a CMA area per node can be specified.
>>>> +			Format: nn[KMGTPE] or (node format)
>>>> +				<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
>>>>   			Reserve a CMA area of given size and allocate gigantic
>>>>   			hugepages using the CMA allocator. If enabled, the
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 6d2f4c2..8b4e409 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -50,6 +50,7 @@
>>>>   #ifdef CONFIG_CMA
>>>>   static struct cma *hugetlb_cma[MAX_NUMNODES];
>>>> +static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
>>>>   static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>>>   {
>>>>   	return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
>>>> @@ -62,6 +63,7 @@ static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>>>   }
>>>>   #endif
>>>>   static unsigned long hugetlb_cma_size __initdata;
>>>> +static nodemask_t hugetlb_cma_nodes_allowed = NODE_MASK_NONE;
>>>>   /*
>>>>    * Minimum page order among possible hugepage sizes, set to a proper value
>>>> @@ -3497,9 +3499,15 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>>>>   	if (nid == NUMA_NO_NODE) {
>>>>   		/*
>>>> +		 * If we've specified the size of CMA area per node,
>>>> +		 * should use it firstly.
>>>> +		 */
>>>> +		if (hstate_is_gigantic(h) && !nodes_empty(hugetlb_cma_nodes_allowed))
>>>> +			n_mask = &hugetlb_cma_nodes_allowed;
>>>> +		/*
>>>
>>> IIUC, this changes the behavior for 'balanced' gigantic huge page pool
>>> allocations if per-node hugetlb_cma is specified.  It will now only
>>> attempt to allocate gigantic pages on nodes where CMA was reserved.
>>> Even if we run out of space on the node, it will not go to other nodes
>>> as before.  Is that correct?
>>
>> Right.
>>
>>>
>>> I do not believe we want this change in behavior.  IMO, if the user is
>>> doing node specific CMA reservations, then the user should use the node
>>> specific sysfs file for pool allocations on that node.
>>
>> Sounds more reasonable, will move 'hugetlb_cma_nodes_allowed' to the
>> node specific allocation.
>>
>>>>   		 * global hstate attribute
>>>>   		 */
>>>> -		if (!(obey_mempolicy &&
>>>> +		else if (!(obey_mempolicy &&
>>>>   		      init_nodemask_of_mempolicy(&nodes_allowed)))
>>>>   			n_mask = &node_states[N_MEMORY];
>>>>   		else
>>>> @@ -6745,7 +6753,38 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>>>>   static int __init cmdline_parse_hugetlb_cma(char *p)
>>>>   {
>>>> -	hugetlb_cma_size = memparse(p, &p);
>>>> +	int nid, count = 0;
>>>> +	unsigned long tmp;
>>>> +	char *s = p;
>>>> +
>>>> +	while (*s) {
>>>> +		if (sscanf(s, "%lu%n", &tmp, &count) != 1)
>>>> +			break;
>>>> +
>>>> +		if (s[count] == ':') {
>>>> +			nid = tmp;
>>>> +			if (nid < 0 || nid >= MAX_NUMNODES)
>>>> +				break;
>>>
>>> nid can only be compared to MAX_NUMNODES because this is an early param
>>> before numa is setup and we do not know exactly how many nodes there
>>> are.  Is this correct?
>>
>> Yes.
>>
>>>
>>> Suppose one specifies an invalid node.  For example, on my 2 node system
>>> I use the option 'hugetlb_cma=2:2G'.  This is not flagged as an error
>>> during processing and 1G CMA is reserved on node 0 and 1G is reserved
>>> on node 1.  Is that by design, or just chance?
>>
>> Actually we won't allocate any CMA area in this case, since in
>> hugetlb_cma_reserve(), we will only iterate the online nodes to try to
>> allocate CMA area, and node 2 is not in the range of online nodes in
>> this case.
>>
>
> But, since it can not do node specific allocations it falls through to
> the all nodes case?  Here is what I see:
>
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1
> node 0 size: 8053 MB
> node 0 free: 6543 MB
> node 1 cpus: 2 3
> node 1 size: 8150 MB
> node 1 free: 4851 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> # reboot
>
> # dmesg | grep -i huge
> [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-rc4-mm1+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugetlb_cma=2:2G
> [    0.008345] hugetlb_cma: reserve 2048 MiB, up to 1024 MiB per node
> [    0.008349] hugetlb_cma: reserved 1024 MiB on node 0
> [    0.008352] hugetlb_cma: reserved 1024 MiB on node 1
> [    0.053682] Kernel command line: BOOT_IMAGE=/vmlinuz-5.15.0-rc4-mm1+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugetlb_cma=2:2G
> [    0.401648] HugeTLB: can free 4094 vmemmap pages for hugepages-1048576kB
> [    0.413681] HugeTLB: can free 6 vmemmap pages for hugepages-2048kB
> [    0.414590] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> [    0.415653] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages

Oh, I see. I only validate the online nodes to check if we've specified
the size of CMA, so in this case, it will fall back to the original
balanced policy. As you said, I should catch the invalid nodes in
hugetlb_cma_reserve() to avoid this issue. Thanks for pointing out the
issue.
+	for_each_node_state(nid, N_ONLINE) {
+		if (hugetlb_cma_size_in_node[nid] > 0) {
+			node_specific_cma_alloc = true;
+			break;
+		}
+	}
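
For illustration only, below is a minimal userspace sketch (not kernel code) of
the two-stage validation discussed above: the early parameter parser can only
range-check the node id against MAX_NUMNODES, while the later reservation step
knows which nodes actually exist and can reject the rest. The names
parse_hugetlb_cma_arg, reserve_hugetlb_cma and fake_online_nodes are made up
for this sketch, and only the G size suffix is handled here instead of the
kernel's memparse().

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_NUMNODES 64

static unsigned long cma_size_in_node[MAX_NUMNODES];
/* Pretend only nodes 0 and 1 exist, like the 2-node machine above. */
static bool fake_online_nodes[MAX_NUMNODES] = { [0] = true, [1] = true };

/* Stage 1: like cmdline_parse_hugetlb_cma(), only a range check is possible. */
static int parse_hugetlb_cma_arg(const char *p)
{
	char buf[256];
	char *s;

	snprintf(buf, sizeof(buf), "%s", p);
	for (s = strtok(buf, ","); s; s = strtok(NULL, ",")) {
		int nid;
		unsigned long size;

		/* only "<node>:<n>G" handled; the real code uses memparse() */
		if (sscanf(s, "%d:%luG", &nid, &size) != 2)
			return -1;
		if (nid < 0 || nid >= MAX_NUMNODES)
			return -1;
		cma_size_in_node[nid] = size << 30;
	}
	return 0;
}

/* Stage 2: like hugetlb_cma_reserve(), which can catch a size requested on a
 * node that is not present instead of silently falling back. */
static void reserve_hugetlb_cma(void)
{
	int nid;

	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (!cma_size_in_node[nid])
			continue;
		if (!fake_online_nodes[nid]) {
			printf("hugetlb_cma: invalid node %d specified, ignoring\n", nid);
			cma_size_in_node[nid] = 0;
			continue;
		}
		printf("hugetlb_cma: would reserve %lu MiB on node %d\n",
		       cma_size_in_node[nid] >> 20, nid);
	}
}

int main(void)
{
	parse_hugetlb_cma_arg("0:1G,2:2G");	/* node 2 is not "online" in this sketch */
	reserve_hugetlb_cma();
	return 0;
}

Built with a plain C compiler, this prints a reservation line for node 0 and an
"invalid node" warning for node 2, mirroring the hugetlb_cma=2:2G case from the
dmesg output above rather than silently spreading the size across all nodes.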