From: Xueshi Hu
Date: Tue, 1 Aug 2023 20:22:56 +0800
Subject: Re: [PATCH 1/3] mm/hugetlb: fix the inconsistency of /proc/sys/vm/nr_huge_pages
To: Mike Kravetz
Cc: muchun.song@linux.dev, akpm@linux-foundation.org, linux-mm@kvack.org
In-Reply-To: <20230731221725.GA3351@monkey>
References: <20230730125156.207301-1-xueshi.hu@smartx.com> <20230730125156.207301-2-xueshi.hu@smartx.com> <20230731221725.GA3351@monkey>

On Tue, Aug 1, 2023 at 6:17 AM Mike Kravetz wrote:
>
> On 07/30/23 20:51, Xueshi Hu wrote:
> > When writing to /proc/sys/vm/nr_huge_pages, it indicates global number of
> > huge pages of the default hstate. But when reading from it, it indicates
> > the current number of "persistent" huge pages in the kernel's huge page
> > pool.
> >
> > There are currently four interfaces used to export the number of huge
> > pages:
> > - /proc/meminfo
> > - /proc/sys/vm/*hugepages*
> > - /sys/devices/system/node/node0/hugepages/hugepages-2048kB/*
> > - /sys/kernel/mm/hugepages/hugepages-2048kB/*
> >
> > But only the /proc/sys/vm/nr_huge_pages provides the 'persistent'
> > semantics when reading from it. This inconsistency is very subtle and can
> > be easily misunderstood.
>
> Thanks for looking into this.
>
> The hugetlb documentation (./admin-guide/mm/hugetlbpage.rst) mentions
> the term 'persistent hugetlb pages', but never provides a definition.
>
> We can get the definition from the code as:
> #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
>
> Further, the documentation says:
> "The ``/proc/meminfo`` file provides information about the total number of
> persistent hugetlb pages in the kernel's huge page pool."
>
> "``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent"
> huge pages in the kernel's huge page pool."
>
> "The administrator may shrink the pool of persistent huge pages for
> the default huge page size by setting the ``nr_hugepages`` sysctl to a
> smaller value."
>
> So, the documentation implies that these interfaces should display the
> number of persistent hugetlb pages. As you have discovered, all but the
> sysctl interface (and /proc/sys/vm/nr_hugepages) displays the total
> number of hugetlb pages rather than the number of persistent hugetlb
> pages.
>
> If we wanted to match the documentation, it seems we should change all
> the "show" interfaces to display persistent huge pages. However, I am a
> bit concerned about how this may impact end users.
>
> There are two types of inconsistencies in these interfaces.
> 1) As this patch points out, not all "show" interfaces provide the same
> information.
> sysctl (/proc/sys/vm/nr_hugepages) displays the number
> of persistent hugetlb pages, while the others display the total number
> of hugetlb pages.
> 2) The show/read interfaces generally provide the total number of
> hugetlb pages, and the update/write interfaces update the number of
> persistent hugetlb pages.
>
> Both of these situations can lead to user confusion. My 'guess' is that
> this has not been a widespread issue as most hugetlb users do not
> configure overcommit/surplus hugetlb pages and thus total number of
> hugetlb pages is the same as number of persistent hugetlb pages.
>
> Right now, I would suggest making all these interfaces display/take the
> number of persistent hugetlb pages for consistency. This also matches
> the documentation.
>
> Thoughts?

I am concerned that modifying it this way may result in weaker control
over hugetlb pages. Administrators will no longer be able to increase
surplus pages through the nr_hugepages interface. Since surplus pages
depend on the state of programs in the entire system, adjusting
nr_hugepages may lead to an unexpected number of hugetlb pages being
allocated, which may lead to OOM.

The documentation is also ambiguous about the definition of
/proc/sys/vm/nr_hugepages and the meaning of "persistent". It says:

"The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool."

"Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such
that it becomes less than the number of huge pages in use will convert the
balance of the in-use huge pages to surplus huge pages."

"The ``/proc`` interfaces discussed above have been retained for backwards
compatibility."

The ambiguities are:
1. HugePages_Total in /proc/meminfo is actually the total number of
   hugetlb pages.
2. If nr_hugepages means persistent hugetlb pages, converting in-use huge
   pages to surplus huge pages is impossible.
3. As you know, backward compatibility is not retained.

Given that the document needs to be modified anyway, why not make the
interface more user-friendly?

Thanks,
Hu

> --
> Mike Kravetz
>
> >
> > Signed-off-by: Xueshi Hu
> > ---
> > mm/hugetlb.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index e327a5a7602c..76af189053f0 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4606,7 +4606,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
> >  			 void *buffer, size_t *length, loff_t *ppos)
> >  {
> >  	struct hstate *h = &default_hstate;
> > -	unsigned long tmp = h->max_huge_pages;
> > +	unsigned long tmp = h->nr_huge_pages;
> >  	int ret;
> >
> >  	if (!hugepages_supported())
> > --
> > 2.40.1
> >
> >