From: Xueshi Hu <xueshi.hu@smartx.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, muchun.song@linux.dev
Subject: Re: [PATCH 1/3] mm/hugetlb: fix the inconsistency of /proc/sys/vm/nr_huge_pages
Date: Wed, 2 Aug 2023 15:31:07 +0800 [thread overview]
Message-ID: <hyuifebwdkyhrn2s2rjwlsuh7ficbbqhrs3qigis3gal4if2pz@m6flznizw4cc> (raw)
In-Reply-To: <20230801184942.GA6544@monkey>
On Tue, Aug 01, 2023 at 11:49:42AM -0700, Mike Kravetz wrote:
> On 08/01/23 20:22, Xueshi Hu wrote:
> > On Tue, Aug 1, 2023 at 6:17 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 07/30/23 20:51, Xueshi Hu wrote:
> > > > When writing to /proc/sys/vm/nr_huge_pages, it indicates global number of
> > > > huge pages of the default hstate. But when reading from it, it indicates
> > > > the current number of "persistent" huge pages in the kernel's huge page
> > > > pool.
> > > >
> > > > There are currently four interfaces used to export the number of huge
> > > > pages:
> > > > - /proc/meminfo
> > > > - /proc/sys/vm/*hugepages*
> > > > - /sys/devices/system/node/node0/hugepages/hugepages-2048kB/*
> > > > - /sys/kernel/mm/hugepages/hugepages-2048kB/*
> > > >
> > > > But only the /proc/sys/vm/nr_huge_pages provides the 'persistent'
> > > > semantics when reading from it. This inconsistency is very subtle and can
> > > > be easily misunderstood.
> > >
> > > Thanks for looking into this.
> > >
> > > The hugetlb documentation (./admin-guide/mm/hugetlbpage.rst) mentions
> > > the term 'persistent hugetlb pages', but never provides a definition.
> > >
> > > We can get the definition from the code as:
> > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > >
> > > Further, the documentation says:
> > > "The ``/proc/meminfo`` file provides information about the total number of
> > > persistent hugetlb pages in the kernel's huge page pool."
> > >
> > > "``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent"
> > > huge pages in the kernel's huge page pool."
> > >
> > > "The administrator may shrink the pool of persistent huge pages for
> > > the default huge page size by setting the ``nr_hugepages`` sysctl to a
> > > smaller value."
> > >
> > > So, the documentation implies that these interfaces should display the
> > > number of persistent hugetlb pages. As you have discovered, all but the
> > > sysctl interface (and /proc/sys/vm/nr_hugepages) displays the total
> > > number of hugetlb pages rather than the number of persistent hugetlb
> > > pages.
> > >
> > > If we wanted to match the documentation, it seems we should change all
> > > the "show" interfaces to display persistent huge pages. However, I am a
> > > bit concerned about how this may impact end users.
> > >
> > > There are two types if inconsistencies in these interfaces.
> > > 1) As this patch points out, not all "show" interfaces provide the same
> > > information. sysctl (/proc/sys/vm/nr_hugepages) displays the number
> > > of persistent hugetlb pages, while the others display the total number
> > > of hugetlb pages.
> > > 2) The show/read interfaces generally provide the total number of
> > > hugetlb pages, and the update/write interfaces update the number of
> > > persistent hugetlb pages.
> > >
> > > Both of these situations can lead to user confusion. My 'guess' is that
> > > this has not been a widespread issue as most hugetlb users do not
> > > configure overcommit/surplus hugetlb pages and thus total number of
> > > hugetlb pages is the same as number of persistent hugetlb pages.
> > >
> > > Right now, I would suggest making all these interfaces display/take the
> > > number of persistent hugetlb pages for consistency. This also matches
> > > the documentation.
> > >
> > > Thoughts?
> > I am concerned that modifying it this way may result in an weaker control
> > over hugetlb pages. Administrator will no longer be able to increase
> > surplus pages through the nr_hugepages interface.
> >
> > Since surplus pages depend on the state of programs in the entire
> > system, adjusting nr_hugepages may lead to an unexpected number of
> > hugetlbs allocated which may leads to oom.
>
> Sorry, I am not sure I understand your concerns.
I'm wrong, just ignore what I've said.
>
> Currently, the interfaces to set/update the number of hugetlb pages use
> the supplied count as the number of requested persistent pages. I am
> not suggesting any changes there (except the bug in node specific code
> you discovered). Rather, I am suggesting that we update the interfaces
> which show the number of hugepages (nr_hugepages) to display the number
> of persistent pages to be consistent with the set/update interfaces.
I agree with you.
>
> > About the definition of /proc/sys/vm/nr_huge_pages and meaning of
> > "persistent", the documentation is kind of ambiguous.
> >
> > The documentation says:
> >
> > "The ``/proc/meminfo`` file provides information about the total number of
> > persistent hugetlb pages in the kernel's huge page pool."
> >
> > "Caveat: Shrinking the persistent huge page pool via ``nr_hugepages``
> > such that it becomes less than the number of huge pages in use will
> > convert the balance of the in-use huge pages to surplus huge pages."
> >
> > "The ``/proc`` interfaces discussed above have been retained for backwards
> > compatibility."
> >
> > The ambiguities are:
> > 1. HugePages_Total in /proc/meminfo is actually the total number of
> > hugetlb pages.
>
> Correct. Although the documentation states it is the number of
> persistent hugetlb pages. meminfo also contains the number of surplus
> huge pages. So, it it possible that one could see
>
> HugePages_Total: 0
> HugePages_Surp: 100
It's easy to fix.
>
> Ideally, one would want to know the value for overcommit hugepages as
> well.
It will be straightforward to achieve this.
>
> The sysfs directories /sys/kernel/mm/hugepages/hugepages-*/ contain both
> the surplus and overcommit counts.
>
> node specific sysfs directories only contain surplus counts.
Node specific sysfs directories don't contain resv_hugepages too.
After resolving this issue, I will attempt to assess the feasibility
about how to implement node-specific reservations and overcommitment.
>
> > 2. If nr_hugepages means persistent hugetlb pages, converting in-use huge
> > pages to surplus huge pages is impossible.
>
> I am not sure I understand. When writing to nr_hugepages today, it does
> mean persistent hugetlb pages. Are you suggesting we change it to mean
> total hugetlb pages when writing/updating? I do not think that is the
> case, as none of your proposed changes do this.
Still, I'm wrong.
>
> > 3. As you know, backward compatibility is not retained.
> >
> > Given that the document needs to be modified anyway, why not make the
> > interface more user-friendly?
>
> In any case, I agree the document should be updated to match the code.
> It should also define persistent hugetlb pages.
Yes, I'll add it in the v2 patch.
>
> Thank you,
> --
> Mike Kravetz
next prev parent reply other threads:[~2023-08-02 7:31 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-30 12:51 [PATCH 0/3] mm/hugetlb: fix /sys and /proc fs dealing with persistent hugepages Xueshi Hu
2023-07-30 12:51 ` [PATCH 1/3] mm/hugetlb: fix the inconsistency of /proc/sys/vm/nr_huge_pages Xueshi Hu
2023-07-31 22:17 ` Mike Kravetz
2023-08-01 12:22 ` Xueshi Hu
2023-08-01 18:49 ` Mike Kravetz
2023-08-02 7:31 ` Xueshi Hu [this message]
2023-08-02 18:20 ` Mike Kravetz
2023-07-30 12:51 ` [PATCH 2/3] mm/hugeltb: clean up hstate::max_huge_pages Xueshi Hu
2023-07-30 12:51 ` [PATCH 3/3] mm/hugeltb: fix nodes huge page allocation when there are surplus pages Xueshi Hu
2023-07-31 22:56 ` Mike Kravetz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=hyuifebwdkyhrn2s2rjwlsuh7ficbbqhrs3qigis3gal4if2pz@m6flznizw4cc \
--to=xueshi.hu@smartx.com \
--cc=akpm@linux-foundation.org \
--cc=linux-mm@kvack.org \
--cc=mike.kravetz@oracle.com \
--cc=muchun.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox