linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Xueshi Hu <xueshi.hu@smartx.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, muchun.song@linux.dev
Subject: Re: [PATCH 1/3] mm/hugetlb: fix the inconsistency of /proc/sys/vm/nr_huge_pages
Date: Wed, 2 Aug 2023 15:31:07 +0800	[thread overview]
Message-ID: <hyuifebwdkyhrn2s2rjwlsuh7ficbbqhrs3qigis3gal4if2pz@m6flznizw4cc> (raw)
In-Reply-To: <20230801184942.GA6544@monkey>

On Tue, Aug 01, 2023 at 11:49:42AM -0700, Mike Kravetz wrote:
> On 08/01/23 20:22, Xueshi Hu wrote:
> > On Tue, Aug 1, 2023 at 6:17 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 07/30/23 20:51, Xueshi Hu wrote:
> > > > When writing to /proc/sys/vm/nr_huge_pages, it indicates global number of
> > > > huge pages of the default hstate. But when reading from it, it indicates
> > > > the current number of "persistent" huge pages in the kernel's huge page
> > > > pool.
> > > >
> > > > There are currently four interfaces used to export the number of huge
> > > > pages:
> > > > - /proc/meminfo
> > > > - /proc/sys/vm/*hugepages*
> > > > - /sys/devices/system/node/node0/hugepages/hugepages-2048kB/*
> > > > - /sys/kernel/mm/hugepages/hugepages-2048kB/*
> > > >
> > > > But only the /proc/sys/vm/nr_huge_pages provides the 'persistent'
> > > > semantics when reading from it. This inconsistency is very subtle and can
> > > > be easily misunderstood.
> > >
> > > Thanks for looking into this.
> > >
> > > The hugetlb documentation (./admin-guide/mm/hugetlbpage.rst) mentions
> > > the term 'persistent hugetlb pages', but never provides a definition.
> > >
> > > We can get the definition from the code as:
> > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > >
> > > Further, the documentation says:
> > > "The ``/proc/meminfo`` file provides information about the total number of
> > >  persistent hugetlb pages in the kernel's huge page pool."
> > >
> > > "``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent"
> > >  huge pages in the kernel's huge page pool."
> > >
> > > "The administrator may shrink the pool of persistent huge pages for
> > >  the default huge page size by setting the ``nr_hugepages`` sysctl to a
> > >  smaller value."
> > >
> > > So, the documentation implies that these interfaces should display the
> > > number of persistent hugetlb pages.  As you have discovered, all but the
> > > sysctl interface (and /proc/sys/vm/nr_hugepages) displays the total
> > > number of hugetlb pages rather than the number of persistent hugetlb
> > > pages.
> > >
> > > If we wanted to match the documentation, it seems we should change all
> > > the "show" interfaces to display persistent huge pages.  However, I am a
> > > bit concerned about how this may impact end users.
> > >
> > > There are two types if inconsistencies in these interfaces.
> > > 1) As this patch points out, not all "show" interfaces provide the same
> > >    information.  sysctl (/proc/sys/vm/nr_hugepages) displays the number
> > >    of persistent hugetlb pages, while the others display the total number
> > >    of hugetlb pages.
> > > 2) The show/read interfaces generally provide the total number of
> > >    hugetlb pages, and the update/write interfaces update the number of
> > >    persistent hugetlb pages.
> > >
> > > Both of these situations can lead to user confusion.  My 'guess' is that
> > > this has not been a widespread issue as most hugetlb users do not
> > > configure overcommit/surplus hugetlb pages and thus total number of
> > > hugetlb pages is the same as number of persistent hugetlb pages.
> > >
> > > Right now, I would suggest making all these interfaces display/take the
> > > number of persistent hugetlb pages for consistency.  This also matches
> > > the documentation.
> > >
> > > Thoughts?
> > I am concerned that modifying it this way may result in an weaker control
> > over hugetlb pages. Administrator will no longer be able to increase
> > surplus pages through the nr_hugepages interface.
> > 
> > Since surplus pages depend on the state of programs in the entire
> > system, adjusting nr_hugepages may lead to an unexpected number of
> > hugetlbs allocated which may leads to oom.
> 
> Sorry, I am not sure I understand your concerns.
I'm wrong, just ignore what I've said.
> 
> Currently, the interfaces to set/update the number of hugetlb pages use
> the supplied count as the number of requested persistent pages.  I am
> not suggesting any changes there (except the bug in node specific code
> you discovered).  Rather, I am suggesting that we update the interfaces
> which show the number of hugepages (nr_hugepages) to display the number
> of persistent pages to be consistent with the set/update interfaces.
I agree with you.
> 
> > About the definition of /proc/sys/vm/nr_huge_pages and meaning of
> > "persistent", the documentation is kind of ambiguous.
> > 
> > The documentation says:
> > 
> > "The ``/proc/meminfo`` file provides information about the total number of
> > persistent hugetlb pages in the kernel's huge page pool."
> > 
> > "Caveat: Shrinking the persistent huge page pool via ``nr_hugepages``
> > such that it becomes less than the number of huge pages in use will
> > convert the balance of the in-use huge pages to surplus huge pages."
> > 
> > "The ``/proc`` interfaces discussed above have been retained for backwards
> > compatibility."
> > 
> > The ambiguities are:
> > 1. HugePages_Total in /proc/meminfo is actually the total number of
> > hugetlb pages.
> 
> Correct.  Although the documentation states it is the number of
> persistent hugetlb pages.  meminfo also contains the number of surplus
> huge pages.  So, it it possible that one could see
> 
> HugePages_Total: 0
> HugePages_Surp:  100
It's easy to fix.
> 
> Ideally, one would want to know the value for overcommit hugepages as
> well.
It will be straightforward to achieve this.
> 
> The sysfs directories /sys/kernel/mm/hugepages/hugepages-*/ contain both
> the surplus and overcommit counts.
> 
> node specific sysfs directories only contain surplus counts.
Node specific sysfs directories don't contain resv_hugepages too.
After resolving this issue, I will attempt to assess the feasibility
about how to implement node-specific reservations and overcommitment.
> 
> > 2. If nr_hugepages means persistent hugetlb pages, converting in-use huge
> > pages to surplus huge pages is impossible.
> 
> I am not sure I understand.  When writing to nr_hugepages today, it does
> mean persistent hugetlb pages.  Are you suggesting we change it to mean
> total hugetlb pages when writing/updating?  I do not think that is the
> case, as none of your proposed changes do this.
Still, I'm wrong.
> 
> > 3. As you know, backward compatibility is not retained.
> > 
> > Given that the document needs to be modified anyway, why not make the
> > interface more user-friendly?
> 
> In any case, I agree the document should be updated to match the code.
> It should also define persistent hugetlb pages.
Yes, I'll add it in the v2 patch.
> 
> Thank you,
> -- 
> Mike Kravetz


  reply	other threads:[~2023-08-02  7:31 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-30 12:51 [PATCH 0/3] mm/hugetlb: fix /sys and /proc fs dealing with persistent hugepages Xueshi Hu
2023-07-30 12:51 ` [PATCH 1/3] mm/hugetlb: fix the inconsistency of /proc/sys/vm/nr_huge_pages Xueshi Hu
2023-07-31 22:17   ` Mike Kravetz
2023-08-01 12:22     ` Xueshi Hu
2023-08-01 18:49       ` Mike Kravetz
2023-08-02  7:31         ` Xueshi Hu [this message]
2023-08-02 18:20           ` Mike Kravetz
2023-07-30 12:51 ` [PATCH 2/3] mm/hugeltb: clean up hstate::max_huge_pages Xueshi Hu
2023-07-30 12:51 ` [PATCH 3/3] mm/hugeltb: fix nodes huge page allocation when there are surplus pages Xueshi Hu
2023-07-31 22:56   ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=hyuifebwdkyhrn2s2rjwlsuh7ficbbqhrs3qigis3gal4if2pz@m6flznizw4cc \
    --to=xueshi.hu@smartx.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=muchun.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox