linux-mm.kvack.org archive mirror
From: Nico Pache <npache@redhat.com>
To: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	aarcange@redhat.com, akpm@linux-foundation.org,
	anshuman.khandual@arm.com, apopple@nvidia.com,
	baohua@kernel.org, baolin.wang@linux.alibaba.com,
	byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org,
	corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org,
	dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org,
	hughd@google.com, jack@suse.cz, jackmanb@google.com,
	jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com,
	kas@kernel.org, lance.yang@linux.dev, Liam.Howlett@oracle.com,
	lorenzo.stoakes@oracle.com, mathieu.desnoyers@efficios.com,
	matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com,
	peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com,
	raquini@redhat.com, rdunlap@infradead.org,
	richard.weiyang@gmail.com, rientjes@google.com,
	rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com,
	shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com,
	thomas.hellstrom@linux.intel.com, tiwai@suse.de,
	usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com,
	wangkefeng.wang@huawei.com, will@kernel.org,
	willy@infradead.org, yang@os.amperecomputing.com,
	ying.huang@linux.alibaba.com, ziy@nvidia.com,
	zokeefe@google.com
Subject: Re: [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
Date: Sun, 12 Apr 2026 20:48:29 -0600
Message-ID: <CAA1CXcCS9gWySN1oQzEYpALfURxBwt58us9tkAbNPnHOKmLd5g@mail.gmail.com>
In-Reply-To: <c832d503-8b8c-487a-b61a-df74a3057308@lucifer.local>

On Tue, Mar 17, 2026 at 11:05 AM Lorenzo Stoakes (Oracle)
<ljs@kernel.org> wrote:
>
> On Wed, Feb 25, 2026 at 08:25:04PM -0700, Nico Pache wrote:
> > Add three new mTHP statistics to track collapse failures for different
> > orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> >
> > - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
> >       PTEs
> >
> > - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
> >       exceeding the none PTE threshold for the given order
> >
> > - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
> >       PTEs
> >
> > These statistics complement the existing THP_SCAN_EXCEED_* events by
> > providing per-order granularity for mTHP collapse attempts. The stats are
> > exposed via sysfs under
> > `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> > supported hugepage size.
> >
> > As we currently don't support collapsing mTHPs that contain a swap or
> > shared entry, those statistics keep track of how often we are
> > encountering failed mTHP collapses due to these restrictions.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
> >  include/linux/huge_mm.h                    |  3 +++
> >  mm/huge_memory.c                           |  7 +++++++
> >  mm/khugepaged.c                            | 16 ++++++++++++---
> >  4 files changed, 47 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index c51932e6275d..eebb1f6bbc6c 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -714,6 +714,30 @@ nr_anon_partially_mapped
> >         an anonymous THP as "partially mapped" and count it here, even though it
> >         is not actually partially mapped anymore.
> >
> > +collapse_exceed_none_pte
> > +       The number of collapse attempts that failed due to exceeding the
> > +       max_ptes_none threshold. For mTHP collapse, Currently only max_ptes_none
> > +       values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
> > +       emit a warning and no mTHP collapse will be attempted. khugepaged will
>
> It's weird to document this here but not elsewhere in the document? I mean I
> made this comment on the documentation patch also.

I can add more documentation, but TBH I don't really know where or
what else to put. I checked a few of the other per-mTHP stats, and
none are referenced elsewhere. If anything, these 3 additions are the
best documented ones.

>
> Not sure if I missed you adding it to another bit of the docs? :)
>
> > +       try to collapse to the largest enabled (m)THP size; if it fails, it will
> > +       try the next lower enabled mTHP size. This counter records the number of
> > +       times a collapse attempt was skipped for exceeding the max_ptes_none
> > +       threshold, and khugepaged will move on to the next available mTHP size.
> > +
> > +collapse_exceed_swap_pte
> > +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> > +       to containing at least one swap PTE. Currently khugepaged does not
> > +       support collapsing mTHP regions that contain a swap PTE. This counter can
> > +       be used to monitor the number of khugepaged mTHP collapses that failed
> > +       due to the presence of a swap PTE.
> > +
> > +collapse_exceed_shared_pte
> > +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> > +       to containing at least one shared PTE. Currently khugepaged does not
> > +       support collapsing mTHP PTE ranges that contain a shared PTE. This
> > +       counter can be used to monitor the number of khugepaged mTHP collapses
> > +       that failed due to the presence of a shared PTE.
>
> All of these talk about 'ranges' that could be of any size. Are these useful
> metrics? Counting a bunch of failures and not knowing if they are 256 KB
> failures or 16 KB failures or whatever is maybe not so useful information?

These are per-mTHP size statistics. If you look at the surrounding
examples and docs this all makes more sense.
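
As a rough illustration of the per-size point (a userspace sketch, not
kernel code), each order corresponds to one hugepages-<size>kB
directory, so a 16 kB failure and a 256 kB failure land in different
counters. The helper name order_to_kb and the BASE_PAGE_KB constant are
hypothetical, and the sketch assumes 4 KiB base pages:

```c
#include <assert.h>

/* Assumption: 4 KiB base pages; other configurations differ. */
#define BASE_PAGE_KB 4UL

/*
 * Hypothetical helper: size in kB of a folio of the given order.
 * This is the <size> used in the hugepages-<size>kB directory name.
 */
static unsigned long order_to_kb(unsigned int order)
{
	assert(order < 32);	/* keep the shift well defined */
	return BASE_PAGE_KB << order;
}
```

Under that assumption, order 2 maps to hugepages-16kB and order 6 to
hugepages-256kB, each with its own stats/ files.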

>
> Also, from the code, aren't you treating PMD events the same as mTHP ones from
> the point of view of these counters? Maybe worth documenting that?

IIUC, yes, but that is true of all of these:

```
In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, There are
also individual counters for each huge page size, which can be utilized to
monitor the system's effectiveness in providing huge pages for usage. Each
counter has its own corresponding file.
```
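
For reference, a minimal userspace sketch of reading one of these
per-size counter files. The helper names mthp_stat_path and
read_counter are hypothetical, and the sketch assumes the file holds a
single unsigned decimal value like the other files in that directory:

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the path of one per-size counter file.
 * Returns the snprintf() result so the caller can detect truncation.
 */
static int mthp_stat_path(char *buf, size_t len, unsigned long size_kb,
			  const char *name)
{
	return snprintf(buf, len,
			"/sys/kernel/mm/transparent_hugepage/hugepages-%lukB/stats/%s",
			size_kb, name);
}

/* Read a single unsigned counter from a file; 0 on success, -1 on error. */
static int read_counter(const char *path, unsigned long long *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would build the path for, say, size 64 and
collapse_exceed_none_pte, then treat a -1 return as "size not
supported or stat not present on this kernel".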

>
> > +
> >  As the system ages, allocating huge pages may be expensive as the
> >  system uses memory compaction to copy data around memory to free a
> >  huge page for use. There are some counters in ``/proc/vmstat`` to help
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 9941fc6d7bd8..e8777bb2347d 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -144,6 +144,9 @@ enum mthp_stat_item {
> >       MTHP_STAT_SPLIT_DEFERRED,
> >       MTHP_STAT_NR_ANON,
> >       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> > +     MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> > +     MTHP_STAT_COLLAPSE_EXCEED_NONE,
> > +     MTHP_STAT_COLLAPSE_EXCEED_SHARED,
> >       __MTHP_STAT_COUNT
> >  };
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 228f35e962b9..1049a207a257 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -642,6 +642,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> >  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> >  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
> >  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>
> Is there a reason there's such a difference between the names and the actual
> enum names?

Good point, I didn't think about that. I can update those as long as
they don't conflict with something else (I forget why I named them
like this).

>
> > +
> >
> >  static struct attribute *anon_stats_attrs[] = {
> >       &anon_fault_alloc_attr.attr,
> > @@ -658,6 +662,9 @@ static struct attribute *anon_stats_attrs[] = {
> >       &split_deferred_attr.attr,
> >       &nr_anon_attr.attr,
> >       &nr_anon_partially_mapped_attr.attr,
> > +     &collapse_exceed_swap_pte_attr.attr,
> > +     &collapse_exceed_none_pte_attr.attr,
> > +     &collapse_exceed_shared_pte_attr.attr,
> >       NULL,
> >  };
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c739f26dd61e..a6cf90e09e4a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -595,7 +595,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                               continue;
> >                       } else {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > -                             count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > +                             if (is_pmd_order(order))
> > +                                     count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
>
> It's a bit gross to have separate stats for both thp and mthp but maybe
> unavoidable from a legacy stand point.

I agree but that's how it currently is. Perhaps we can add this to the
TODO list for THP work.
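
To spell out the accounting in the hunk above, here is a small
userspace model (not the kernel code): the legacy vm event is bumped
only for PMD order, while the per-order mTHP stat is bumped for every
order. All model_* names are hypothetical stand-ins, and the PMD order
of 9 assumes 4 KiB pages with 2 MiB PMDs:

```c
#include <stdbool.h>

/* Assumption: a PMD maps 512 base pages (order 9). */
#define MODEL_PMD_ORDER 9
#define MODEL_NR_ORDERS 10

/* Stand-ins for THP_SCAN_EXCEED_NONE_PTE and the per-order mTHP stat. */
static unsigned long model_thp_scan_exceed_none_pte;
static unsigned long model_mthp_exceed_none[MODEL_NR_ORDERS];

static bool model_is_pmd_order(unsigned int order)
{
	return order == MODEL_PMD_ORDER;
}

/* Mirrors the hunk: legacy event only at PMD order, mTHP stat always. */
static void model_account_exceed_none(unsigned int order)
{
	if (order >= MODEL_NR_ORDERS)
		return;		/* out of range for this model, ignore */
	if (model_is_pmd_order(order))
		model_thp_scan_exceed_none_pte++;
	model_mthp_exceed_none[order]++;
}
```

So one order-4 failure plus one order-9 failure leaves the legacy
counter at 1 and each of the two per-order counters at 1.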

>
> Why are we dropping the _PTE suffix?

I followed the convention that the other mTHP stats use, for example
MTHP_STAT_SPLIT_DEFERRED.

>
> >                               goto out;
> >                       }
> >               }
> > @@ -631,10 +633,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                        * shared may cause a future higher order collapse on a
> >                        * rescan of the same range.
> >                        */
> > -                     if (!is_pmd_order(order) || (cc->is_khugepaged &&
> > -                         shared > khugepaged_max_ptes_shared)) {
>
> OK losing track here :) as the series sadly doesn't currently apply so can't
> browse the file as is.
>
> In the code I'm looking at, there's also a ++shared here that I guess another
> patch removed?
>
> Is this in the folio_maybe_mapped_shared() branch?

Yes, the counting is now done at the top of that branch.

>
> > +                     if (!is_pmd_order(order)) {
> > +                             result = SCAN_EXCEED_SHARED_PTE;
> > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > +                             goto out;
> > +                     }
> > +
> > +                     if (cc->is_khugepaged &&
> > +                         shared > khugepaged_max_ptes_shared) {
> >                               result = SCAN_EXCEED_SHARED_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> >                               goto out;
>
> Anyway I'm a bit lost on this logic until a respin but this looks like a LOT of
> code duplication. I see David alluded to a refactoring so maybe what he suggests
> will help (not had a chance to check what it is specifically :P)

Yep :) It should look cleaner in the next one, although it's quite a bit
of refactoring. I'll be praying that I got it right on the first go
and put all the other pieces in the desired spot.

>
> >                       }
> >               }
> > @@ -1081,6 +1090,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >                * range.
> >                */
> >               if (!is_pmd_order(order)) {
> > +                     count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>
> Hmm I thought we were incrementing mthp stats for pmd sized also?

Yes, we are supposed to. I've already refactored and it looks fine
there... perhaps I missed this one in this version!

Cheers,

-- Nico

>
> >                       pte_unmap(pte);
> >                       mmap_read_unlock(mm);
> >                       result = SCAN_EXCEED_SWAP_PTE;
> > --
> > 2.53.0
> >
>
> Cheers, Lorenzo
>


