From: Nico Pache <npache@redhat.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	 linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,  linux-mm@kvack.org,
	linux-doc@vger.kernel.org, ziy@nvidia.com,
	 Liam.Howlett@oracle.com, ryan.roberts@arm.com, dev.jain@arm.com,
	 corbet@lwn.net, rostedt@goodmis.org, mhiramat@kernel.org,
	 mathieu.desnoyers@efficios.com, akpm@linux-foundation.org,
	baohua@kernel.org,  willy@infradead.org, peterx@redhat.com,
	wangkefeng.wang@huawei.com,  usamaarif642@gmail.com,
	sunnanyong@huawei.com, vishal.moola@gmail.com,
	 thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com,
	kas@kernel.org,  aarcange@redhat.com, raquini@redhat.com,
	anshuman.khandual@arm.com,  catalin.marinas@arm.com,
	tiwai@suse.de, will@kernel.org,  dave.hansen@linux.intel.com,
	jack@suse.cz, cl@gentwo.org, jglisse@google.com,
	 surenb@google.com, zokeefe@google.com, hannes@cmpxchg.org,
	 rientjes@google.com, mhocko@suse.com, rdunlap@infradead.org,
	hughd@google.com,  richard.weiyang@gmail.com,
	lance.yang@linux.dev, vbabka@suse.cz,  rppt@kernel.org,
	jannh@google.com, pfalcato@suse.de
Subject: Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function
Date: Tue, 28 Oct 2025 20:49:02 -0600	[thread overview]
Message-ID: <CAA1CXcDZoJEFi2_aDbOa19=zHZstV-v+i2z3bDxB8bin=6m3Dw@mail.gmail.com> (raw)
In-Reply-To: <b1f8c5e3-0849-4c04-9ee3-5a0183d3af9b@linux.alibaba.com>

On Tue, Oct 28, 2025 at 8:10 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/10/29 02:59, Lorenzo Stoakes wrote:
> > On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
> >>
> >>>>> Hey Lorenzo,
> >>>>>
> >>>>>> I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> >>>>>> to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> >>>>>> implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> >>>>>
> >>>>> I spoke to David and he said to continue forward with this series; the
> >>>>> "eagerness" tunable will take some time, and may require further
> >>>>> considerations/discussion.
> >>>>
> >>>> Right, after talking to Johannes it got clearer that what we envisioned with
> >>>
> >>> I'm not sure that you meant to say go ahead with the series as-is with this
> >>> silent capping?
> >>
> >> No, "go ahead" as in "let's find some way forward that works for all and is
> >> not too crazy".
> >
> > Right, we clearly needed to discuss that further at the time, but that's moot now,
> > we're figuring it out now :)
> >
> >>
> >> [...]
> >>
> >>>> "eagerness" would not be like swappiness, and we will really have to be
> >>>> careful here. I don't know yet when I will have time to look into that.
> >>>
> >>> I guess I missed this part of the conversation, what do you mean?
> >>
> >> Johannes raised issues with that on the list and afterwards we had an
> >> offline discussion about some of the details and why something unpredictable
> >> is not good.
> >
> > Could we get these details on-list so we can discuss them? This doesn't have to
> > be urgent, but I would like to have a say in this or at least be part of the
> > conversation, please.
> >
> >>
> >>>
> >>> The whole concept is that we have a parameter whose value is _abstracted_ and
> >>> which we control what it means.
> >>>
> >>> I'm not sure exactly why that would now be problematic? The fundamental concept
> >>> seems sound no? Last I remember of the conversation this was the case.
> >>
> >> The basic idea was to do something abstracted, like swappiness. Turns out
> >> "swappiness" is really something predictable, not something whose behaviour
> >> we can randomly change under the hood.
> >>
> >> So we'd have to find something similar for "eagerness", and that's where it
> >> stops being easy.
> >
> > I think we shouldn't be too stuck on
> >
> >>
> >>>
> >>>>
> >>>> If we want to avoid the implicit capping, I think there are the following
> >>>> possible approaches
> >>>>
> >>>> (1) Tolerate creep for now, maybe warning if the user configures it.
> >>>
> >>> I mean this seems a viable option if there is pressure to land this series
> >>> before we have a viable uAPI for configuring this.
> >>>
> >>> A part of me thinks we shouldn't rush series in for that reason though and
> >>> should require that we have a proper control here.
> >>>
> >>> But I guess this approach is the least-worst as it leaves us with the most
> >>> options moving forwards.
> >>
> >> Yes. There is also the alternative of respecting only 0 / 511 for mTHP
> >> collapse for now as discussed in the other thread.
> >
> > Yes I guess let's carry that on over there.
> >
> > I mean this is why I said it's better to try to keep things in one thread :) but
> > anyway, we've forked and it can't be helped now.
> >
> > To be clear that was a criticism of - email development - not you.
> >
> > It's _extremely easy_ to have this happen because one thread naturally leads to
> > a broader discussion of a given topic, whereas another has questions from
> > somebody else about the same topic, to which people reply and then... you have a
> > fork and it can't be helped.
> >
> > I guess I'm saying it'd be good if we could say 'ok let's move this to X'.
> >
> > But that's also broken in its own way, you can't stop people from replying in
> > the other thread still and yeah. It's a limitation of this model :)
> >
> >>
> >>>
> >>>> (2) Avoid creep by counting zero-filled pages towards none_or_zero.
> >>>
> >>> Would this really make all that much difference?
> >>
> >> It solves the creep problem I think, but it's a bit nasty IMHO.
> >
> > Ah, because you'd end up with a bunch of zeroed pages from the prior mTHP
> > collapses, interesting...
> >
> > Scanning for that does seem a bit nasty though yes...
> >
> >>
> >>>
> >>>> (3) Have separate toggles for each THP size. Doesn't quite solve the
> >>>>       problem, only shifts it.
> >>>
> >>> Yeah I did wonder about this as an alternative solution. But of course it then
> >>> makes it vague what the parent value means with respect to the individual levels,
> >>> unless we have an 'inherit' mode there too (possible).
> >>>
> >>> It's going to be confusing though as max_ptes_none sits at the root khugepaged/
> >>> level and I don't think any other parameter from khugepaged/ is exposed at
> >>> individual page size levels.
> >>>
> >>> And of course doing this means we
> >>>
> >>>>
> >>>> Anything else?
> >>>
> >>> Err... I mean I'm not sure if you missed it but I suggested an approach in the
> >>> sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
> >>>
> >>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> >>>
> >>> Then we allow the capping, but simply document that we specify what the capped
> >>> value will be here for mTHP.
> >>
> >> I did not have time to read the details on that so far.
> >
> > OK. It is a bit nasty, yes. The idea is to find something that allows the
> > capping to work.
> >
> >>
> >> It would be one solution forward. I dislike it because I think the whole
> >> capping is an intermediate thing that can be (and likely must be, when
> >> considering mTHP underused shrinking I think) solved in the future
> >> differently. That's why I would prefer adding this only if there is no
> >> other, simpler, way forward.
> >
> > Yes I agree that if we could avoid it it'd be great.
> >
> > Really I proposed this solution on the basis that we were somehow ok with the
> > capping.
> >
> > If we can avoid it, that'd be ideal, as it reduces complexity and 'unexpected'
> > behaviour.
> >
> > We'll clarify on the other thread, but the 511/0 was compelling to me before as
> > a simplification, and if we can have a straightforward model of how mTHP
> > collapse across none/zero page PTEs behaves this is ideal.
> >
> > The only question is w.r.t. warnings etc. but we can handle details there.
> >
> >>
> >>>
> >>> That struck me as the simplest way of getting this series landed without
> >>> necessarily violating any future eagerness which:
> >>>
> >>> a. Must still support khugepaged/max_ptes_none - we aren't getting away from
> >>>      this, it's uAPI.
> >>>
> >>> b. Surely must want to do different things for mTHP in eagerness, so if we're
> >>>      exposing some PTE value in max_ptes_none doing so in
> >>>      khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
> >>>      readonly so unlike max_ptes_none we don't have to worry about the other
> >>>      direction).
> >>>
> >>> HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
> >>> which case perhaps mthp_max_ptes_none would be problematic in that it is some
> >>> kind of average.
> >>>
> >>> Then again we could always revert to putting this parameter as in (3) in that
> >>> case, ugly but kinda viable.
> >>>
> >>>>
> >>>> IIUC, creep is less of a problem when we have the underused shrinker
> >>>> enabled: whatever we over-allocated can (unless longterm-pinned etc) get
> >>>> reclaimed again.
> >>>>
> >>>> So maybe having underused-shrinker support for mTHP as well would be a
> >>>> solution to tackle (1) later?
> >>>
> >>> How viable is this in the short term?
> >>
> >> I once started looking into it, but it will require quite some work, because
> >> the lists will essentially include each and every (m)THP in the system ...
> >> so I think we will need some redesign.
> >
> > Ack.
> >
> > This aligns with non-0/511 settings being non-functional for mTHP atm anyway.
> >
> >>
> >>>
> >>> Another possible solution:
> >>>
> >>> If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
> >>>
> >>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
> >>>
> >>> As a simple boolean. If switched on then we document that it caps mTHP as
> >>> per Nico's suggestion.
> >>>
> >>> That way we avoid the 'silent' issue I have with all this and it's an
> >>> explicit setting.
> >>
> >> Right, but it's another toggle I wish we wouldn't need. We could of course
> >> also make it some compile-time option, but not sure if that's really any
> >> better.
> >>
> >> I'd hope we find an easy way forward that doesn't require new toggles, at
> >> least for now ...
> >
> > Right, well I agree if we can make this 0/511 thing work, let's do that.
> >
> > Toggles are just 'least-worst' workarounds on the assumption that capping is needed.
>
> I finally finished reading through the discussions across multiple
> threads:), and it looks like we've reached a preliminary consensus (make
> 0/511 work). Great and thanks!
>
> IIUC, the strategy is: configuring it to 511 means always enabling mTHP
> collapse, configuring it to 0 means collapsing mTHP only if all PTEs are
> non-none/zero, and for other values we issue a warning and prohibit
> mTHP collapse (avoiding Lorenzo's concern about silently changing
> max_ptes_none). Then the implementation of collapse_max_ptes_none()
> would be as follows:
>
> static int collapse_max_ptes_none(unsigned int order, bool full_scan)
> {
>         /* ignore max_ptes_none limits */
>         if (full_scan)
>                 return HPAGE_PMD_NR - 1;
>
>         if (order == HPAGE_PMD_ORDER)
>                 return khugepaged_max_ptes_none;
>
>         /*
>          * To prevent creeping towards larger order collapses, for mTHP
>          * collapse we restrict khugepaged_max_ptes_none to only 511 or 0,
>          * simplifying the logic. This means:
>          * max_ptes_none == 511 -> collapse mTHP always
>          * max_ptes_none == 0   -> collapse mTHP only if all PTEs are
>          *                         non-none/zero
>          */
>         if (!khugepaged_max_ptes_none ||
>             khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>                 return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
>         pr_warn_once("mTHP collapse only supports khugepaged_max_ptes_none configured as 0 or %d\n",
>                      HPAGE_PMD_NR - 1);
>         return -EINVAL;
> }
>
> So what do you think?

Yes, I'm glad we finally came to some consensus, despite it being a
less-than-ideal solution.

Hopefully the eagerness patchset re-introduces all the lost
functionality in the future.
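
For anyone skimming the thread, here is a quick user-space sketch (not the
kernel code itself; it assumes HPAGE_PMD_ORDER == 9, as on x86-64 with 4K
pages) of what the 0/511 rule in collapse_max_ptes_none() works out to per
order:

#include <stdio.h>

#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)

int main(void)
{
        /* 511 == "always collapse"; set this to 0 to see the other case */
        unsigned int max_ptes_none = HPAGE_PMD_NR - 1;
        unsigned int order;

        for (order = 2; order <= HPAGE_PMD_ORDER; order++) {
                /*
                 * 511 >> (HPAGE_PMD_ORDER - order) == (1 << order) - 1,
                 * i.e. the "all but one PTE may be none" limit scaled
                 * down to the smaller collapse size; 0 simply stays 0.
                 */
                printf("order %u -> max_ptes_none %u of %u PTEs\n",
                       order, max_ptes_none >> (HPAGE_PMD_ORDER - order),
                       1u << order);
        }
        return 0;
}

So with 511 you get 3/4 at order 2, 15/16 at order 4, and 255/256 at
order 8; with 0, every order requires fully populated, non-zero PTEs.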

Cheers
-- Nico

>




Thread overview: 91+ messages
2025-10-22 18:37 [PATCH v12 mm-new 00/15] khugepaged: mTHP support Nico Pache
2025-10-22 18:37 ` [PATCH v12 mm-new 01/15] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
2025-11-08  1:42   ` Wei Yang
2025-10-22 18:37 ` [PATCH v12 mm-new 02/15] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-10-27  9:00   ` Lance Yang
2025-10-27 15:44   ` Lorenzo Stoakes
2025-11-08  1:44   ` Wei Yang
2025-10-22 18:37 ` [PATCH v12 mm-new 03/15] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-10-27  9:02   ` Lance Yang
2025-11-08  1:54   ` Wei Yang
2025-10-22 18:37 ` [PATCH v12 mm-new 04/15] khugepaged: generalize alloc_charge_folio() Nico Pache
2025-10-27  9:05   ` Lance Yang
2025-11-08  2:34   ` Wei Yang
2025-10-22 18:37 ` [PATCH v12 mm-new 05/15] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2025-10-27  9:17   ` Lance Yang
2025-10-27 16:00   ` Lorenzo Stoakes
2025-11-10 13:20     ` Nico Pache
2025-11-08  3:01   ` Wei Yang
2025-10-22 18:37 ` [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
2025-10-27 17:53   ` Lorenzo Stoakes
2025-10-28 10:09     ` Baolin Wang
2025-10-28 13:57       ` Nico Pache
2025-10-28 17:07       ` Lorenzo Stoakes
2025-10-28 17:56         ` David Hildenbrand
2025-10-28 18:09           ` Lorenzo Stoakes
2025-10-28 18:17             ` David Hildenbrand
2025-10-28 18:41               ` Lorenzo Stoakes
2025-10-29 15:04                 ` David Hildenbrand
2025-10-29 18:41                   ` Lorenzo Stoakes
2025-10-29 21:10                     ` Nico Pache
2025-10-30 18:03                       ` Lorenzo Stoakes
2025-10-29 20:45                   ` Nico Pache
2025-10-28 13:36     ` Nico Pache
2025-10-28 14:15       ` David Hildenbrand
2025-10-28 17:29         ` Lorenzo Stoakes
2025-10-28 17:36           ` Lorenzo Stoakes
2025-10-28 18:08           ` David Hildenbrand
2025-10-28 18:59             ` Lorenzo Stoakes
2025-10-28 19:08               ` Lorenzo Stoakes
2025-10-29  2:09               ` Baolin Wang
2025-10-29  2:49                 ` Nico Pache [this message]
2025-10-29 18:55                 ` Lorenzo Stoakes
2025-10-29 21:14                   ` Nico Pache
2025-10-30  1:15                     ` Baolin Wang
2025-10-29  2:47               ` Nico Pache
2025-10-29 18:58                 ` Lorenzo Stoakes
2025-10-29 21:23                   ` Nico Pache
2025-10-30 10:15                     ` Lorenzo Stoakes
2025-10-31 11:12               ` David Hildenbrand
2025-10-28 16:57       ` Lorenzo Stoakes
2025-10-28 17:49         ` David Hildenbrand
2025-10-28 17:59           ` Lorenzo Stoakes
2025-10-22 18:37 ` [PATCH v12 mm-new 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
2025-10-27  3:25   ` Baolin Wang
2025-11-06 18:14   ` Lorenzo Stoakes
2025-11-07  3:09     ` Dev Jain
2025-11-07  9:18       ` Lorenzo Stoakes
2025-11-07 19:33     ` Nico Pache
2025-10-22 18:37 ` [PATCH v12 mm-new 08/15] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-10-22 18:37 ` [PATCH v12 mm-new 09/15] khugepaged: add per-order mTHP collapse failure statistics Nico Pache
2025-11-06 18:45   ` Lorenzo Stoakes
2025-11-07 17:14     ` Nico Pache
2025-10-22 18:37 ` [PATCH v12 mm-new 10/15] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-10-22 18:37 ` [PATCH v12 mm-new 11/15] khugepaged: introduce collapse_allowable_orders helper function Nico Pache
2025-11-06 18:49   ` Lorenzo Stoakes
2025-11-07 18:01     ` Nico Pache
2025-10-22 18:37 ` [PATCH v12 mm-new 12/15] khugepaged: Introduce mTHP collapse support Nico Pache
2025-10-27  6:28   ` Baolin Wang
2025-11-09  2:08   ` Wei Yang
2025-11-11 21:56     ` Nico Pache
2025-11-19 11:53   ` Lorenzo Stoakes
2025-11-19 12:08     ` Lorenzo Stoakes
2025-11-20 22:32     ` Nico Pache
2025-10-22 18:37 ` [PATCH v12 mm-new 13/15] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2025-11-09  2:40   ` Wei Yang
2025-11-17 18:16     ` Nico Pache
2025-11-18  2:00       ` Wei Yang
2025-11-19 12:05   ` Lorenzo Stoakes
2025-11-26 23:16     ` Nico Pache
2025-11-26 23:29     ` Nico Pache
2025-10-22 18:37 ` [PATCH v12 mm-new 14/15] khugepaged: run khugepaged for all orders Nico Pache
2025-11-19 12:13   ` Lorenzo Stoakes
2025-11-20  6:37     ` Baolin Wang
2025-10-22 18:37 ` [PATCH v12 mm-new 15/15] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2025-10-22 19:52   ` Christoph Lameter (Ampere)
2025-10-22 20:22     ` David Hildenbrand
2025-10-23  8:00       ` Lorenzo Stoakes
2025-10-23  8:44         ` Pedro Falcato
2025-10-24 13:54           ` Zach O'Keefe
2025-10-23 23:41       ` Christoph Lameter (Ampere)
2025-10-22 20:13 ` [PATCH v12 mm-new 00/15] khugepaged: mTHP support Andrew Morton
