From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Zi Yan <ziy@nvidia.com>
Cc: Usama Arif <usamaarif642@gmail.com>,
	david@redhat.com, Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, hannes@cmpxchg.org, shakeel.butt@linux.dev,
	riel@surriel.com, baolin.wang@linux.alibaba.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, hughd@google.com, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, kernel-team@meta.com,
	Juan Yescas <jyescas@google.com>,
	Breno Leitao <leitao@debian.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
Date: Mon, 9 Jun 2025 20:40:36 +0100
Message-ID: <f980e652-8e2a-41da-af9b-39fdd439fefc@lucifer.local>
In-Reply-To: <AA2C4D68-B1DC-48A6-A807-56516067B9C7@nvidia.com>

On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>
> > On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
> >> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
> >>
> >>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
> >
> > [snip]
> >
> >>>> So I guess the question is what should be the next step? The following has been discussed:
> >>>>
> >>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
> >>>>   and might have unintended consequences if done at runtime, so a no-go?
> >>>> - Decouple only the watermark calculation and defrag granularity from pageblock order (also from Zi).
> >>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
> >>>>   approach taken in this RFC, although the max order used by the pagecache needs to be addressed.
> >>>>
> >>>
> >>> I need to catch up with the thread (workload crazy atm), but why isn't it
> >>> feasible to simply statically adjust the pageblock size?
> >>>
> >>> The whole point of 'defragmentation' is to _heuristically_ make it less
> >>> likely there'll be fragmentation when requesting page blocks.
> >>>
> >>> And the watermark code is explicitly about providing reserves at a
> >>> _pageblock granularity_.
> >>>
> >>> Why would we want to 'defragment' to 512MB physically contiguous chunks
> >>> that we rarely use?
> >>>
> >>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
> >>> level no?
> >>
> >> What is a sensible level? 2MB is a good starting point. If we cap pageblocks
> >> at 2MB, everyone should be happy at the moment. But if one user wants to
> >> allocate 4MB mTHPs, they will most likely fail miserably, because the pageblock
> >> is 2MB, and the kernel is OK with having a 2MB MIGRATE_MOVABLE pageblock next
> >> to a 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
> >>
> >> Defragmentation has two components: 1) pageblock, which has migratetypes
> >> to prevent mixing movable and unmovable pages, as a single unmovable page
> >> blocks large free pages from being created; 2) memory compaction granularity,
> >> which is the actual work of moving pages around to form large free pages.
> >> Currently, the kernel assumes pageblock size = defragmentation granularity,
> >> but in reality, as long as pageblock size >= defragmentation granularity,
> >> memory compaction will still work, but not the other way around. So we
> >> need to choose the pageblock size carefully to not break memory compaction.
> >
> > OK I get it - the issue is that compaction itself operates at a pageblock
> > granularity, and once you get so fragmented that compaction is critical to
> > defragmentation, you are stuck if the pageblock is not big enough.
>
> Right.
>
> >
> > Thing is, a 512MB pageblock size for compaction seems insanely inefficient in
> > itself, and if we're complaining about issues with unavailable reserved
> > memory due to a crazy PMD size, surely one will encounter the compaction
> > process simply failing to succeed/taking forever/causing issues with
> > reclaim/higher-order folio allocation.
>
> Yep. Initially, we probably never thought PMD THP would be as large as
> 512MB.

Of course, such is the 'organic' nature of kernel development :)

>
> >
> > I mean, I don't really know the compaction code _at all_ (ran out of time
> > to cover it in the book ;), but is it all-or-nothing? Does it grab a
> > pageblock or give up?
>
> compaction works on one pageblock at a time, trying to migrate the in-use pages
> within the pageblock away to create a free page for THP allocation.
> It assumes the PMD THP size is equal to the pageblock size. It will keep working
> until a PMD-THP-sized free page is created. This is a very high level
> description, omitting a lot of details like how to avoid excessive compaction
> work and how to reduce compaction latency.

Yeah this matches my assumptions.
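
To spell that out for anyone skimming the thread - a very rough sketch of
the per-pageblock loop (this is NOT the real mm/compaction.c logic, which
has separate migrate/free scanners, skip bits, deferral and so on; the
helpers marked below are made up purely for illustration):

	/*
	 * Rough sketch of compaction's per-pageblock behaviour. The real
	 * code lives in mm/compaction.c and is considerably more involved.
	 */
	for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone);
	     pfn += pageblock_nr_pages) {
		/* Unmovable pageblocks cannot usefully be compacted. */
		if (get_pageblock_migratetype(pfn_to_page(pfn)) == MIGRATE_UNMOVABLE)
			continue;
		/* Migrate every in-use page in this pageblock elsewhere. */
		migrate_pageblock_contents(pfn);	/* made-up helper */
		/* Stop once a pageblock_order free page has formed. */
		if (zone_has_free_page(zone, pageblock_order))	/* made-up helper */
			break;
	}

And bear in mind that with 64KB base pages a PMD maps 8192 pages, so each
iteration here potentially means migrating up to 512MB of data to satisfy
a single THP allocation.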

>
> >
> > Because it strikes me that a crazy pageblock size would cause really
> > serious system issues on that basis alone if that's the case.
> >
> > And again this leads me back to thinking it should just be the page block
> > size _as a whole_ that should be adjusted.
> >
> > Keep in mind a user can literally reduce the page block size already via
> > CONFIG_PAGE_BLOCK_MAX_ORDER.
> >
> > To me it seems that we should cap it at the highest _reasonable_ mTHP size
> > you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
> > system.
> >
> > That way, people _can still get_ super huge PMD-sized folios up to the
> > point of fragmentation.
> >
> > If we do reduce things this way we should give a config option to allow
> > users who truly want colossal PMD sizes with associated
> > watermarks/compaction to be able to still have it.
> >
> > CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>
> I agree with capping pageblock size at the highest reasonable mTHP size.
> In case there is some user relying on huge PMD THPs, making
> pageblock size a boot-time variable might be a little better, since
> they would not need to recompile the kernel for their needs, assuming
> distros will pick something like 2MB as the default pageblock size.

Right, this seems sensible, as long as we set a _default_ that limits it to
whatever it would be, 2MB or such.

I don't think it's unreasonable to make that change since this 512 MB thing
is so entirely unexpected and unusual.

I think Usama said it would be a pain working this way if it had to be
explicitly set as a boot-time variable without defaulting like this.
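
If we did go boot-time, I'd imagine it being something as small as this
(entirely hypothetical - no such parameter exists today, pageblock_order
would have to become a writable variable, and the assignment would have to
happen before the zone migratetype bitmaps are sized):

	/* Hypothetical boot-time override, clamped to the compile-time cap. */
	static int __init set_pageblock_order(char *buf)
	{
		unsigned int order;

		if (kstrtouint(buf, 0, &order))
			return -EINVAL;
		/* Cap derived from CONFIG_PAGE_BLOCK_MAX_ORDER. */
		pageblock_order = min_t(unsigned int, order, PAGE_BLOCK_MAX_ORDER);
		return 0;
	}
	early_param("pageblock_order", set_pageblock_order);

With a distro default of 2MB, as you suggest, nobody would need to touch it.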

>
> >
> > I also question this de-coupling in general (I may be missing something
> > however!) - the watermark code _very explicitly_ refers to providing
> > _pageblocks_ in order to ensure _defragmentation_ right?
>
> Yes. Without enough free memory (more than a PMD THP's worth),
> memory compaction will just do useless work.

Yeah right, so this is a key thing and why we need to rework the current
state of the patch.
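
For anyone who hasn't looked at it recently - the khugepaged side of this,
set_recommended_min_free_kbytes() in mm/khugepaged.c, scales
min_free_kbytes with pageblock_nr_pages, roughly like the below (I'm
paraphrasing from memory, so treat the exact constants as approximate):

	/*
	 * Paraphrase of set_recommended_min_free_kbytes(): two free
	 * pageblocks per populated zone to assist fragmentation avoidance,
	 * plus MIGRATE_PCPTYPES^2 (= 9) more for migratetype fallbacks,
	 * i.e. ~11 pageblocks per zone, clamped to ~5% of memory.
	 */
	recommended_min  = pageblock_nr_pages * nr_zones * 2;
	recommended_min += pageblock_nr_pages * nr_zones *
			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
	recommended_min  = min(recommended_min,
			       (unsigned long)nr_free_buffer_pages() / 20);

With 2MB pageblocks and, say, two populated zones that's ~44MB - noise.
With 512MB pageblocks it's ~11GB before the clamp even kicks in, which is
precisely the 'unusable free memory' being complained about.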

>
> >
> > We would need to absolutely justify why it's suddenly ok to not provide
> > page blocks here.
> >
> > This is very very delicate code we have to be SO careful about.
> >
> > This is why I am being cautious here :)
>
> Understood. In theory, we can associate watermarks with the allowed THP orders
> the other way around too, meaning if the user lowers vm.min_free_kbytes,
> all THP/mTHP sizes bigger than the watermark threshold are disabled
> automatically. This could fix the memory compaction issues, but
> that might also drive users crazy, as they cannot use the THP sizes
> they want.

Yeah that's interesting but I think that's just far too subtle and people will
have no idea what's going on.

I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
go (but overrideable for people crazy enough to truly want 512 MB pages - and
who cannot then complain about watermarks).
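
Concretely, something along these lines (entirely hypothetical - the option
name, and whether this can remain a compile-time constant at all, are up
for bikeshedding):

	/*
	 * Hypothetical: clamp pageblock_order via a Kconfig hard limit,
	 * CONFIG_PAGE_BLOCK_HARD_LIMIT_MB (which does not exist today).
	 * With 64KB pages and a 2MB limit this yields order 5, i.e.
	 * 32 x 64KB = 2MB pageblocks.
	 */
	#define PAGE_BLOCK_HARD_LIMIT_ORDER \
		(ilog2(CONFIG_PAGE_BLOCK_HARD_LIMIT_MB) + 20 - PAGE_SHIFT)

	#define pageblock_order \
		min(HPAGE_PMD_ORDER, PAGE_BLOCK_HARD_LIMIT_ORDER)

Anyone who truly wants 512MB pageblocks sets the limit to 512 and accepts
the watermark/compaction consequences.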

>
> Often, users just ask for an impossible combination: they
> want to use all free memory, because they paid for it, and they
> want THPs, because they want max performance. When PMD THP is
> small, like 2MB, the “unusable” free memory is not that noticeable,
> but when PMD THP is as large as 512MB, users just cannot unsee it. :)

Well, users asking for crazy things then being surprised when they get them
is nothing new :P

>
>
> Best Regards,
> Yan, Zi

Thanks for your input!

Cheers, Lorenzo

