From: Ryan Roberts <ryan.roberts@arm.com>
To: John Hubbard <jhubbard@nvidia.com>,
Matthew Wilcox <willy@infradead.org>,
Yang Shi <shy828301@gmail.com>,
"Yin, Fengwei" <fengwei.yin@intel.com>,
Yu Zhao <yuzhao@google.com>, Zi Yan <ziy@nvidia.com>,
David Hildenbrand <david@redhat.com>,
David Rientjes <rientjes@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@suse.cz>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Linux-MM <linux-mm@kvack.org>
Subject: Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
Date: Mon, 25 Sep 2023 09:51:52 +0100
Message-ID: <92937776-1e16-47e5-bef9-4c1a04bc98c0@arm.com>
In-Reply-To: <7301771f-d654-4e5a-a197-3a3d8750440c@nvidia.com>
On 23/09/2023 01:33, John Hubbard wrote:
> On 9/22/23 08:48, Ryan Roberts wrote:
> ...
>> I never had any feedback on the below; I'm not sure if that means everyone is
>> happy or that nobody read it??
>
> One can never really know: zero or more people read it, and of those, no
> one hated it enough to send out a quick NAK. So that's a *possible*,
> lukewarm endorsement of sorts. Success! :)
You really know how to fill a guy with confidence! ;-)
>
> ...
>
>> BUT I've had yet another idea on the controls front, which would enable exposing
>> this to user space as an extension to transparent_hugepage, while continuing to
>> support THP as is and also be able to control THP and ALF (anon large folio)
>
> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
> is a nice touch.
Well if we go the route of the newest proposal, then I guess the naming is less
important, because it all attaches to transparent_hugepage.
>
> ...
>
>> Add 2 controls to sysfs:
>>
>> /sys/kernel/mm/transparent_hugepage/anon_orders
>> - bitfield where set bits are orders that will be tried during allocation
>> - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
>> - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
>> - To enable ALF, set the appropriate lower bits
>> - To disable THP, clear 1<<PMD_ORDER
>> - (In future we could add an "auto" option too)
>>
>> /sys/kernel/mm/transparent_hugepage/anon_always_mask
>> - orders in (anon_orders & anon_always_mask) are not subject to madvise
>> - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
>> as if enabled=always
>> - defaults to 0 (all subject to madvise)
>>
>
> I *think* I like this a lot,
On the weight of this lukewarm endorsement, I'm going to code it up and aim to
post something for discussion by the end of this week. ;-)
> although I have some clarifying question
> below. It seems to address the key things that have been complicating
> the discussions: the API is now looking more flexible, and yet still
> easy to understand and reason about. Nice.
>
> A couple of questions about how this works:
>
>>
>> The defaults for those controls give you "legacy THP". But you can modify the
>> controls to generate policies like this:
>>
>>
>
> For these tables, a small key or legend would help. I've forgotten already
> what "S" means, and am also vague about exactly what "THP>ALF>S" behavior
> means, too.
THP:
  transparent hugepage allocation; specifically PMD sized/aligned/mapped.
ALF:
  anonymous large folio allocation; specifically some order between
  [PMD_ORDER-1, 1]. Always PTE-mapped.
S:
  single page allocation; order-0, always PTE-mapped.

I've found these discrete logical buckets useful for thinking about the problem,
although the implementation doesn't always treat them completely separately (S
is just a final fallback order in ALF's list of orders to try) and the new
proposal exposes both THP and ALF through a unified THP interface.

The '>' indicates 'fallback'. Fallback happens for a few different reasons: the
VMA is too small to contain the proposed folio order, or some PTEs that would be
covered by the new folio are already populated, etc. ALF usually isn't just a
single order either - it has a list of orders that it will try.

Possibly all a bit confusing, but this is the nomenclature I've been using in
the context of all the discussions so far and I wanted to keep things
comparable.
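
To make the '>' fallback a bit more concrete, here is a minimal user-space
sketch of walking a bitfield of orders from largest to smallest. It is not the
proposed kernel code: pick_order() and order_fits_vma() are hypothetical names,
PAGE_SHIFT=12 and PMD_ORDER=9 assume 4K base pages with a 2M PMD, and only the
"VMA is too small" reason is modelled (the "PTEs already populated" check is
left out).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumptions for illustration: 4K base pages and a 2M PMD. */
#define PAGE_SHIFT 12
#define PMD_ORDER  9

/*
 * True if the naturally aligned folio of the given order that covers addr
 * lies entirely inside the VMA [vm_start, vm_end).
 */
static bool order_fits_vma(unsigned long addr, unsigned long vm_start,
                           unsigned long vm_end, int order)
{
    unsigned long size = 1UL << (PAGE_SHIFT + order);
    unsigned long start = addr & ~(size - 1);

    return start >= vm_start && size <= vm_end - start;
}

/*
 * Try each order set in the bitfield, highest first, and return the first
 * one that fits. Returning 0 is the final fallback to a single page ("S").
 */
static int pick_order(unsigned long orders, unsigned long addr,
                      unsigned long vm_start, unsigned long vm_end)
{
    assert(vm_start <= addr && addr < vm_end);

    for (int order = PMD_ORDER; order > 0; order--) {
        if (!(orders & (1UL << order)))
            continue;
        if (order_fits_vma(addr, vm_start, vm_end, order))
            return order;
    }
    return 0;
}
```

With orders = (1<<9)|(1<<4)|(1<<2), a fault in a VMA that cannot hold an aligned
2M folio would fall through to order 4, then order 2, then a single page, which
is what "THP>ALF>S" is meant to express.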
>
>> THP only - existing behaviour (default):
>> ----------------------------------------
>>
>> anon_orders = 1<<PMD_ORDER
>> anon_always_mask = 0
>>
>> thp prctl: | dis | ena | ena | ena
>
> All I see in the prctl(2) man page is PR_SET_THP_DISABLE, I don't
> see any _ENABLE. What does the above refer to?
dis: PR_SET_THP_DISABLE with arg2=1 (thp disabled via prctl)
ena: PR_SET_THP_DISABLE with arg2=0 (thp not disabled via prctl)
I was trying to illustrate that ALF is now also affected by this prctl. With the
previous proposal it was independent of THP and therefore independent of this
prctl. Of course it would still be _possible_ to ignore this control for the ALF
orders, but I think that risks being very confusing for users.
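
As a rough illustration of how the controls would combine, the sketch below
computes the set of orders that may be tried for a VMA. It is a user-space
model only, not the patch: allowed_anon_orders() and the two enums are
hypothetical names, PMD_ORDER=9 assumes 4K base pages, and treating
MADV_NOHUGEPAGE and the prctl as disabling every order (including those in
anon_always_mask) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_ORDER 9 /* assumption: 4K base pages, 2M PMD */

/* Hypothetical stand-ins for the sysfs "enabled" setting and the VMA hint. */
enum thp_enabled { THP_NEVER, THP_MADVISE, THP_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/*
 * Return the bitfield of orders that may be tried for a fault.
 * An empty result means only single pages ("S") are allocated.
 */
static unsigned long allowed_anon_orders(bool prctl_disabled,
                                         enum thp_enabled enabled,
                                         enum vma_hint hint,
                                         unsigned long anon_orders,
                                         unsigned long anon_always_mask)
{
    /* For now 1<<PMD_ORDER is the highest settable bit. */
    assert((anon_orders >> (PMD_ORDER + 1)) == 0);

    /* Assumption: these three switches turn off all orders. */
    if (prctl_disabled || enabled == THP_NEVER || hint == HINT_NOHUGEPAGE)
        return 0;

    /* enabled=always, or enabled=madvise with MADV_HUGEPAGE. */
    if (enabled == THP_ALWAYS || hint == HINT_HUGEPAGE)
        return anon_orders;

    /* enabled=madvise without a hint: only the "always" orders remain. */
    return anon_orders & anon_always_mask;
}
```

With the defaults (anon_orders = 1<<PMD_ORDER, anon_always_mask = 0) this
reproduces the "THP only" table. Under the same 4K-page assumption, allowing
64K folios alongside 2M THP would mean anon_orders = (1<<9)|(1<<4) = 0x210.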
>
>
>> thp sysfs: | X | never | madvise | always
>> ----------------------|-----------|-----------|-----------|-------------
>> no hint | S | S | S | THP>S
>> MADV_HUGEPAGE | S | S | THP>S | THP>S
>> MADV_NOHUGEPAGE | S | S | S | S
>>
>>
> ...
>>
>> It does have the disadvantage that ALF is tied to MADV_HUGEPAGE, whereas the
>
> Right, that is a little awkward. But maybe less so now, with this new proposal,
> which leaves THP a little closer to ALF.
Indeed, this approach makes it clearer/easier for users to understand, because
conceptually we are just introducing a wider set of folio sizes that THP can use
and all the existing THP controls continue to mean what they always meant.

The only risk I see is if there are workloads that want to use both (PMD) THP
and ALF, but in different VMAs, and they absolutely do not want the possibility
of ALF in the (PMD) THP area if THP fails, and instead always fall back to Single
allocations for that VMA. But that sounds very niche to me, and would be better
solved by the additional (future) introduction of a set of allowed orders that
can be attached to a specific VMA.

There are a couple of other wrinkles that I didn't highlight in my first mail:

 - khugepaged will continue to work only on PMD-sized THP. It will ignore the new
   ALF orders. This was always the plan, but if exposing the ALF functionality
   through the THP interface to user space, does that make it confusing? I don't
   think it's a big issue personally. And we can always enhance khugepaged to work
   on <PMD_ORDER folios later if we find a compelling reason.

 - We will want to name new counters following THP naming, not large folio. I
   propose that the existing AnonHugePages type counters will count ALL THP (i.e.
   PMD order and ALF orders), and additionally add 2 new counters for PMD-mapped
   and PTE-mapped, which should sum to the value in the original counter.
   Hopefully that makes things clear while retaining back compat.
>
>
> thanks,