ANON_LARGE_FOLIOS meeting follow-up & refined proposal

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* ANON_LARGE_FOLIOS meeting follow-up & refined proposal
@ 2023-09-14  8:16 Ryan Roberts
  2023-09-22 15:48 ` Ryan Roberts
  2023-09-26 18:26 ` David Hildenbrand
  0 siblings, 2 replies; 16+ messages in thread
From: Ryan Roberts @ 2023-09-14  8:16 UTC (permalink / raw)
  To: Matthew Wilcox, Yang Shi, Yin, Fengwei, Yu Zhao, Zi Yan,
	David Hildenbrand, David Rientjes, Andrew Morton,
	Vlastimil Babka, John Hubbard, Kirill A. Shutemov
  Cc: Linux-MM

Hi All,

Thanks for participating in the discussion yesterday - it finally feels like we
are converging on the MVP feature set. Below are my notes from the call and a
modified proposal for controls and stats. It would be great if we can continue
to review and refine over email. I'm also planning to post an implementation
within the next couple of weeks, which I hope will also accelerate convergence.

Roadmap
-------

Stage 1: (MVP) Propose to add minimal runtime controls and stats (as outlined
below). There were no disagreements on the call about this feature set being
either too little or too big for the initial submission.

Stage 2: Focus on decreasing memory wastage. Plan A will attempt to do this
automatically within the kernel (I highlighted some ideas in the slide pack
which we didn't get time to cover). Plan B is to add more fine-grained controls
to to fine tune things at memcg/process/vma level (TBD). I'm not covering this
stage in this email.

Naming
------

We may add large folio support to shmem in future, which may need some separate
controls (TBD). As a result, consensus was to have  generic name "large folio",
which is specialized for anon memory. Then in future it could also be
specialized for shmem.

I'm going to reflect this in the kernel naming by changing LARGE_ANON_FOLIO to
ANON_LARGE_FOLIO, that way it makes "LARGE_FOLIO" grepable.

I'm also reflecting this in the sysfs controls. I'll create a directory
'/sys/kernel/mm/large_folio' as the root. Within that there are 2 main options:

- Put shared controls directly in this directory. Add a sub-directory 'anon' for
  anon-specific controls (and in future 'shmem'...)
- Put all controls in the root directory and prefix the filename for
  anon-specific controls with 'anon' (e.g. anon_enabled).

Given I don't think there will be many anon-specific controls (1 for now), and
THP already uses the latter scheme, I'm proposing to go with the latter.

Controls
--------

Modified proposal, after discussion yesterday:

- boot_param: anon_large_folio
    - =always|never|madvise
    - sets boot-up default for large_folio/anon_enabled
- sysfs: /sys/kernel/mm/large_folio/anon_enabled
    - =always|never|madvise
- sysfs: /sys/kernel/mm/large_folio/defrag
    - =always|defer|defer+madvise|madvise|never
    - Anticipate would be shared between anon and shmem if shmem added
        - this is already true for THP
    - Kirill suggested to drop and hardcode to "never" (GFP_TRANSHUGE_LIGHT)
	- Yu previously commented GFP_TRANSHUGE_LIGHT isn't always ideal [1]
	- So current series is hooking THP's defrag setting
	- Given we want to separate THP and LAF, I'm proposing to keep it
- debugfs: /sys/kernel/debug/mm/large_folio/anon_orders
    - Comma-separated, descending list of orders to try
    - Default: arch_wants_pte_order(),PAGE_ALLOC_COSTLY_ORDER
    - 0 always implicitly appended to end
    - Max allowed is PMD_ORDER-1
    - intended for developers to experiment
    - debugfs means we can change/remove it or promote it to sysfs later
- MADV_NOHUGEPAGE is honored; LAF disabled for these VMAs
    - Required for correctness of existing use cases (live migration post copy)
- New MADV_LARGEFOLIO madvise opcode
    - Like MADV_HUGEPAGE but for large folio

Optional:

DavidR suggested adding ability to set a VMA-specific LAF order, using
process_madvise():
    - Optionally accept LAF order through flags param of 
      process_madvise(MADV_LARGEFOLIO)
    - When no LAF order passed, (or called with madvise()) use global LAF order

Personally, I would prefer to avoid vma-specific laf order for an initial
submission and instead defer the addition until clear need is identified.
Thoughts?

Stats
-----

meminfo:AnonHugePages, smaps:AnonHugePages and memory.stat:anon_thp will
continue to account THP only.

I plan to add meminfo:AnonLargeFolio, smaps:AnonLargeFolio and
memory.stat:anon_large_folio to account LAFs.

Do I need to add counters to vmstat also? (e.g. large_folio_fault_alloc,
large_folio_fault_fallback, etc) - would need to think about which counters and
what they mean if so.

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/CAOUHufYWtsAU4PvKpVhzJUeQb9cd+BifY9KzgceBXHp2F2dDRg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-14  8:16 ANON_LARGE_FOLIOS meeting follow-up & refined proposal Ryan Roberts
@ 2023-09-22 15:48 ` Ryan Roberts
  2023-09-23  0:33   ` John Hubbard
                     ` (2 more replies)
  2023-09-26 18:26 ` David Hildenbrand
  1 sibling, 3 replies; 16+ messages in thread
From: Ryan Roberts @ 2023-09-22 15:48 UTC (permalink / raw)
  To: Matthew Wilcox, Yang Shi, Yin, Fengwei, Yu Zhao, Zi Yan,
	David Hildenbrand, David Rientjes, Andrew Morton,
	Vlastimil Babka, John Hubbard, Kirill A. Shutemov
  Cc: Linux-MM

On 14/09/2023 09:16, Ryan Roberts wrote:
> Hi All,
> 
> Thanks for participating in the discussion yesterday - it finally feels like we
> are converging on the MVP feature set. Below are my notes from the call and a
> modified proposal for controls and stats. It would be great if we can continue
> to review and refine over email. I'm also planning to post an implementation
> within the next couple of weeks, which I hope will also accelerate convergence.

I never had any feedback on the below; I'm not sure if that means everyone is
happy or that nobody read it??

I've got an implementation for all of the below ready to go, with a few tweeks
to the details (the main change is that anon_orders is now a bitfield where each
set bit represents an order in the set, rather than the originally proposed
comma-separated list of orders).

BUT I've had yet another idea on the controls front, which would enable exposing
this to user space as an extension to transparent_hugepage, while continuing to
support THP as is and also be able to control THP and ALF (anon large folio)
usage independently. On reflection, I think it is cleaner to do it this way for
a couple of reasons:

  - We don't have to introduce a whole new feature (ALF) to the user. Most of
    the concepts and controls overlap a lot with THP anyway, so if we can make
    it look like an extension, I think it would be easier to communicate.

  - The approach I have in mind would make it easy to extend to orders greater
    than PMD_ORDER in future if that's a direction we want to eventually go.
    Because >PMD_ORDER implies multiple PMD entries, it would half belong to THP
    and half belong to ALF in the current proposal, which is nasty.

I'll lay out the new proposal now, but I suspect this will ultimately warrant
another mm alignment meeting...


Add 2 controls to sysfs:

/sys/kernel/mm/transparent_hugepage/anon_orders
  - bitfield where set bits are orders that will be tried during allocation
  - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
  - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
  - To enable ALF, set the appropriate lower bits
  - To disable THP, clear 1<<PMD_ORDER
  - (In future we could add an "auto" option too)

/sys/kernel/mm/transparent_hugepage/anon_always_mask
  - orders in (anon_orders & anon_always_mask) are not subject to madvise
  - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
    as if enabled=always
  - defaults to 0 (all subject to madvise)


The defaults for those controls give you "legacy THP". But you can modify the
controls to generate policies like this:


THP only - existing behaviour (default):
----------------------------------------

anon_orders = 1<<PMD_ORDER
anon_always_mask = 0

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | S         | THP>S
MADV_HUGEPAGE         | S         | S         | THP>S     | THP>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


ALF only:
---------

anon_orders = 1<<3 (order-3 - example)
anon_always_mask = 0

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | S         | ALF>S
MADV_HUGEPAGE         | S         | S         | ALF>S     | ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


THP and ALF:
------------

anon_orders = 1<<PMD_ORDER | 1<<3
anon_always_mask = 0 (default)

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | S         | THP>ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


THP and ALF, with THP=always, ALF=advise:
-----------------------------------------

anon_orders = 1<<PMD_ORDER | 1<<3
anon_always_mask = 1<<PMD_ORDER

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | THP>S     | THP>ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


THP and ALF, with THP=madvise, ALF=always:
------------------------------------------

anon_orders = 1<<PMD_ORDER | 1<<3
anon_always_mask = 1<<3

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | ALF>S     | THP>ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


It does have the disadvantage that ALF is tied to MADV_HUGEPAGE, whereas the
below approach introduces a new, independent MADV_LARGEFOLIO. But personally I
don't see that as a major issue. And we could solve it in future by extending
MADV_HUGEPAGE to add a vma-specific set of orders, via the process_madvise flags.

Thoughts?

I'll hold off posting the implementation of the below for now, while we decide
if its better to head in this direction.

Thanks,
Ryan



> 
> 
> Roadmap
> -------
> 
> Stage 1: (MVP) Propose to add minimal runtime controls and stats (as outlined
> below). There were no disagreements on the call about this feature set being
> either too little or too big for the initial submission.
> 
> Stage 2: Focus on decreasing memory wastage. Plan A will attempt to do this
> automatically within the kernel (I highlighted some ideas in the slide pack
> which we didn't get time to cover). Plan B is to add more fine-grained controls
> to to fine tune things at memcg/process/vma level (TBD). I'm not covering this
> stage in this email.
> 
> 
> Naming
> ------
> 
> We may add large folio support to shmem in future, which may need some separate
> controls (TBD). As a result, consensus was to have  generic name "large folio",
> which is specialized for anon memory. Then in future it could also be
> specialized for shmem.
> 
> I'm going to reflect this in the kernel naming by changing LARGE_ANON_FOLIO to
> ANON_LARGE_FOLIO, that way it makes "LARGE_FOLIO" grepable.
> 
> I'm also reflecting this in the sysfs controls. I'll create a directory
> '/sys/kernel/mm/large_folio' as the root. Within that there are 2 main options:
> 
> - Put shared controls directly in this directory. Add a sub-directory 'anon' for
>   anon-specific controls (and in future 'shmem'...)
> - Put all controls in the root directory and prefix the filename for
>   anon-specific controls with 'anon' (e.g. anon_enabled).
> 
> Given I don't think there will be many anon-specific controls (1 for now), and
> THP already uses the latter scheme, I'm proposing to go with the latter.
> 
> 
> Controls
> --------
> 
> Modified proposal, after discussion yesterday:
> 
> - boot_param: anon_large_folio
>     - =always|never|madvise
>     - sets boot-up default for large_folio/anon_enabled
> - sysfs: /sys/kernel/mm/large_folio/anon_enabled
>     - =always|never|madvise
> - sysfs: /sys/kernel/mm/large_folio/defrag
>     - =always|defer|defer+madvise|madvise|never
>     - Anticipate would be shared between anon and shmem if shmem added
>         - this is already true for THP
>     - Kirill suggested to drop and hardcode to "never" (GFP_TRANSHUGE_LIGHT)
> 	- Yu previously commented GFP_TRANSHUGE_LIGHT isn't always ideal [1]
> 	- So current series is hooking THP's defrag setting
> 	- Given we want to separate THP and LAF, I'm proposing to keep it
> - debugfs: /sys/kernel/debug/mm/large_folio/anon_orders
>     - Comma-separated, descending list of orders to try
>     - Default: arch_wants_pte_order(),PAGE_ALLOC_COSTLY_ORDER
>     - 0 always implicitly appended to end
>     - Max allowed is PMD_ORDER-1
>     - intended for developers to experiment
>     - debugfs means we can change/remove it or promote it to sysfs later
> - MADV_NOHUGEPAGE is honored; LAF disabled for these VMAs
>     - Required for correctness of existing use cases (live migration post copy)
> - New MADV_LARGEFOLIO madvise opcode
>     - Like MADV_HUGEPAGE but for large folio
> 
> Optional:
> 
> DavidR suggested adding ability to set a VMA-specific LAF order, using
> process_madvise():
>     - Optionally accept LAF order through flags param of 
>       process_madvise(MADV_LARGEFOLIO)
>     - When no LAF order passed, (or called with madvise()) use global LAF order
> 
> Personally, I would prefer to avoid vma-specific laf order for an initial
> submission and instead defer the addition until clear need is identified.
> Thoughts?
> 
> 
> Stats
> -----
> 
> meminfo:AnonHugePages, smaps:AnonHugePages and memory.stat:anon_thp will
> continue to account THP only.
> 
> I plan to add meminfo:AnonLargeFolio, smaps:AnonLargeFolio and
> memory.stat:anon_large_folio to account LAFs.
> 
> Do I need to add counters to vmstat also? (e.g. large_folio_fault_alloc,
> large_folio_fault_fallback, etc) - would need to think about which counters and
> what they mean if so.
> 
> 
> Thanks,
> Ryan
> 
> 
> [1] https://lore.kernel.org/linux-mm/CAOUHufYWtsAU4PvKpVhzJUeQb9cd+BifY9KzgceBXHp2F2dDRg@mail.gmail.com/



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-22 15:48 ` Ryan Roberts
@ 2023-09-23  0:33   ` John Hubbard
  2023-09-25  8:51     ` Ryan Roberts
  2023-09-26  8:13   ` Kirill A. Shutemov
  2023-09-26 18:29   ` David Hildenbrand
  2 siblings, 1 reply; 16+ messages in thread
From: John Hubbard @ 2023-09-23  0:33 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox, Yang Shi, Yin, Fengwei, Yu Zhao,
	Zi Yan, David Hildenbrand, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov
  Cc: Linux-MM

On 9/22/23 08:48, Ryan Roberts wrote:
...
> I never had any feedback on the below; I'm not sure if that means everyone is
> happy or that nobody read it??

One can never really know: zero or more people read it, and of those, no
one hated it enough to send out a quick NAK. So that's a *possible*,
lukewarm endorsement of sorts. Success! :)

...

> BUT I've had yet another idea on the controls front, which would enable exposing
> this to user space as an extension to transparent_hugepage, while continuing to
> support THP as is and also be able to control THP and ALF (anon large folio)

The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
is a nice touch.

...

> Add 2 controls to sysfs:
> 
> /sys/kernel/mm/transparent_hugepage/anon_orders
>    - bitfield where set bits are orders that will be tried during allocation
>    - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
>    - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
>    - To enable ALF, set the appropriate lower bits
>    - To disable THP, clear 1<<PMD_ORDER
>    - (In future we could add an "auto" option too)
> 
> /sys/kernel/mm/transparent_hugepage/anon_always_mask
>    - orders in (anon_orders & anon_always_mask) are not subject to madvise
>    - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
>      as if enabled=always
>    - defaults to 0 (all subject to madvise)
> 

I *think* I like this a lot, although I have some clarifying question
below. It seems to address the key things that have been complicating
the discussions: the API is now looking more flexible, and yet still
easy to understand and reason about. Nice.

A couple of questions about how this works:

> 
> The defaults for those controls give you "legacy THP". But you can modify the
> controls to generate policies like this:
> 
> 

For these tables, a small key or legend would help. I've forgotten already
what "S" means, and am also vague about exactly what "THP>ALF>S" behavior
means, too.

> THP only - existing behaviour (default):
> ----------------------------------------
> 
> anon_orders = 1<<PMD_ORDER
> anon_always_mask = 0
> 
> thp prctl:            | dis       | ena       | ena       | ena

All I see in the prctl(2) man page is PR_SET_THP_DISABLE, I don't
see any _ENABLE. What does the above refer to?


> thp sysfs:            | X         | never     | madvise   | always
> ----------------------|-----------|-----------|-----------|-------------
> no hint               | S         | S         | S         | THP>S
> MADV_HUGEPAGE         | S         | S         | THP>S     | THP>S
> MADV_NOHUGEPAGE       | S         | S         | S         | S
> 
> 
...
> 
> It does have the disadvantage that ALF is tied to MADV_HUGEPAGE, whereas the

Right, that is a little awkward. But maybe less so now, with this new proposal,
which leaves THP a little closer to ALF.


thanks,
-- 
John Hubbard
NVIDIA



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-23  0:33   ` John Hubbard
@ 2023-09-25  8:51     ` Ryan Roberts
  2023-09-26 18:31       ` David Hildenbrand
  2023-09-26 18:34       ` David Hildenbrand
  0 siblings, 2 replies; 16+ messages in thread
From: Ryan Roberts @ 2023-09-25  8:51 UTC (permalink / raw)
  To: John Hubbard, Matthew Wilcox, Yang Shi, Yin, Fengwei, Yu Zhao,
	Zi Yan, David Hildenbrand, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov
  Cc: Linux-MM

On 23/09/2023 01:33, John Hubbard wrote:
> On 9/22/23 08:48, Ryan Roberts wrote:
> ...
>> I never had any feedback on the below; I'm not sure if that means everyone is
>> happy or that nobody read it??
> 
> One can never really know: zero or more people read it, and of those, no
> one hated it enough to send out a quick NAK. So that's a *possible*,
> lukewarm endorsement of sorts. Success! :)

You really know how to fill a guy with confidence! ;-)

> 
> ...
> 
>> BUT I've had yet another idea on the controls front, which would enable exposing
>> this to user space as an extension to transparent_hugepage, while continuing to
>> support THP as is and also be able to control THP and ALF (anon large folio)
> 
> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
> is a nice touch.

Well if we go the route of the newest proposal, then I guess the naming is less
important, because it all attaches to transparent_hugepage.

> 
> ...
> 
>> Add 2 controls to sysfs:
>>
>> /sys/kernel/mm/transparent_hugepage/anon_orders
>>    - bitfield where set bits are orders that will be tried during allocation
>>    - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
>>    - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
>>    - To enable ALF, set the appropriate lower bits
>>    - To disable THP, clear 1<<PMD_ORDER
>>    - (In future we could add an "auto" option too)
>>
>> /sys/kernel/mm/transparent_hugepage/anon_always_mask
>>    - orders in (anon_orders & anon_always_mask) are not subject to madvise
>>    - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
>>      as if enabled=always
>>    - defaults to 0 (all subject to madvise)
>>
> 
> I *think* I like this a lot, 

On the weight of this lukewarm endorsement, I'm going to code it up and aim to
post something for dicussion end of this week. ;-)

> although I have some clarifying question
> below. It seems to address the key things that have been complicating
> the discussions: the API is now looking more flexible, and yet still
> easy to understand and reason about. Nice.
> 
> A couple of questions about how this works:
> 
>>
>> The defaults for those controls give you "legacy THP". But you can modify the
>> controls to generate policies like this:
>>
>>
> 
> For these tables, a small key or legend would help. I've forgotten already
> what "S" means, and am also vague about exactly what "THP>ALF>S" behavior
> means, too.

THP:
    transparent hugepage allocation; specifically PMD sized/aligned/mapped.

ALF:
    anonymous large folio allocation; specifically some order between
    [PMD_ORDER-1, 1]. Always PTE-mapped.
S:
    single page allocation; order-0, always PTE-mapped.

I've found these discrete logical buckets useful for thinking about the problem,
although the implementation doesn't always treat them completely separately (S
is just a final fallback order in ALF's list of orders to try) and the new
proposal exposes both THP and ALF through a unified THP interface.

The '>' indicates 'fallback'. Fallback happens for a few different reasons; VMA
is too small to contain the proposed folio order, or some PTEs that would be
covered by the new folio are already populated, etc. ALF usually isn't just a
single order either - it has a list of orders that it will try.

Possibly all a bit confusing, but this is the nomenclature I've been using in
the context of all the discusions so far and wanted to try to keep things
comparable.


> 
>> THP only - existing behaviour (default):
>> ----------------------------------------
>>
>> anon_orders = 1<<PMD_ORDER
>> anon_always_mask = 0
>>
>> thp prctl:            | dis       | ena       | ena       | ena
> 
> All I see in the prctl(2) man page is PR_SET_THP_DISABLE, I don't
> see any _ENABLE. What does the above refer to?

dis: PR_SET_THP_DISABLE with arg2=1 (thp disabled via prctl)
ena: PR_SET_THP_DISABLE with arg2=0 (thp not disabled via prctl)

I was trying to illustrate that ALF is now also affected by this prctl. With the
previous proposal it was independent of THP and therefore independent of this
prctl. Of course it would still be _possible_ to ignore this control for the ALF
orders, but I think that risks being very confusing for users.

> 
> 
>> thp sysfs:            | X         | never     | madvise   | always
>> ----------------------|-----------|-----------|-----------|-------------
>> no hint               | S         | S         | S         | THP>S
>> MADV_HUGEPAGE         | S         | S         | THP>S     | THP>S
>> MADV_NOHUGEPAGE       | S         | S         | S         | S
>>
>>
> ...
>>
>> It does have the disadvantage that ALF is tied to MADV_HUGEPAGE, whereas the
> 
> Right, that is a little awkward. But maybe less so now, with this new proposal,
> which leaves THP a little closer to ALF.

Indeed, this approach makes it clearer/easier for users to understand, because
conceptually we are just introducing a wider set of folio sizes that THP can use
and all the existing THP controls continue to mean what they always meant.

The only risk I see is if there are workloads that want to use both (PMD) THP
and ALF, but in different VMAs, and they absolutely do not want the possibillity
of ALF in the (PMD) THP area if THP fails, and instead always fallback to Single
allocations for that VMA. But that sounds very niche to me. And would be better
solved by the additional (future) introduction of a set of allowed orders that
can be attached to a specific VMA.


There are a couple of other wrinkles that I didn't highlight in my first mail:

- khugepaged will continue to work only on PMD-sized THP. It will ignore the new
  ALF orders. This was always the plan, but if exposing the ALF functionality
  through THP interface to user space, does that make it confusing? I don't
  think its a big issue personally. And we can always enhance khugepaged to work
  on <PMD_ORDER folios later if we find a compelling reason.

- We will want to name new counters following THP naming, not large folio. I
  propose that the existing AnonHugePages type counters will count ALL THP (i.e.
  PMD order and ALF orders), and additionally add 2 new counters for PMD-mapped
  and PTE-mapped, which should sum to the value in the original counter.
  Hopefully that makes things clear while retaining back compat.


> 
> 
> thanks,



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-22 15:48 ` Ryan Roberts
  2023-09-23  0:33   ` John Hubbard
@ 2023-09-26  8:13   ` Kirill A. Shutemov
  2023-09-26 18:29   ` David Hildenbrand
  2 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2023-09-26  8:13 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Matthew Wilcox, Yang Shi, Yin, Fengwei, Yu Zhao, Zi Yan,
	David Hildenbrand, David Rientjes, Andrew Morton,
	Vlastimil Babka, John Hubbard, Linux-MM

On Fri, Sep 22, 2023 at 04:48:44PM +0100, Ryan Roberts wrote:
> Thoughts?

I do like like this approach more. I would like ALF to be subset of THP,
not a separate feature. We already have PTE-mapped THPs, so it feels like
a natural extension of the concept.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-14  8:16 ANON_LARGE_FOLIOS meeting follow-up & refined proposal Ryan Roberts
  2023-09-22 15:48 ` Ryan Roberts
@ 2023-09-26 18:26 ` David Hildenbrand
  1 sibling, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2023-09-26 18:26 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox, Yang Shi, Yin, Fengwei, Yu Zhao,
	Zi Yan, David Rientjes, Andrew Morton, Vlastimil Babka,
	John Hubbard, Kirill A. Shutemov
  Cc: Linux-MM

[sorry, busy with 100 different things]

> 
> DavidR suggested adding ability to set a VMA-specific LAF order, using
> process_madvise():
>      - Optionally accept LAF order through flags param of
>        process_madvise(MADV_LARGEFOLIO)
>      - When no LAF order passed, (or called with madvise()) use global LAF order
> 
> Personally, I would prefer to avoid vma-specific laf order for an initial
> submission and instead defer the addition until clear need is identified.
> Thoughts?

Yes, let's start with the simple cases.

> 
> 
> Stats
> -----
> 
> meminfo:AnonHugePages, smaps:AnonHugePages and memory.stat:anon_thp will
> continue to account THP only.

Agreed. One concern that was brought up that existing users associate 
"huge" with "pmd-size". So any thp toggles/stats should control/express 
exactly that for now.

> 
> I plan to add meminfo:AnonLargeFolio, smaps:AnonLargeFolio and
> memory.stat:anon_large_folio to account LAFs.
> 
> Do I need to add counters to vmstat also? (e.g. large_folio_fault_alloc,
> large_folio_fault_fallback, etc) - would need to think about which counters and
> what they mean if so.

Can probably be added later.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-22 15:48 ` Ryan Roberts
  2023-09-23  0:33   ` John Hubbard
  2023-09-26  8:13   ` Kirill A. Shutemov
@ 2023-09-26 18:29   ` David Hildenbrand
  2 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2023-09-26 18:29 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox, Yang Shi, Yin, Fengwei, Yu Zhao,
	Zi Yan, David Rientjes, Andrew Morton, Vlastimil Babka,
	John Hubbard, Kirill A. Shutemov
  Cc: Linux-MM

On 22.09.23 17:48, Ryan Roberts wrote:
> On 14/09/2023 09:16, Ryan Roberts wrote:
>> Hi All,
>>
>> Thanks for participating in the discussion yesterday - it finally feels like we
>> are converging on the MVP feature set. Below are my notes from the call and a
>> modified proposal for controls and stats. It would be great if we can continue
>> to review and refine over email. I'm also planning to post an implementation
>> within the next couple of weeks, which I hope will also accelerate convergence.
> 
> I never had any feedback on the below; I'm not sure if that means everyone is
> happy or that nobody read it??
> 
> I've got an implementation for all of the below ready to go, with a few tweeks
> to the details (the main change is that anon_orders is now a bitfield where each
> set bit represents an order in the set, rather than the originally proposed
> comma-separated list of orders).
> 
> BUT I've had yet another idea on the controls front, which would enable exposing
> this to user space as an extension to transparent_hugepage, while continuing to
> support THP as is and also be able to control THP and ALF (anon large folio)
> usage independently. On reflection, I think it is cleaner to do it this way for
> a couple of reasons:
> 
>    - We don't have to introduce a whole new feature (ALF) to the user. Most of
>      the concepts and controls overlap a lot with THP anyway, so if we can make
>      it look like an extension, I think it would be easier to communicate.
> 
>    - The approach I have in mind would make it easy to extend to orders greater
>      than PMD_ORDER in future if that's a direction we want to eventually go.
>      Because >PMD_ORDER implies multiple PMD entries, it would half belong to THP
>      and half belong to ALF in the current proposal, which is nasty.
> 
> I'll lay out the new proposal now, but I suspect this will ultimately warrant
> another mm alignment meeting...
> 
> 
> Add 2 controls to sysfs:
> 
> /sys/kernel/mm/transparent_hugepage/anon_orders
>    - bitfield where set bits are orders that will be tried during allocation
>    - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
>    - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
>    - To enable ALF, set the appropriate lower bits
>    - To disable THP, clear 1<<PMD_ORDER
>    - (In future we could add an "auto" option too)
> 

I recall discussing that in the past -- where we also thought about 
adding exactly that to debugfs -- but the consensus was that we should 
keep THP alone.

I agree at this point that we better keep THP alone.

THP might be a subset of ALF (at some point, hopefully), but not the 
other way around IMHO.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-25  8:51     ` Ryan Roberts
@ 2023-09-26 18:31       ` David Hildenbrand
  2023-09-27  7:23         ` Ryan Roberts
  2023-09-26 18:34       ` David Hildenbrand
  1 sibling, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2023-09-26 18:31 UTC (permalink / raw)
  To: Ryan Roberts, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov
  Cc: Linux-MM

On 25.09.23 10:51, Ryan Roberts wrote:
> On 23/09/2023 01:33, John Hubbard wrote:
>> On 9/22/23 08:48, Ryan Roberts wrote:
>> ...
>>> I never had any feedback on the below; I'm not sure if that means everyone is
>>> happy or that nobody read it??
>>
>> One can never really know: zero or more people read it, and of those, no
>> one hated it enough to send out a quick NAK. So that's a *possible*,
>> lukewarm endorsement of sorts. Success! :)
> 
> You really know how to fill a guy with confidence! ;-)
> 
>>
>> ...
>>
>>> BUT I've had yet another idea on the controls front, which would enable exposing
>>> this to user space as an extension to transparent_hugepage, while continuing to
>>> support THP as is and also be able to control THP and ALF (anon large folio)
>>
>> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
>> is a nice touch.
> 
> Well if we go the route of the newest proposal, then I guess the naming is less
> important, because it all attaches to transparent_hugepage.

I agree that ALF is better. But having something under "THP", that is 
not THP and not accounted as THP ... I don't quite like it (although, 
before we discussed that approach in the past, I did like it).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-25  8:51     ` Ryan Roberts
  2023-09-26 18:31       ` David Hildenbrand
@ 2023-09-26 18:34       ` David Hildenbrand
  1 sibling, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2023-09-26 18:34 UTC (permalink / raw)
  To: Ryan Roberts, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov
  Cc: Linux-MM

On 25.09.23 10:51, Ryan Roberts wrote:
> On 23/09/2023 01:33, John Hubbard wrote:
>> On 9/22/23 08:48, Ryan Roberts wrote:
>> ...
>>> I never had any feedback on the below; I'm not sure if that means everyone is
>>> happy or that nobody read it??
>>
>> One can never really know: zero or more people read it, and of those, no
>> one hated it enough to send out a quick NAK. So that's a *possible*,
>> lukewarm endorsement of sorts. Success! :)
> 
> You really know how to fill a guy with confidence! ;-)
> 
>>
>> ...
>>
>>> BUT I've had yet another idea on the controls front, which would enable exposing
>>> this to user space as an extension to transparent_hugepage, while continuing to
>>> support THP as is and also be able to control THP and ALF (anon large folio)
>>
>> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
>> is a nice touch.
> 
> Well if we go the route of the newest proposal, then I guess the naming is less
> important, because it all attaches to transparent_hugepage.
> 
>>
>> ...
>>
>>> Add 2 controls to sysfs:
>>>
>>> /sys/kernel/mm/transparent_hugepage/anon_orders
>>>     - bitfield where set bits are orders that will be tried during allocation
>>>     - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
>>>     - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
>>>     - To enable ALF, set the appropriate lower bits
>>>     - To disable THP, clear 1<<PMD_ORDER
>>>     - (In future we could add an "auto" option too)
>>>
>>> /sys/kernel/mm/transparent_hugepage/anon_always_mask
>>>     - orders in (anon_orders & anon_always_mask) are not subject to madvise
>>>     - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
>>>       as if enabled=always
>>>     - defaults to 0 (all subject to madvise)
>>>
>>
>> I *think* I like this a lot,
> 
> On the weight of this lukewarm endorsement, I'm going to code it up and aim to
> post something for dicussion end of this week. ;-)
> 
>> although I have some clarifying question
>> below. It seems to address the key things that have been complicating
>> the discussions: the API is now looking more flexible, and yet still
>> easy to understand and reason about. Nice.
>>
>> A couple of questions about how this works:
>>
>>>
>>> The defaults for those controls give you "legacy THP". But you can modify the
>>> controls to generate policies like this:
>>>
>>>
>>
>> For these tables, a small key or legend would help. I've forgotten already
>> what "S" means, and am also vague about exactly what "THP>ALF>S" behavior
>> means, too.
> 
> THP:
>      transparent hugepage allocation; specifically PMD sized/aligned/mapped.
> 
> ALF:
>      anonymous large folio allocation; specifically some order between
>      [PMD_ORDER-1, 1]. Always PTE-mapped.

^ and that's exactly not where we wanted to draw the line to be 
future-proof. Ideally we'd create something that is future proof, such 
that it could be extended to any folio sizes in the future.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-26 18:31       ` David Hildenbrand
@ 2023-09-27  7:23         ` Ryan Roberts
  2023-09-27 15:32           ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Ryan Roberts @ 2023-09-27  7:23 UTC (permalink / raw)
  To: David Hildenbrand, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov
  Cc: Linux-MM

On 26/09/2023 19:31, David Hildenbrand wrote:
> On 25.09.23 10:51, Ryan Roberts wrote:
>> On 23/09/2023 01:33, John Hubbard wrote:
>>> On 9/22/23 08:48, Ryan Roberts wrote:
>>> ...
>>>> I never had any feedback on the below; I'm not sure if that means everyone is
>>>> happy or that nobody read it??
>>>
>>> One can never really know: zero or more people read it, and of those, no
>>> one hated it enough to send out a quick NAK. So that's a *possible*,
>>> lukewarm endorsement of sorts. Success! :)
>>
>> You really know how to fill a guy with confidence! ;-)
>>
>>>
>>> ...
>>>
>>>> BUT I've had yet another idea on the controls front, which would enable
>>>> exposing
>>>> this to user space as an extension to transparent_hugepage, while continuing to
>>>> support THP as is and also be able to control THP and ALF (anon large folio)
>>>
>>> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
>>> is a nice touch.
>>
>> Well if we go the route of the newest proposal, then I guess the naming is less
>> important, because it all attaches to transparent_hugepage.
> 
> I agree that ALF is better. But having something under "THP", that is not THP
> and not accounted as THP ... I don't quite like it (although, before we
> discussed that approach in the past, I did like it).

I know we discussed and concluded against putting it under THP in the past, but
I think that decision was driven by not not having any proposal that would allow
us to put it under THP without breaking the expectations of existing (PMD-sized)
THP users, or not being able to control use of the lower orders and PMD-order.
Personally I think my latest proposal is a way to solve that problem, and in
that case, I personally think exposing it as an extension to THP is neater:

 - all existing THP controls work as they did before
 - new anon_orders and anon_always_mask files allow opt-in to
   smaller-than-PMD-orders
 - All exisitng counters remain unchanged, and continue to count PMD-mapped THP
   only:
      - /proc/meminfo:AnonHugePages
      - /sys/devices/system/node/nodeX/meminfo:AnonHugePages
      - /proc/vmstat:nr_anon_transparent_hugepages
      - /proc/<pid>/smaps[_roolup]:AnonHugePages
      - memory.stat(v1):rss_huge
      - memory.stat(v2):anon_thp
 - New counters introduced to count PTE-mapped THP/large folios:
      - /proc/meminfo:AnonHugePteMap
      - /sys/devices/system/node/nodeX/meminfo:AnonHugePteMap
      - /proc/vmstat:nr_anon_thp_pte
      - /proc/<pid>/smaps[_roolup]:AnonHugePteMap
      - memory.stat(v1):anon_thp_pte
      - memory.stat(v2):anon_thp_pte
 - It's a lot less code (I have an implementation for both approaches)

Admittedly, I haven't spent too much time thinking about the other thp counters
in vmstat yet (e.g. thp_fault_alloc, thp_fault_fallback, etc). Proposal is that
for now, they would continue to be PMD-order only. But I think you could
probably hook those upto the PTE-mapped ones as well, instead of duplicating all
the counters.

As Kiril mentioned, PTE-mapped THP is already a thing, so this approach just
formalises it.

I also think the "huge" means PMD-size argument is a bit weak, given that THP
supports PUD-size today for file mappings, and in the context of hugetlb, huge
can mean contpte, pmd, contpmd, pud, etc.

I'll have the patch set ready to post by Friday. How about I post it, then we
can continue the conversation in the context of the actual code? If the
concensus is that this is not the way to do it, then I'll post the large_folio
version instead?

Thanks,
Ryan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-27  7:23         ` Ryan Roberts
@ 2023-09-27 15:32           ` David Hildenbrand
  2023-09-27 19:04             ` Ryan Roberts
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2023-09-27 15:32 UTC (permalink / raw)
  To: Ryan Roberts, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Hugh Dickins
  Cc: Linux-MM

On 27.09.23 09:23, Ryan Roberts wrote:
> On 26/09/2023 19:31, David Hildenbrand wrote:
>> On 25.09.23 10:51, Ryan Roberts wrote:
>>> On 23/09/2023 01:33, John Hubbard wrote:
>>>> On 9/22/23 08:48, Ryan Roberts wrote:
>>>> ...
>>>>> I never had any feedback on the below; I'm not sure if that means everyone is
>>>>> happy or that nobody read it??
>>>>
>>>> One can never really know: zero or more people read it, and of those, no
>>>> one hated it enough to send out a quick NAK. So that's a *possible*,
>>>> lukewarm endorsement of sorts. Success! :)
>>>
>>> You really know how to fill a guy with confidence! ;-)
>>>
>>>>
>>>> ...
>>>>
>>>>> BUT I've had yet another idea on the controls front, which would enable
>>>>> exposing
>>>>> this to user space as an extension to transparent_hugepage, while continuing to
>>>>> support THP as is and also be able to control THP and ALF (anon large folio)
>>>>
>>>> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
>>>> is a nice touch.
>>>
>>> Well if we go the route of the newest proposal, then I guess the naming is less
>>> important, because it all attaches to transparent_hugepage.
>>
>> I agree that ALF is better. But having something under "THP", that is not THP
>> and not accounted as THP ... I don't quite like it (although, before we
>> discussed that approach in the past, I did like it).
> 
> I know we discussed and concluded against putting it under THP in the past, but
> I think that decision was driven by not not having any proposal that would allow
> us to put it under THP without breaking the expectations of existing (PMD-sized)
> THP users, or not being able to control use of the lower orders and PMD-order.

Not only that. It was also because we didn't want to confuse users/devs 
that assume that THP == PMD-sized.

I'll CC Hugh, I recall he had an opinion on that (I recall some comments 
about cleanly separating both features towards the user).

> Personally I think my latest proposal is a way to solve that problem, and in
> that case, I personally think exposing it as an extension to THP is neater:
> 
>   - all existing THP controls work as they did before
>   - new anon_orders and anon_always_mask files allow opt-in to
>     smaller-than-PMD-orders

As "enable" controls anon only (that's correct, right?), maybe these 
should also simply be called "orders" and "always_mask". shmem could get 
their own set, like "shmem_enable".

>   - All exisitng counters remain unchanged, and continue to count PMD-mapped THP
>     only:
>        - /proc/meminfo:AnonHugePages
>        - /sys/devices/system/node/nodeX/meminfo:AnonHugePages
>        - /proc/vmstat:nr_anon_transparent_hugepages
>        - /proc/<pid>/smaps[_roolup]:AnonHugePages
>        - memory.stat(v1):rss_huge
>        - memory.stat(v2):anon_thp
>   - New counters introduced to count PTE-mapped THP/large folios:
>        - /proc/meminfo:AnonHugePteMap
>        - /sys/devices/system/node/nodeX/meminfo:AnonHugePteMap
>        - /proc/vmstat:nr_anon_thp_pte
>        - /proc/<pid>/smaps[_roolup]:AnonHugePteMap
>        - memory.stat(v1):anon_thp_pte
>        - memory.stat(v2):anon_thp_pte
>   - It's a lot less code (I have an implementation for both approaches)
> 
> Admittedly, I haven't spent too much time thinking about the other thp counters
> in vmstat yet (e.g. thp_fault_alloc, thp_fault_fallback, etc). Proposal is that
> for now, they would continue to be PMD-order only. But I think you could
> probably hook those upto the PTE-mapped ones as well, instead of duplicating all
> the counters.
> 
> As Kiril mentioned, PTE-mapped THP is already a thing, so this approach just
> formalises it.

Not quite. PTE-mapped THP were just a side-effect of the transparency 
handling. We never allocated and populated PTE-mapped PMD-sized THP on 
allocation. So I don't immediately see the connection between both for 
this case.

Would you account a PTE-mapped (PMD-sized) THP as anon_thp or 
anon_thp_pte? What if it's mapped via PTEs and PMDs? I don't see how 
that formalises that case for the existing PMD-szed THP.

> 
> I also think the "huge" means PMD-size argument is a bit weak, given that THP
> supports PUD-size today for file mappings, and in the context of hugetlb, huge
> can mean contpte, pmd, contpmd, pud, etc.

I made similar statements in the past but was convinced otherwise :)

> 
> I'll have the patch set ready to post by Friday. How about I post it, then we
> can continue the conversation in the context of the actual code? If the
> concensus is that this is not the way to do it, then I'll post the large_folio
> version instead?

No strong opinion from my side, I considered a "fresh start" without the 
THP implication/thermonology after all the previous discussions cleaner 
[which I think was one of the outcomes of the previous discussions].

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-27 15:32           ` David Hildenbrand
@ 2023-09-27 19:04             ` Ryan Roberts
  2023-10-02 12:58               ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Ryan Roberts @ 2023-09-27 19:04 UTC (permalink / raw)
  To: David Hildenbrand, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Hugh Dickins
  Cc: Linux-MM

On 27/09/2023 16:32, David Hildenbrand wrote:
> On 27.09.23 09:23, Ryan Roberts wrote:
>> On 26/09/2023 19:31, David Hildenbrand wrote:
>>> On 25.09.23 10:51, Ryan Roberts wrote:
>>>> On 23/09/2023 01:33, John Hubbard wrote:
>>>>> On 9/22/23 08:48, Ryan Roberts wrote:
>>>>> ...
>>>>>> I never had any feedback on the below; I'm not sure if that means everyone is
>>>>>> happy or that nobody read it??
>>>>>
>>>>> One can never really know: zero or more people read it, and of those, no
>>>>> one hated it enough to send out a quick NAK. So that's a *possible*,
>>>>> lukewarm endorsement of sorts. Success! :)
>>>>
>>>> You really know how to fill a guy with confidence! ;-)
>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>> BUT I've had yet another idea on the controls front, which would enable
>>>>>> exposing
>>>>>> this to user space as an extension to transparent_hugepage, while
>>>>>> continuing to
>>>>>> support THP as is and also be able to control THP and ALF (anon large folio)
>>>>>
>>>>> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
>>>>> is a nice touch.
>>>>
>>>> Well if we go the route of the newest proposal, then I guess the naming is less
>>>> important, because it all attaches to transparent_hugepage.
>>>
>>> I agree that ALF is better. But having something under "THP", that is not THP
>>> and not accounted as THP ... I don't quite like it (although, before we
>>> discussed that approach in the past, I did like it).
>>
>> I know we discussed and concluded against putting it under THP in the past, but
>> I think that decision was driven by not not having any proposal that would allow
>> us to put it under THP without breaking the expectations of existing (PMD-sized)
>> THP users, or not being able to control use of the lower orders and PMD-order.
> 
> Not only that. It was also because we didn't want to confuse users/devs that
> assume that THP == PMD-sized.
> 
> I'll CC Hugh, I recall he had an opinion on that (I recall some comments about
> cleanly separating both features towards the user).

Were those comments made during the first meeting? I don't recall them, but will
go back and watch the video.

> 
>> Personally I think my latest proposal is a way to solve that problem, and in
>> that case, I personally think exposing it as an extension to THP is neater:
>>
>>   - all existing THP controls work as they did before
>>   - new anon_orders and anon_always_mask files allow opt-in to
>>     smaller-than-PMD-orders
> 
> As "enable" controls anon only (that's correct, right?), maybe these should also
> simply be called "orders" and "always_mask". shmem could get their own set, like
> "shmem_enable".

Yes, could do it that way. I thought that since "shmem_" was used when shmem was
introduced, it would be clearer to prefix the new anon controls too. Happy to
remove though.

> 
>>   - All exisitng counters remain unchanged, and continue to count PMD-mapped THP
>>     only:
>>        - /proc/meminfo:AnonHugePages
>>        - /sys/devices/system/node/nodeX/meminfo:AnonHugePages
>>        - /proc/vmstat:nr_anon_transparent_hugepages
>>        - /proc/<pid>/smaps[_roolup]:AnonHugePages
>>        - memory.stat(v1):rss_huge
>>        - memory.stat(v2):anon_thp
>>   - New counters introduced to count PTE-mapped THP/large folios:
>>        - /proc/meminfo:AnonHugePteMap
>>        - /sys/devices/system/node/nodeX/meminfo:AnonHugePteMap
>>        - /proc/vmstat:nr_anon_thp_pte
>>        - /proc/<pid>/smaps[_roolup]:AnonHugePteMap
>>        - memory.stat(v1):anon_thp_pte
>>        - memory.stat(v2):anon_thp_pte
>>   - It's a lot less code (I have an implementation for both approaches)
>>
>> Admittedly, I haven't spent too much time thinking about the other thp counters
>> in vmstat yet (e.g. thp_fault_alloc, thp_fault_fallback, etc). Proposal is that
>> for now, they would continue to be PMD-order only. But I think you could
>> probably hook those upto the PTE-mapped ones as well, instead of duplicating all
>> the counters.
>>
>> As Kiril mentioned, PTE-mapped THP is already a thing, so this approach just
>> formalises it.
> 
> Not quite. PTE-mapped THP were just a side-effect of the transparency handling.
> We never allocated and populated PTE-mapped PMD-sized THP on allocation. So I
> don't immediately see the connection between both for this case.

I'm just making the point that when they become PTE-mapped, we don't stop
calling them THP. I accept that its not exactly the same though.

> 
> Would you account a PTE-mapped (PMD-sized) THP as anon_thp or anon_thp_pte? What
> if it's mapped via PTEs and PMDs? I don't see how that formalises that case for
> the existing PMD-szed THP.

I account PTE-mapped THP as anon_thp_pte. And if the same folio is mapped both
ways, I account it in both counters. I'm not claiming that anon_thp +
anon_thp_pte = amount of allocated thp in total. anon_thp_pte is intended to
help debug; you can use it to see what percentage of pte-mapped memory is THP.

> 
>>
>> I also think the "huge" means PMD-size argument is a bit weak, given that THP
>> supports PUD-size today for file mappings, and in the context of hugetlb, huge
>> can mean contpte, pmd, contpmd, pud, etc.
> 
> I made similar statements in the past but was convinced otherwise :)
> 
>>
>> I'll have the patch set ready to post by Friday. How about I post it, then we
>> can continue the conversation in the context of the actual code? If the
>> concensus is that this is not the way to do it, then I'll post the large_folio
>> version instead?
> 
> No strong opinion from my side, I considered a "fresh start" without the THP
> implication/thermonology after all the previous discussions cleaner [which I
> think was one of the outcomes of the previous discussions].
> 

My concern is that the "fresh start" is not as simple as it appears. I've come
to the conclusion that if we have a new interface, then it should really be a
strict superset of THP to make it extensible in future. But that opens questions
about how you configure PMD-sized allocations when both interfaces disagree. For
"enabled" its fairly straightforward; you can do a logical OR. But its less
clear how to handle disagreement over defrag. And then you have huge_zero_page
and khugepaged etc, which might just stay with THP. But eventually we will
probably want to do async collapse for smaller order folios too, and at that
point you have to duplicate all those controls... So I concluded that actually
it is cleaner to just bolt on a small-order extension to THP. I've updated all
the docs, and that was pretty simple to do, which usually suggests that the
extension is purely additive and shouldn't be confusing.

Take a look at the patches, then make a judgement ;-)



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-09-27 19:04             ` Ryan Roberts
@ 2023-10-02 12:58               ` David Hildenbrand
  2023-10-05  7:37                 ` Ryan Roberts
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2023-10-02 12:58 UTC (permalink / raw)
  To: Ryan Roberts, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Hugh Dickins
  Cc: Linux-MM

>> Not only that. It was also because we didn't want to confuse users/devs that
>> assume that THP == PMD-sized.
>>
>> I'll CC Hugh, I recall he had an opinion on that (I recall some comments about
>> cleanly separating both features towards the user).
> 
> Were those comments made during the first meeting? I don't recall them, but will
> go back and watch the video.

Unless I am daydreaming, Willy made such comments during a THP cabal 
meeting, and Hugh somewhere on the list in other context. Maybe they 
changed their mind or I am making things up (after 4 days of fever 
dreams I don't know what's real anymore :D ).

The biggest concern was that huge really implies PMD (and maybe later 
PUD) -- and could end up confusing users, stats etc. Maybe it can be 
handled, I'll have to take a look at your proposal.

I was advocating just calling these things THP right from the start (and 
was using the arguments you are using in this mail :) ), but understood 
the concerns.

Apparently, freebsd wants to call these things "Medium-sized superpages" 
[1]. Of course, superpages are just huge pages, but everybody has to 
invent a new term for the same thing [I did not check who had it first 
;) ]. Of course, under Windows static (like hugetlb) huge pages are 
called large-pages.

[1] https://www.freebsd.org/status/report-2022-04-2022-06/superpages/

> 
>>
>>> Personally I think my latest proposal is a way to solve that problem, and in
>>> that case, I personally think exposing it as an extension to THP is neater:
>>>
>>>    - all existing THP controls work as they did before
>>>    - new anon_orders and anon_always_mask files allow opt-in to
>>>      smaller-than-PMD-orders
>>
>> As "enable" controls anon only (that's correct, right?), maybe these should also
>> simply be called "orders" and "always_mask". shmem could get their own set, like
>> "shmem_enable".
> 
> Yes, could do it that way. I thought that since "shmem_" was used when shmem was
> introduced, it would be clearer to prefix the new anon controls too. Happy to
> remove though.
> 
>>
>>>    - All exisitng counters remain unchanged, and continue to count PMD-mapped THP
>>>      only:
>>>         - /proc/meminfo:AnonHugePages
>>>         - /sys/devices/system/node/nodeX/meminfo:AnonHugePages
>>>         - /proc/vmstat:nr_anon_transparent_hugepages
>>>         - /proc/<pid>/smaps[_roolup]:AnonHugePages
>>>         - memory.stat(v1):rss_huge
>>>         - memory.stat(v2):anon_thp
>>>    - New counters introduced to count PTE-mapped THP/large folios:
>>>         - /proc/meminfo:AnonHugePteMap
>>>         - /sys/devices/system/node/nodeX/meminfo:AnonHugePteMap
>>>         - /proc/vmstat:nr_anon_thp_pte
>>>         - /proc/<pid>/smaps[_roolup]:AnonHugePteMap
>>>         - memory.stat(v1):anon_thp_pte
>>>         - memory.stat(v2):anon_thp_pte
>>>    - It's a lot less code (I have an implementation for both approaches)
>>>
>>> Admittedly, I haven't spent too much time thinking about the other thp counters
>>> in vmstat yet (e.g. thp_fault_alloc, thp_fault_fallback, etc). Proposal is that
>>> for now, they would continue to be PMD-order only. But I think you could
>>> probably hook those upto the PTE-mapped ones as well, instead of duplicating all
>>> the counters.
>>>
>>> As Kiril mentioned, PTE-mapped THP is already a thing, so this approach just
>>> formalises it.
>>
>> Not quite. PTE-mapped THP were just a side-effect of the transparency handling.
>> We never allocated and populated PTE-mapped PMD-sized THP on allocation. So I
>> don't immediately see the connection between both for this case.
> 
> I'm just making the point that when they become PTE-mapped, we don't stop
> calling them THP. I accept that its not exactly the same though.

... PTE-mappin them does not change their size ;) But I get what you mean.

> 
>>
>> Would you account a PTE-mapped (PMD-sized) THP as anon_thp or anon_thp_pte? What
>> if it's mapped via PTEs and PMDs? I don't see how that formalises that case for
>> the existing PMD-szed THP.
> 
> I account PTE-mapped THP as anon_thp_pte. And if the same folio is mapped both
> ways, I account it in both counters. I'm not claiming that anon_thp +
> anon_thp_pte = amount of allocated thp in total. anon_thp_pte is intended to
> help debug; you can use it to see what percentage of pte-mapped memory is THP.
> 
>>
>>>
>>> I also think the "huge" means PMD-size argument is a bit weak, given that THP
>>> supports PUD-size today for file mappings, and in the context of hugetlb, huge
>>> can mean contpte, pmd, contpmd, pud, etc.
>>
>> I made similar statements in the past but was convinced otherwise :)
>>
>>>
>>> I'll have the patch set ready to post by Friday. How about I post it, then we
>>> can continue the conversation in the context of the actual code? If the
>>> concensus is that this is not the way to do it, then I'll post the large_folio
>>> version instead?
>>
>> No strong opinion from my side, I considered a "fresh start" without the THP
>> implication/thermonology after all the previous discussions cleaner [which I
>> think was one of the outcomes of the previous discussions].
>>
> 
> My concern is that the "fresh start" is not as simple as it appears. I've come
> to the conclusion that if we have a new interface, then it should really be a
> strict superset of THP to make it extensible in future. But that opens questions

^ +1

> about how you configure PMD-sized allocations when both interfaces disagree. For
> "enabled" its fairly straightforward; you can do a logical OR. But its less
> clear how to handle disagreement over defrag. And then you have huge_zero_page
> and khugepaged etc, which might just stay with THP. But eventually we will

Probably we want everything that THP had (khugepaged, zeropage, ...) 
also on some (selected?) smaller orders.

> probably want to do async collapse for smaller order folios too, and at that
> point you have to duplicate all those controls... So I concluded that actually
> it is cleaner to just bolt on a small-order extension to THP. I've updated all
> the docs, and that was pretty simple to do, which usually suggests that the
> extension is purely additive and shouldn't be confusing.

Fine with me. I don't quite like bitmaps exposed to user space, though. 
Just having a user-readable list or a "directory" with various options 
as files might be cleaner ...

> 
> Take a look at the patches, then make a judgement ;-)
> 

... but we'll discuss it there :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-10-02 12:58               ` David Hildenbrand
@ 2023-10-05  7:37                 ` Ryan Roberts
       [not found]                   ` <c60321ef-8596-8fa0-7367-f43e69e1d894@redhat.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Ryan Roberts @ 2023-10-05  7:37 UTC (permalink / raw)
  To: David Hildenbrand, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Hugh Dickins
  Cc: Linux-MM

On 02/10/2023 13:58, David Hildenbrand wrote:
>> [...]
>>
>> My concern is that the "fresh start" is not as simple as it appears. I've come
>> to the conclusion that if we have a new interface, then it should really be a
>> strict superset of THP to make it extensible in future. But that opens questions
> 
> ^ +1
> 
>> about how you configure PMD-sized allocations when both interfaces disagree. For
>> "enabled" its fairly straightforward; you can do a logical OR. But its less
>> clear how to handle disagreement over defrag. And then you have huge_zero_page
>> and khugepaged etc, which might just stay with THP. But eventually we will
> 
> Probably we want everything that THP had (khugepaged, zeropage, ...) also on
> some (selected?) smaller orders.
> 
>> probably want to do async collapse for smaller order folios too, and at that
>> point you have to duplicate all those controls... So I concluded that actually
>> it is cleaner to just bolt on a small-order extension to THP. I've updated all
>> the docs, and that was pretty simple to do, which usually suggests that the
>> extension is purely additive and shouldn't be confusing.
> 
> Fine with me. I don't quite like bitmaps exposed to user space, though. Just
> having a user-readable list or a "directory" with various options as files might
> be cleaner ...
> 
>>
>> Take a look at the patches, then make a judgement ;-)
>>
> 
> ... but we'll discuss it there :)
> 

David, FYI, the patches are posted at [1] (you're cc'ed) and have been in
mm-unstable for nearly a week - so I guess they will go to mm-stable soon by
default. So if you want to object to any of it, now's the time ;-).

[1] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
       [not found]                   ` <c60321ef-8596-8fa0-7367-f43e69e1d894@redhat.com>
@ 2023-10-05  9:46                     ` Ryan Roberts
  2023-10-06 11:53                       ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Ryan Roberts @ 2023-10-05  9:46 UTC (permalink / raw)
  To: David Hildenbrand, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Hugh Dickins
  Cc: Linux-MM

On 05/10/2023 09:15, David Hildenbrand wrote:
> On 05.10.23 09:37, Ryan Roberts wrote:
>> On 02/10/2023 13:58, David Hildenbrand wrote:
>>>> [...]
>>>>
>>>> My concern is that the "fresh start" is not as simple as it appears. I've come
>>>> to the conclusion that if we have a new interface, then it should really be a
>>>> strict superset of THP to make it extensible in future. But that opens
>>>> questions
>>>
>>> ^ +1
>>>
>>>> about how you configure PMD-sized allocations when both interfaces disagree.
>>>> For
>>>> "enabled" its fairly straightforward; you can do a logical OR. But its less
>>>> clear how to handle disagreement over defrag. And then you have huge_zero_page
>>>> and khugepaged etc, which might just stay with THP. But eventually we will
>>>
>>> Probably we want everything that THP had (khugepaged, zeropage, ...) also on
>>> some (selected?) smaller orders.
>>>
>>>> probably want to do async collapse for smaller order folios too, and at that
>>>> point you have to duplicate all those controls... So I concluded that actually
>>>> it is cleaner to just bolt on a small-order extension to THP. I've updated all
>>>> the docs, and that was pretty simple to do, which usually suggests that the
>>>> extension is purely additive and shouldn't be confusing.
>>>
>>> Fine with me. I don't quite like bitmaps exposed to user space, though. Just
>>> having a user-readable list or a "directory" with various options as files might
>>> be cleaner ...
>>>
>>>>
>>>> Take a look at the patches, then make a judgement ;-)
>>>>
>>>
>>> ... but we'll discuss it there :)
>>>
>>
>> David, FYI, the patches are posted at [1] (you're cc'ed) and have been in
>> mm-unstable for nearly a week - so I guess they will go to mm-stable soon by
>> default. So if you want to object to any of it, now's the time ;-).
> 
> I just did :P
> 
> Note that I'm distracted by a tiny human being. I should be back at work tomorrow.

Ahh - congratulations!

> 
> Hopefully other people that participated in the discussions can review and ack
> in the meantime.

That would certainly be nice (hint to everyone else on the thread ;-)

> 
> IMHO there really is no need to rush at this point.

I have a couple of selfish reasons; I was hoping to get it into v6.7 since I was
thinking that would be the next LTS, but I've just done the maths again, and it
looks like it will be v6.6, so I guess I've missed it anyway. The other is that
I would like to move focus to other changes that build on this, and that's
difficult while this is still not merged.

> 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
  2023-10-05  9:46                     ` Ryan Roberts
@ 2023-10-06 11:53                       ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2023-10-06 11:53 UTC (permalink / raw)
  To: Ryan Roberts, John Hubbard, Matthew Wilcox, Yang Shi, Yin,
	Fengwei, Yu Zhao, Zi Yan, David Rientjes, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Hugh Dickins
  Cc: Linux-MM

On 05.10.23 11:46, Ryan Roberts wrote:
> On 05/10/2023 09:15, David Hildenbrand wrote:
>> On 05.10.23 09:37, Ryan Roberts wrote:
>>> On 02/10/2023 13:58, David Hildenbrand wrote:
>>>>> [...]
>>>>>
>>>>> My concern is that the "fresh start" is not as simple as it appears. I've come
>>>>> to the conclusion that if we have a new interface, then it should really be a
>>>>> strict superset of THP to make it extensible in future. But that opens
>>>>> questions
>>>>
>>>> ^ +1
>>>>
>>>>> about how you configure PMD-sized allocations when both interfaces disagree.
>>>>> For
>>>>> "enabled" its fairly straightforward; you can do a logical OR. But its less
>>>>> clear how to handle disagreement over defrag. And then you have huge_zero_page
>>>>> and khugepaged etc, which might just stay with THP. But eventually we will
>>>>
>>>> Probably we want everything that THP had (khugepaged, zeropage, ...) also on
>>>> some (selected?) smaller orders.
>>>>
>>>>> probably want to do async collapse for smaller order folios too, and at that
>>>>> point you have to duplicate all those controls... So I concluded that actually
>>>>> it is cleaner to just bolt on a small-order extension to THP. I've updated all
>>>>> the docs, and that was pretty simple to do, which usually suggests that the
>>>>> extension is purely additive and shouldn't be confusing.
>>>>
>>>> Fine with me. I don't quite like bitmaps exposed to user space, though. Just
>>>> having a user-readable list or a "directory" with various options as files might
>>>> be cleaner ...
>>>>
>>>>>
>>>>> Take a look at the patches, then make a judgement ;-)
>>>>>
>>>>
>>>> ... but we'll discuss it there :)
>>>>
>>>
>>> David, FYI, the patches are posted at [1] (you're cc'ed) and have been in
>>> mm-unstable for nearly a week - so I guess they will go to mm-stable soon by
>>> default. So if you want to object to any of it, now's the time ;-).
>>
>> I just did :P
>>
>> Note that I'm distracted by a tiny human being. I should be back at work tomorrow.
> 
> Ahh - congratulations!

Thanks man!

> 
>>
>> Hopefully other people that participated in the discussions can review and ack
>> in the meantime.
> 
> That would certainly be nice (hint to everyone else on the thread ;-)

Indeed, I'll do my best to provide feedback soon, but I shouldn't be the 
only one doing so :)

[my backlog is crazy large after some sick days, public holidays and 
tiny human beings]

> 
>>
>> IMHO there really is no need to rush at this point.
> 
> I have a couple of selfish reasons; I was hoping to get it into v6.7 since I was
> thinking that would be the next LTS, but I've just done the maths again, and it
> looks like it will be v6.6, so I guess I've missed it anyway. The other is that

Yes, that ship has sailed; also, it's not that much of value if merged 
but cannot be reasonably enabled yet due to other TODOs (IOW, no 
distribution would enable it).

> I would like to move focus to other changes that build on this, and that's
> difficult while this is still not merged.

IMHO, this is one of the big important features comparable to ordinary 
THP back then. We better take our time to get it right (well, rather as 
little wrong as possible :) ).

You can start sending out other work that depends on this, even if not 
merged yet. ... there tends to be a review bottleneck, so don't expect 
fast feedback; but it might be reasonable to let people know what's 
coming up next.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-10-06 11:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-14  8:16 ANON_LARGE_FOLIOS meeting follow-up & refined proposal Ryan Roberts
2023-09-22 15:48 ` Ryan Roberts
2023-09-23  0:33   ` John Hubbard
2023-09-25  8:51     ` Ryan Roberts
2023-09-26 18:31       ` David Hildenbrand
2023-09-27  7:23         ` Ryan Roberts
2023-09-27 15:32           ` David Hildenbrand
2023-09-27 19:04             ` Ryan Roberts
2023-10-02 12:58               ` David Hildenbrand
2023-10-05  7:37                 ` Ryan Roberts
     [not found]                   ` <c60321ef-8596-8fa0-7367-f43e69e1d894@redhat.com>
2023-10-05  9:46                     ` Ryan Roberts
2023-10-06 11:53                       ` David Hildenbrand
2023-09-26 18:34       ` David Hildenbrand
2023-09-26  8:13   ` Kirill A. Shutemov
2023-09-26 18:29   ` David Hildenbrand
2023-09-26 18:26 ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox