* The future of PageAnonExclusive
From: David Hildenbrand @ 2024-12-11 11:55 UTC
To: linux-mm; +Cc: willy, Ryan Roberts
Hi,
PageAnonExclusive (PAE) is working very reliably at this point. But
especially in the context of THPs (large folios) we'd like to do better:
(1) For PTE-mapped THP, we have to maintain it per page. We'd like to
avoid per-page flags as much as possible (e.g., waste in "struct
page", touching many cachelines).
(2) We currently have to use atomics to set/clear the flag, even when
working on tail pages. While there would be ways to mitigate this
when modifying the flags of multiple tail pages (bitlock protecting
all tail page flag updates), I'd much rather avoid messing with
tail page flags at all.
In general, the PAE bit can be considered an extended PTE bit that we
currently store in the "struct page" that is mapped by the PTE. Ideally,
we'd just store that information in the PTE, or alongside the PTE:
A writable PTE implies PAE. A write-protected PTE needs additional
information about whether it is PAE (-> whether we can just remap it
writable, FOLL_FORCE to it, PIN it ...).
We are out of PTE bits, especially when having to implement it across
*all* architectures. That's one of the reasons we went with PAE back
then. As a nice side-effect, it allowed for sanity checks when unpinning
folios (-> PAE must still be set, which would be impossible if the
information were stored in the PTE).
There are 3 main approaches I've been looking into:
(A) Make it a per-folio flag. I've spent endless hours trying to get it
conceptually right, but it's just a big pain: as soon as we clear
the flag, we have to make sure that all PTEs are write-protected,
that the folio is not pinned, and that concurrent GUP cannot work.
So far the page table lock protected the PAE bit, but with a per-
folio flag that is not guaranteed for THPs.
fork() with things like VM_DONTCOPY, VM_DONTFORK, early-abort, page
migration/swapout that can happen any time during fork etc. make
this really tough to get right with THPs. My head hurts any time I
think about it.
While I think fork() itself can be handled, the concurrent page
migration / swapout is where it gets extremely tricky.
This can be done somehow I'm sure, but the devil is in the corner
cases when having multiple PTEs mapping a large folio. We'd still
need atomics to set/clear the single folio flag, because of the
nature of concurrent folio flag updates.
(B) Allocate additional metadata (PAE bitmap) for page tables, protected
by the PTL. That is, when we want to *clear* PAE (fork(), KSM), we'd
lazily allocate the bitmap and store it for our page table.
On x86: 512 PTEs -> 512 bits -> 64 bytes
fork() gets a bit more expensive, because we'd have to allocate this
bitmap for the parent and the child page table we are working on, so
we can mark the pages as "!PAE" in both page tables.
This could work, I have not prototyped it. We'd have to support it
on the PTE/PMD/PUD-table level.
One tricky thing is having multiple pagetables per page, but I
assume it can be handled (we should have a single PT lock for all of
them IIRC, and only need to address the bitmap at the right offset).
Another challenge is how to link to this metadata from ptdesc on all
archs.. So far, __page_mapping is unused, and could maybe be used to
link to such metadata -- at least page tables can be identified
reliably using the page type.
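To make (B) a bit more concrete, a rough sketch of the bitmap side
(all names are made up here, and the lifetime/lockless questions are
ignored):

/* 512 bits -> 64 bytes on x86; one bit per PTE of the page table. */
struct pae_bitmap {
        DECLARE_BITMAP(exclusive, PTRS_PER_PTE);
};

/*
 * Called with the PTL held. "No bitmap" means "everything is PAE",
 * so it's the clearing of PAE that triggers the lazy allocation.
 */
static bool ptdesc_clear_pae(struct ptdesc *ptdesc, unsigned int idx)
{
        struct pae_bitmap *bm = ptdesc_pae_bitmap(ptdesc);

        if (!bm) {
                /* PTL is held, so no sleeping allocation here. */
                bm = kmalloc(sizeof(*bm), GFP_ATOMIC);
                if (!bm)
                        return false;
                /* Everything was implicitly exclusive so far. */
                bitmap_fill(bm->exclusive, PTRS_PER_PTE);
                ptdesc_set_pae_bitmap(ptdesc, bm);
        }
        clear_bit(idx, bm->exclusive);
        return true;
}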
(C) Encode it in the PTE.
pte_write() -> PAE
!pte_write() && pte_dirty() -> PAE
!pte_write() && !pte_dirty() -> !PAE
That implies that, when wrprotecting a PTE, we'd have to move the
dirty bit to the folio. When wr-unprotecting it, we could mark the
PTE dirty if the folio is dirty.
I suspect that most anon folios are dirty most of the time either
way, and the common case of having them just writable in the PTE
wouldn't change.
The main idea is that nobody (including HW) should ever be marking a
readonly PTE dirty (that's the theory behind it). We have to take good
care whenever we modify/query the dirty bit or modify the writable
bit.
There is quite some code to audit/sanitize. Further, we'd have to
decouple softdirty PTE handling from dirty PTE handling (pte_mkdirty
sets the pte softdirty), and adjust arm64 cont-pte and similar PTE
batching code to respect the per-PTE dirty bit when
the PTE is write-protected.
This would be the most elegant solution, but requires a bit of care
+ sanity checks.
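A minimal sketch of what (C) would mean in terms of the existing
generic PTE helpers (pte_is_anon_exclusive() and pte_wrprotect_anon()
are made up; the softdirty decoupling mentioned above is ignored here):

/* PAE derived from the PTE: writable, or write-protected + dirty. */
static inline bool pte_is_anon_exclusive(pte_t pte)
{
        return pte_write(pte) || pte_dirty(pte);
}

/*
 * Write-protecting has to move the HW dirty state to the folio first,
 * so that the PTE dirty bit is free to encode PAE afterwards.
 */
static inline pte_t pte_wrprotect_anon(pte_t pte, struct folio *folio,
                                       bool anon_exclusive)
{
        if (pte_dirty(pte))
                folio_mark_dirty(folio);
        pte = pte_wrprotect(pte_mkclean(pte));
        if (anon_exclusive)
                pte = pte_mkdirty(pte);
        return pte;
}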
Any thoughts or other ideas?
--
Cheers,
David / dhildenb
* Re: The future of PageAnonExclusive
From: David Hildenbrand @ 2024-12-11 11:56 UTC
To: linux-mm; +Cc: Ryan Roberts, Matthew Wilcox
Now CCing the correct Willy :)
On 11.12.24 12:55, David Hildenbrand wrote:
> Hi,
>
> PageAnonExclusive (PAE) is working very reliably at this point. But
> especially in the context of THPs (large folios) we'd like to do better:
>
> (1) For PTE-mapped THP, we have to maintain it per page. We'd like to
> avoid per-page flags as much as possible (e.g., waste in "struct
> page", touching many cachelines).
>
> (2) We currently have to use atomics to set/clear the flag, even when
> working on tail pages. While there would be ways to mitigate this
> when modifying the flags of multiple tail pages (bitlock protecting
> all tail page flag updates), I'd much rather avoid messing with
> tail page flags at all.
>
>
> In general, the PAE bit can be considered an extended PTE bit that we
> currently store in the "struct page" that is mapped by the PTE. Ideally,
> we'd just store that information in the PTE, or alongside the PTE:
>
> A writable PTE implies PAE. A write-protected PTE needs additional
> information about whether it is PAE (-> whether we can just remap it
> writable, FOLL_FORCE to it, PIN it ...).
>
> We are out of PTE bits, especially when having to implement it across
> *all* architectures. That's one of the reasons we went with PAE back
> then. As a nice side-effect, it allowed for sanity checks when unpinning
> folios (-> PAE must still be set, which would be impossible if the
> information were stored in the PTE).
>
>
> There are 3 main approaches I've been looking into:
>
> (A) Make it a per-folio flag. I've spent endless hours trying to get it
> conceptually right, but it's just a big pain: as soon as we clear
> the flag, we have to make sure that all PTEs are write-protected,
> that the folio is not pinned, and that concurrent GUP cannot work.
> So far the page table lock protected the PAE bit, but with a per-
> folio flag that is not guaranteed for THPs.
>
> fork() with things like VM_DONTCOPY, VM_DONTFORK, early-abort, page
> migration/swapout that can happen any time during fork etc. make
> this really tough to get right with THPs. My head hurts any time I
> think about it.
>
> While I think fork() itself can be handled, the concurrent page
> migration / swapout is where it gets extremely tricky.
>
> This can be done somehow I'm sure, but the devil is in the corner
> cases when having multiple PTEs mapping a large folio. We'd still
> need atomics to set/clear the single folio flag, because of the
> nature of concurrent folio flag updates.
>
> (B) Allocate additional metadata (PAE bitmap) for page tables, protected
> by the PTL. That is, when we want to *clear* PAE (fork(), KSM), we'd
> lazily allocate the bitmap and store it for our page table.
>
> On x86: 512 PTEs -> 512 bits -> 64 bytes
>
> fork() gets a bit more expensive, because we'd have to allocate this
> bitmap for the parent and the child page table we are working on, so
> we can mark the pages as "!PAE" in both page tables.
>
> This could work, I have not prototyped it. We'd have to support it
> on the PTE/PMD/PUD-table level.
>
> One tricky thing is having multiple pagetables per page, but I
> assume it can be handled (we should have a single PT lock for all of
> them IIRC, and only need to address the bitmap at the right offset).
>
> Another challenge is how to link to this metadata from ptdesc on all
> archs.. So far, __page_mapping is unused, and could maybe be used to
> link to such metadata -- at least page tables can be identified
> reliably using the page type.
>
> (C) Encode it in the PTE.
>
> pte_write() -> PAE
>
> !pte_write() && pte_dirty() -> PAE
>
> !pte_write() && !pte_dirty() -> !PAE
>
> That implies that, when wrprotecting a PTE, we'd have to move the
> dirty bit to the folio. When wr-unprotecting it, we could mark the
> PTE dirty if the folio is dirty.
>
> I suspect that most anon folios are dirty most of the time either
> way, and the common case of having them just writable in the PTE
> wouldn't change.
>
> The main idea is that nobody (including HW) should ever be marking a
> readonly PTE dirty (that's the theory behind it). We have to take good
> care whenever we modify/query the dirty bit or modify the writable
> bit.
>
> There is quite some code to audit/sanitize. Further, we'd have to
> decouple softdirty PTE handling from dirty PTE handling (pte_mkdirty
> sets the pte softdirty), and adjust arm64 cont-pte and similar PTE
> batching code to respect the per-PTE dirty bit when
> the PTE is write-protected.
>
> This would be the most elegant solution, but requires a bit of care
> + sanity checks.
>
>
> Any thoughts or other ideas?
>
--
Cheers,
David / dhildenb
* Re: The future of PageAnonExclusive
From: Kirill A. Shutemov @ 2024-12-11 13:42 UTC
To: David Hildenbrand; +Cc: linux-mm, Ryan Roberts, Matthew Wilcox
On Wed, Dec 11, 2024 at 12:56:11PM +0100, David Hildenbrand wrote:
> > (C) Encode it in the PTE.
> >
> > pte_write() -> PAE
> >
> > !pte_write() && pte_dirty() -> PAE
> >
> > !pte_write() && !pte_dirty() -> !PAE
You are late to the party. On x86, !pte_write() && pte_dirty() is shadow
stack.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: The future of PageAnonExclusive
From: David Hildenbrand @ 2024-12-11 13:48 UTC
To: Kirill A. Shutemov; +Cc: linux-mm, Ryan Roberts, Matthew Wilcox
On 11.12.24 14:42, Kirill A. Shutemov wrote:
> On Wed, Dec 11, 2024 at 12:56:11PM +0100, David Hildenbrand wrote:
>>> (C) Encode it in the PTE.
>>>
>>> pte_write() -> PAE
>>>
>>> !pte_write() && pte_dirty() -> PAE
>>>
>>> !pte_write() && !pte_dirty() -> !PAE
>
> You are late to the party. On x86, !pte_write() && pte_dirty() is shadow
> stack.
>
Hah, no, it works! :)
On x86 we use this fancy SavedDirty bit to handle that internally, such
that pte_write()/pte_dirty() keep working as expected on shadow stacks.
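Roughly what x86 does (simplified from memory -- see
arch/x86/include/asm/pgtable.h for the real thing):

static inline pte_t pte_wrprotect(pte_t pte)
{
        /*
         * Write=0,Dirty=1 would read as a shadow-stack PTE, so the HW
         * Dirty bit gets moved into the software SavedDirty bit when
         * write-protecting.
         */
        if (pte_flags(pte) & _PAGE_DIRTY) {
                pte = pte_clear_flags(pte, _PAGE_DIRTY);
                pte = pte_set_flags(pte, _PAGE_SAVED_DIRTY);
        }
        return pte_clear_flags(pte, _PAGE_RW);
}

/* pte_dirty() reports the HW Dirty bit as well as SavedDirty. */
static inline bool pte_dirty(pte_t pte)
{
        return pte_flags(pte) & (_PAGE_DIRTY | _PAGE_SAVED_DIRTY);
}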
--
Cheers,
David / dhildenb
* Re: The future of PageAnonExclusive
From: Ryan Roberts @ 2024-12-11 14:25 UTC
To: David Hildenbrand, linux-mm; +Cc: Matthew Wilcox
On 11/12/2024 11:56, David Hildenbrand wrote:
> Now CCing the correct Willy :)
>
> On 11.12.24 12:55, David Hildenbrand wrote:
>> Hi,
>>
>> PageAnonExclusive (PAE) is working very reliably at this point. But
>> especially in the context of THPs (large folios) we'd like to do better:
>>
>> (1) For PTE-mapped THP, we have to maintain it per page. We'd like to
>> avoid per-page flags as much as possible (e.g., waste in "struct
>> page", touching many cachelines).
Presumably also important for the Glorious Future where struct page is just a
pointer and struct folio (et al) is allocated dynamically?
>>
>> (2) We currently have to use atomics to set/clear the flag, even when
>> working on tail pages. While there would be ways to mitigate this
>> when modifying the flags of multiple tail pages (bitlock protecting
>> all tail page flag updates), I'd much rather avoid messing with
>> tail page flags at all.
>>
>>
>> In general, the PAE bit can be considered an extended PTE bit that we
>> currently store in the "struct page" that is mapped by the PTE. Ideally,
>> we'd just store that information in the PTE, or alongside the PTE:
>>
>> A writable PTE implies PAE. A write-protected PTE needs additional
>> information about whether it is PAE (-> whether we can just remap it
>> writable, FOLL_FORCE to it, PIN it ...).
>>
>> We are out of PTE bits, especially when having to implement it across
>> *all* architectures. That's one of the reasons we went with PAE back
>> then. As a nice side-effect, it allowed for sanity checks when unpinning
>> folios (-> PAE must still be set, which would be impossible if the
>> information were stored in the PTE).
>>
>>
>> There are 3 main approaches I've been looking into:
>>
>> (A) Make it a per-folio flag. I've spent endless hours trying to get it
>> conceptually right, but it's just a big pain: as soon as we clear
>> the flag, we have to make sure that all PTEs are write-protected,
>> that the folio is not pinned, and that concurrent GUP cannot work.
>> So far the page table lock protected the PAE bit, but with a per-
>> folio flag that is not guaranteed for THPs.
>>
>> fork() with things like VM_DONTCOPY, VM_DONTFORK, early-abort, page
>> migration/swapout that can happen any time during fork etc. make
>> this really tough to get right with THPs. My head hurts any time I
>> think about it.
>>
>> While I think fork() itself can be handled, the concurrent page
>> migration / swapout is where it gets extremely tricky.
>>
>> This can be done somehow I'm sure, but the devil is in the corner
>> cases when having multiple PTEs mapping a large folio. We'd still
>> need atomics to set/clear the single folio flag, because of the
>> nature of concurrent folio flag updates.
>>
>> (B) Allocate additional metadata (PAE bitmap) for page tables, protected
>> by the PTL. That is, when we want to *clear* PAE (fork(), KSM), we'd
>> lazily allocate the bitmap and store it for our page table.
>>
>> On x86: 512 PTEs -> 512 bits -> 64 bytes
>>
>> fork() gets a bit more expensive, because we'd have to allocate this
>> bitmap for the parent and the child page table we are working on, so
>> we can mark the pages as "!PAE" in both page tables.
>>
>> This could work, I have not prototyped it. We'd have to support it
>> on the PTE/PMD/PUD-table level.
>>
>> One tricky thing is having multiple pagetables per page, but I
>> assume it can be handled (we should have a single PT lock for all of
>> them IIRC, and only need to address the bitmap at the right offset).
>>
>> Another challenge is how to link to this metadata from ptdesc on all
>> archs.. So far, __page_mapping is unused, and could maybe be used to
>> link to such metadata -- at least page tables can be identified
>> reliably using the page type.
FWIW, I did a prototype of this sort of thing a while back to try to create some
extra (general purpose) PTE bits for arm64. There is already a union of various
arch-specific things in ptdesc, none of which were used by arm64 so I just added
an arm64 field to that. That doesn't help you though.
As I recall it all got horrible because I couldn't read the extra bits
atomically with the rest of the PTE and I couldn't convince myself that it was
always safe for lockless walkers. (I think we had a fairly long thread talking
about it). Anyway, I suspect that this is not a problem for your case because
you'll be operating at a higher level where you can always guarantee the PTL is
held?
arm64 uses the slab allocator for its top-level tables when that level is not an
entire page. So there is no ptdesc to attach to in that case. My prototype
swerved that by disallowing block mappings at the top level.
>>
>> (C) Encode it in the PTE.
>>
>> pte_write() -> PAE
>>
>> !pte_write() && pte_dirty() -> PAE
>>
>> !pte_write() && !pte_dirty() -> !PAE
>>
>> That implies that, when wrprotecting a PTE, we'd have to move the
>> dirty bit to the folio. When wr-unprotecting it, we could mark the
>> PTE dirty if the folio is dirty.
>>
>> I suspect that most anon folios are dirty most of the time either
>> way, and the common case of having them just writable in the PTE
>> wouldn't change.
>>
>> The main idea is that nobody (including HW) should ever be marking a
>> readonly PTE dirty (that's the theory behind it). We have to take good
>> care whenever we modify/query the dirty bit or modify the writable
>> bit.
>>
>> There is quite some code to audit/sanitize. Further, we'd have to
>> decouple softdirty PTE handling from dirty PTE handling (pte_mkdirty
>> sets the pte softdirty), and adjust arm64 cont-pte and similar PTE
>> batching code to respect the per-PTE dirty bit when
>> the PTE is write-protected.
>>
>> This would be the most elegant solution, but requires a bit of care
>> + sanity checks.
This sounds like it could all get quite fragile to me. Lots of potential to get
accidentally broken over time...
>>
>>
>> Any thoughts or other ideas?
What happened to the idea in your "every mapping counts" paper? Doesn't that
provide this info?
Thanks,
Ryan
>>
>
>
* Re: The future of PageAnonExclusive
From: David Hildenbrand @ 2024-12-11 14:49 UTC
To: Ryan Roberts, linux-mm; +Cc: Matthew Wilcox
On 11.12.24 15:25, Ryan Roberts wrote:
> On 11/12/2024 11:56, David Hildenbrand wrote:
>> Now CCing the correct Willy :)
>>
>> On 11.12.24 12:55, David Hildenbrand wrote:
>>> Hi,
>>>
>>> PageAnonExclusive (PAE) is working very reliably at this point. But
>>> especially in the context of THPs (large folios) we'd like to do better:
>>>
>>> (1) For PTE-mapped THP, we have to maintain it per page. We'd like to
>>> avoid per-page flags as much as possible (e.g., waste in "struct
>>> page", touching many cachelines).
>
> Presumably also important for the Glorious Future where struct page is just a
> pointer and struct folio (et al) is allocated dynamically?
I think Willy mentioned that there might be ways to encode it in the
8 bytes for the "tail" pages.
>
>>>
>>> (2) We currently have to use atomics to set/clear the flag, even when
>>> working on tail pages. While there would be ways to mitigate this
>>> when modifying the flags of multiple tail pages (bitlock protecting
>>> all tail page flag updates), I'd much rather avoid messing with
>>> tail page flags at all.
>>>
>>>
>>> In general, the PAE bit can be considered an extended PTE bit that we
>>> currently store in the "struct page" that is mapped by the PTE. Ideally,
>>> we'd just store that information in the PTE, or alongside the PTE:
>>>
>>> A writable PTE implies PAE. A write-protected PTE needs additional
>>> information about whether it is PAE (-> whether we can just remap it
>>> writable, FOLL_FORCE to it, PIN it ...).
>>>
>>> We are out of PTE bits, especially when having to implement it across
>>> *all* architectures. That's one of the reasons we went with PAE back
>>> then. As a nice side-effect, it allowed for sanity checks when unpinning
>>> folios (-> PAE must still be set, which would be impossible if the
>>> information were stored in the PTE).
>>>
>>>
>>> There are 3 main approaches I've been looking into:
>>>
>>> (A) Make it a per-folio flag. I've spent endless hours trying to get it
>>> conceptually right, but it's just a big pain: as soon as we clear
>>> the flag, we have to make sure that all PTEs are write-protected,
>>> that the folio is not pinned, and that concurrent GUP cannot work.
>>> So far the page table lock protected the PAE bit, but with a per-
>>> folio flag that is not guaranteed for THPs.
>>>
>>> fork() with things like VM_DONTCOPY, VM_DONTFORK, early-abort, page
>>> migration/swapout that can happen any time during fork etc. make
>>> this really tough to get right with THPs. My head hurts any time I
>>> think about it.
>>>
>>> While I think fork() itself can be handled, the concurrent page
>>> migration / swapout is where it gets extremely tricky.
>>>
>>> This can be done somehow I'm sure, but the devil is in the corner
>>> cases when having multiple PTEs mapping a large folio. We'd still
>>> need atomics to set/clear the single folio flag, because of the
>>> nature of concurrent folio flag updates.
>>>
>>> (B) Allocate additional metadata (PAE bitmap) for page tables, protected
>>> by the PTL. That is, when we want to *clear* PAE (fork(), KSM), we'd
>>> lazily allocate the bitmap and store it for our page table.
>>>
>>> On x86: 512 PTEs -> 512 bits -> 64 bytes
>>>
>>> fork() gets a bit more expensive, because we'd have to allocate this
>>> bitmap for the parent and the child page table we are working on, so
>>> we can mark the pages as "!PAE" in both page tables.
>>>
>>> This could work, I have not prototyped it. We'd have to support it
>>> on the PTE/PMD/PUD-table level.
>>>
>>> One tricky thing is having multiple pagetables per page, but I
>>> assume it can be handled (we should have a single PT lock for all of
>>> them IIRC, and only need to address the bitmap at the right offset).
>>>
>>> Another challenge is how to link to this metadata from ptdesc on all
>>> archs.. So far, __page_mapping is unused, and could maybe be used to
>>> link to such metadata -- at least page tables can be identified
>>> reliably using the page type.
>
> FWIW, I did a prototype of this sort of thing a while back to try to create some
> extra (general purpose) PTE bits for arm64. There is already a union of various
> arch-specific things in ptdesc, none of which were used by arm64 so I just added
> an arm64 field to that. That doesn't help you though.
Right.
>
> As I recall it all got horrible because I couldn't read the extra bits
> atomically with the rest of the PTE and I couldn't convince myself that it was
> always safe for lockless walkers. (I think we had a fairly long thread talking
> about it). Anyway, I suspect that this is not a problem for your case because
> you'll be operating at a higher level where you can always gaurantee the PTL is
> held?
Well, there is GUP-fast, without any locking :(
Lazily allocating it would probably work: if there is no bitmap pointer,
everything is exclusive. If there is a bitmap pointer, it cannot vanish.
But the RCU freeing of page tables etc. ... doesn't make this any
easier to implement.
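The lockless happy path could look roughly like this (same made-up
names as in the sketch for (B) earlier in the thread; pae_bitmap would
be a new ptdesc field, and the hard part is tying its lifetime to the
RCU lifetime of the page table itself):

static bool pte_pae_lockless(struct ptdesc *ptdesc, unsigned int idx)
{
        struct pae_bitmap *bm = READ_ONCE(ptdesc->pae_bitmap);

        /* No bitmap -> everything in this page table is exclusive. */
        return !bm || test_bit(idx, bm->exclusive);
}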
>
> arm64 uses slab allocator for its top level tables when that level is not an
> entire page. So there is no ptdesc to attach to in that case. My prototype
> swerved that by disallowing block mappings at the top level.
After writing it here, I realized that lazy allocation will be a bit of
a problem for remapping a PMD-mapped THP using PTEs. The page table we
deposit would already have to have that bitmap allocated; otherwise we
might not have the bitmap when remapping a PMD-mapped THP that is
shared ... and we are not guaranteed to be able to allocate memory at
that point.
>
>>>
>>> (C) Encode it in the PTE.
>>>
>>> pte_write() -> PAE
>>>
>>> !pte_write() && pte_dirty() -> PAE
>>>
>>> !pte_write() && !pte_dirty() -> !PAE
>>>
>>> That implies that, when wrprotecting a PTE, we'd have to move the
>>> dirty bit to the folio. When wr-unprotecting it, we could mark the
>>> PTE dirty if the folio is dirty.
>>>
>>> I suspect that most anon folios are dirty most of the time either
>>> way, and the common case of having them just writable in the PTE
>>> wouldn't change.
>>>
>>> The main idea is that nobody (including HW) should ever be marking a
>>> readonly PTE dirty (that's the theory behind it). We have to take good
>>> care whenever we modify/query the dirty bit or modify the writable
>>> bit.
>>>
>>> There is quite some code to audit/sanitize. Further, we'd have to
>>> decouple softdirty PTE handling from dirty PTE handling (pte_mkdirty
>>> sets the pte softdirty), and adjust arm64 cont-pte and similar PTE
>>> batching code to respect the per-PTE dirty bit when
>>> the PTE is write-protected.
>>>
>>> This would be the most elegant solution, but requires a bit of care
>>> + sanity checks.
>
> This sounds like it could all get quite fragile to me. Lots of potential to get
> accidentally broken over time...
It could be fairly well sanity-checked, I think.
Of all the options, it's the clearest regarding locking, memory
allocation ... and the rules can be documented very easily.
Whereas (A) is just a nightmare, and I get the feeling that (B) is as well.
Yeah, ideally we'd have a spare PTE bit, but that is pretty much out of
the picture ... :(
>
>>>
>>>
>>> Any thoughts or other ideas?
>
> What happened to the idea in your "every mapping counts" paper?
I'm still working on getting something simpler upstream first (I sent a
v1 for MM owner tracking, which I am reworking as we speak to be a bit
simpler and maybe also work for 32-bit ... somehow, so we can enable it
unconditionally). It's all tricky ...
> Doesn't that
> provide this info?
Unfortunately not. "mapped exclusively" and "Anon Exclusive" are two
different things.
"Anon exclusive" implies "mapped exclusively", but not the other way around.
We could detect "this is mapped by one process" (the old broken
page_mapcount()==1 check) but have vmsplice/O_DIRECT/swapcache
references from another process; for example, the famous vmsplice issue.
And because we cannot decide whether the references are from *this*
process or from another one, we can only get it wrong.
To reset PAE, one can use "mapped exclusively + all references from
pinnings", but it cannot replace PAE.
--
Cheers,
David / dhildenb
* Re: The future of PageAnonExclusive
From: Matthew Wilcox @ 2024-12-11 15:45 UTC
To: David Hildenbrand; +Cc: Ryan Roberts, linux-mm
On Wed, Dec 11, 2024 at 03:49:12PM +0100, David Hildenbrand wrote:
> On 11.12.24 15:25, Ryan Roberts wrote:
> > On 11/12/2024 11:56, David Hildenbrand wrote:
> > > Now CCing the correct Willy :)
> > >
> > > On 11.12.24 12:55, David Hildenbrand wrote:
> > > > Hi,
> > > >
> > > > PageAnonExclusive (PAE) is working very reliably at this point. But
> > > > especially in the context of THPs (large folios) we'd like to do better:
> > > >
> > > > (1) For PTE-mapped THP, we have to maintain it per page. We'd like to
> > > > avoid per-page flags as much as possible (e.g., waste in "struct
> > > > page", touching many cachelines).
> >
> > Presumably also important for the Glorious Future where struct page is just a
> > pointer and struct folio (et al) is allocated dynamically?
>
> I think Willy mentioned that there might be ways to encode it in the
> 8 bytes for the "tail" pages.
Yes. For anon memory, the page->memdesc has a 4-bit 'type' and the
remaining 60 bits are a pointer to a struct folio (allocated from a slab
with 16-byte alignment). The current list of types [1] has file folios
as type 2 and anon folios as type 3. We could allocate a type to be
'anon exclusive', thus essentially giving us an anon-exclusive bit.
[1] https://kernelnewbies.org/MatthewWilcox/Memdescs
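Something along these lines (nothing of this exists yet; the names and
type values are made up based on the writeup):

#define MEMDESC_TYPE_MASK  0xfUL        /* low 4 bits are free thanks to
                                           the 16-byte slab alignment */
#define MEMDESC_FOLIO_FILE      2
#define MEMDESC_FOLIO_ANON      3
#define MEMDESC_FOLIO_ANON_EXCL 4       /* the extra "anon exclusive" type */

static inline struct folio *memdesc_folio(unsigned long memdesc)
{
        return (struct folio *)(memdesc & ~MEMDESC_TYPE_MASK);
}

static inline bool memdesc_anon_exclusive(unsigned long memdesc)
{
        return (memdesc & MEMDESC_TYPE_MASK) == MEMDESC_FOLIO_ANON_EXCL;
}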
Don't get too excited about "we're almost out of types". The "managed"
type has subtypes. We could also collapse "file" and "anon" into a
single type and distinguish between them with a bit in the folio.
Anyway, yes, we can do one per-page flag. Two per-page flags start to
get dicey.
* Re: The future of PageAnonExclusive
From: David Hildenbrand @ 2024-12-11 15:50 UTC
To: Matthew Wilcox; +Cc: Ryan Roberts, linux-mm
On 11.12.24 16:45, Matthew Wilcox wrote:
> On Wed, Dec 11, 2024 at 03:49:12PM +0100, David Hildenbrand wrote:
>> On 11.12.24 15:25, Ryan Roberts wrote:
>>> On 11/12/2024 11:56, David Hildenbrand wrote:
>>>> Now CCing the correct Willy :)
>>>>
>>>> On 11.12.24 12:55, David Hildenbrand wrote:
>>>>> Hi,
>>>>>
>>>>> PageAnonExclusive (PAE) is working very reliably at this point. But
>>>>> especially in the context of THPs (large folios) we'd like to do better:
>>>>>
>>>>> (1) For PTE-mapped THP, we have to maintain it per page. We'd like to
>>>>> avoid per-page flags as much as possible (e.g., waste in "struct
>>>>> page", touching many cachelines).
>>>
>>> Presumably also important for the Glorious Future where struct page is just a
>>> pointer and struct folio (et al) is allocated dynamically?
>>
>> I think Willy mentioned that there might be ways to encode it in the
>> 8 bytes for the "tail" pages.
>
> Yes. For anon memory, the page->memdesc has a 4-bit 'type' and the
> remaining 60 bits are a pointer to a struct folio (allocated from a slab
> with 16-byte alignment). The current list of types [1] has file folios
> as type 2 and anon folios as type 3. We could allocate a type to be
> 'anon exclusive', thus essentially giving us an anon-exclusive bit.
Right.
>
> [1] https://kernelnewbies.org/MatthewWilcox/Memdescs
>
> Don't get too excited about "we're almost out of types". The "managed"
> type has subtypes. We could also collapse "file" and "anon" into a
> single type and distinguish between them with a bit in the folio.
>
> Anyway, yes, we can do one per-page flag. Two per-page flags start to
> get dicey.
hwpoison? :/
--
Cheers,
David / dhildenb
* Re: The future of PageAnonExclusive
From: Matthew Wilcox @ 2024-12-11 16:11 UTC
To: David Hildenbrand; +Cc: Ryan Roberts, linux-mm
On Wed, Dec 11, 2024 at 04:50:12PM +0100, David Hildenbrand wrote:
> On 11.12.24 16:45, Matthew Wilcox wrote:
> > [1] https://kernelnewbies.org/MatthewWilcox/Memdescs
> >
> > Don't get too excited about "we're almost out of types". The "managed"
> > type has subtypes. We could also collapse "file" and "anon" into a
> > single type and distinguish between them with a bit in the folio.
> >
> > Anyway, yes, we can do one per-page flag. Two per-page flags start to
> > get dicey.
>
> hwpoison? :/
type 9! My thinking is that the hwpoison type contains an orig_memdesc
field, as well as whatever else is needed to describe the hwpoison that
was discovered (start, length, what else?)
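I.e., something like this (purely hypothetical, names made up):

struct hwpoison {
        unsigned long orig_memdesc;     /* what the page was before */
        unsigned int start;             /* offset of the poisoned bytes */
        unsigned int length;            /* length of the poisoned range */
        /* ... whatever else is needed ... */
};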
* Re: The future of PageAnonExclusive
From: David Hildenbrand @ 2024-12-11 16:15 UTC
To: Matthew Wilcox; +Cc: Ryan Roberts, linux-mm
On 11.12.24 17:11, Matthew Wilcox wrote:
> On Wed, Dec 11, 2024 at 04:50:12PM +0100, David Hildenbrand wrote:
>> On 11.12.24 16:45, Matthew Wilcox wrote:
>>> [1] https://kernelnewbies.org/MatthewWilcox/Memdescs
>>>
>>> Don't get too excited about "we're almost out of types". The "managed"
>>> type has subtypes. We could also collapse "file" and "anon" into a
>>> single type and distinguish between them with a bit in the folio.
>>>
>>> Anyway, yes, we can do one per-page flag. Two per-page flags start to
>>> get dicey.
>>
>> hwpoison? :/
>
> type 9! My thinking is that the hwpoison type contains an orig_memdesc
> field, as well as whatever else is needed to describe the hwpoison that
> was discovered (start, length, what else?)
Ah, I see. So you'd have to allocate that type during an MCE, and have
all file/anon detection code respect that indirection as well.
--
Cheers,
David / dhildenb