Re: The future of PageAnonExclusive

From: David Hildenbrand <david@redhat.com>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: The future of PageAnonExclusive
Date: Wed, 11 Dec 2024 12:56:11 +0100
Message-ID: <3a9e7c3b-7b69-4f16-80ea-b1a0d4dba853@redhat.com>
In-Reply-To: <9c2a17af-4df8-42f0-93c8-83133b6104fd@redhat.com>

Now CCing the correct Willy :)

On 11.12.24 12:55, David Hildenbrand wrote:
> Hi,
> 
> PageAnonExclusive (PAE) is working very reliably at this point. But
> especially in the context of THPs (large folios) we'd like to do better:
> 
> (1) For PTE-mapped THP, we have to maintain it per page. We'd like to
>       avoid per-page flags as much as possible (e.g., waste in "struct
>       page", touching many cachelines).
> 
> (2) We currently have to use atomics to set/clear the flag, even when
>       working on tail pages. While there would be ways to mitigate this
>       when modifying the flags of multiple tail pages (bitlock protecting
>       all tail page flag updates), I'd much rather avoid messing with
>       tail page flags at all.
> 
> 
> In general, the PAE bit can be considered an extended PTE bit that we
> currently store in the "struct page" that is mapped by the PTE. Ideally,
> we'd just store that information in the PTE, or alongside the PTE:
> 
> A writable PTE implies PAE. A write-protected PTE needs additional
> information whether it is PAE (-> whether we can just remap it writable,
> FOLL_FORCE to it, PIN it ...).
> 
> We are out of PTE bits, especially when having to implement it across
> *all* architectures. That's one of the reasons we went with PAE back
> then. As a nice side effect, it allowed for sanity checks when unpinning
> folios (-> PAE must still be set), which would be impossible if the
> information were stored in the PTE.
> 
> 
> There are 3 main approaches I've been looking into:
> 
> (A) Make it a per-folio flag. I've spent endless hours trying to get it
>       conceptually right, but it's just a big pain: as soon as we clear
>       the flag, we have to make sure that all PTEs are write-protected,
>       that the folio is not pinned, and that concurrent GUP cannot work.
>       So far the page table lock protected the PAE bit, but with a per-
>       folio flag that is not guaranteed for THPs.
> 
>       fork() with things like VM_DONTCOPY, VM_DONTFORK, early-abort, page
>       migration/swapout that can happen any time during fork etc. make
>       this really tough to get right with THPs. My head hurts any time I
>       think about it.
> 
>       While I think fork() itself can be handled, the concurrent page
>       migration / swapout is where it gets extremely tricky.
> 
>       This can be done somehow I'm sure, but the devil is in the corner
>       cases when having multiple PTEs mapping a large folio. We'd still
>       need atomics to set/clear the single folio flag, because of the
>       nature of concurrent folio flag updates.
> 
> (B) Allocate additional metadata (PAE bitmap) for page tables, protected
>       by the PTL. That is, when we want to *clear* PAE (fork(), KSM), we'd
>       lazily allocate the bitmap and store it for our page table.
> 
>       On x86: 512 PTEs -> 512 bits -> 64 bytes
> 
>       fork() gets a bit more expensive, because we'd have to allocate this
>       bitmap for the parent and the child page table we are working on, so
>       we can mark the pages as "!PAE" in both page tables.
> 
>       This could work, I have not prototyped it. We'd have to support it
>       on the PTE/PMD/PUD-table level.
> 
>       One tricky thing is having multiple pagetables per page, but I
>       assume it can be handled (we should have a single PT lock for all of
>       them IIRC, and only need to address the bitmap at the right offset).
> 
>       Another challenge is how to link to this metadata from ptdesc on all
>       archs. So far, __page_mapping is unused, and could maybe be used to
>       link to such metadata -- at least page tables can be identified
>       reliably using the page type.
> 
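To make (B) a bit more concrete, here is a minimal userspace sketch, not
kernel code and not something I have prototyped. Everything in it is an
assumption for illustration: "struct pt_meta" is a hypothetical stand-in
for whatever ptdesc would link to, the helper names are made up, calloc()
stands in for a kernel allocator, and storing the bits inverted ("not
PAE") so that a missing bitmap means "nothing was ever cleared" is just
one possible choice. The caller is assumed to hold the PTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_PTE  512                            /* x86-64, 4 KiB pages */
#define BITS_PER_WORD 64
#define PAE_WORDS     (PTRS_PER_PTE / BITS_PER_WORD) /* 8 words = 64 bytes */

/* Hypothetical per-page-table metadata; NOT the real struct ptdesc. */
struct pt_meta {
	uint64_t *not_pae;	/* NULL until an entry first loses PAE */
};

/* Query PAE for the entry at idx. No bitmap means no entry was cleared. */
static bool pt_test_pae(const struct pt_meta *pt, unsigned int idx)
{
	assert(idx < PTRS_PER_PTE);
	if (!pt->not_pae)
		return true;
	return !(pt->not_pae[idx / BITS_PER_WORD] &
		 (UINT64_C(1) << (idx % BITS_PER_WORD)));
}

/* Clear PAE (fork(), KSM): lazily allocate the bitmap on first use. */
static int pt_clear_pae(struct pt_meta *pt, unsigned int idx)
{
	assert(idx < PTRS_PER_PTE);
	if (!pt->not_pae) {
		pt->not_pae = calloc(PAE_WORDS, sizeof(uint64_t));
		if (!pt->not_pae)
			return -1;	/* allocation failed, caller must back off */
	}
	pt->not_pae[idx / BITS_PER_WORD] |= UINT64_C(1) << (idx % BITS_PER_WORD);
	return 0;
}

/* Set PAE again; never needs an allocation. */
static void pt_set_pae(struct pt_meta *pt, unsigned int idx)
{
	assert(idx < PTRS_PER_PTE);
	if (pt->not_pae)
		pt->not_pae[idx / BITS_PER_WORD] &=
			~(UINT64_C(1) << (idx % BITS_PER_WORD));
}
```

With this layout only the clear path can fail, which matches the extra
cost landing in fork(): it would have to call something like
pt_clear_pae() for both the parent and the child page table.
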
> (C) Encode it in the PTE.
> 
>       pte_write() -> PAE
> 
>       !pte_write() && pte_dirty() -> PAE
> 
>       !pte_write() && !pte_dirty() -> !PAE
> 
>       That implies that, when wrprotecting a PTE, we'd have to move the
>       dirty bit to the folio. When wr-unprotecting it, we could mark the
>       PTE dirty if the folio is dirty.
> 
>       I suspect that most anon folios are dirty most of the time either
>       way, and the common case of having them just writable in the PTE
>       wouldn't change.
> 
>       The main idea is that nobody (including HW) should ever be marking a
>       readonly PTE dirty (that's the theory, at least). We have to take
>       good care whenever we modify/query the dirty bit or modify the
>       writable bit.
> 
>       There is quite some code to audit/sanitize. Further, we'd have to
>       decouple softdirty PTE handling from dirty PTE handling
>       (pte_mkdirty() also sets the PTE softdirty), and adjust arm64
>       cont-pte and similar PTE batching code to respect the per-PTE dirty
>       bit when the PTE is write-protected.
> 
>       This would be the most elegant solution, but requires a bit of care
>       + sanity checks.
> 
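The encoding in (C) can be written down as a toy model. This is only a
sketch under stated assumptions: "struct toy_pte" and "struct toy_folio"
are hypothetical two-bool stand-ins (real PTE layouts are per-arch), the
toy_* helpers are made-up names rather than existing kernel functions, and
the "keep_exclusive" parameter is invented here to distinguish a plain
write-protect from one that shares the page (fork(), KSM).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; real PTEs are architecture-specific bit fields. */
struct toy_pte {
	bool write;
	bool dirty;
};

struct toy_folio {
	bool dirty;
};

/* writable -> PAE; !writable && dirty -> PAE; !writable && !dirty -> !PAE */
static bool toy_pte_anon_exclusive(struct toy_pte pte)
{
	return pte.write || pte.dirty;
}

/*
 * Write-protect a PTE. The "real" dirty information moves to the folio,
 * because afterwards the PTE dirty bit only encodes PAE.
 */
static struct toy_pte toy_pte_wrprotect(struct toy_pte pte,
					struct toy_folio *folio,
					bool keep_exclusive)
{
	if (pte.dirty)
		folio->dirty = true;
	pte.write = false;
	pte.dirty = keep_exclusive;
	return pte;
}

/* Remap writable; only legal for PAE. Dirty is restored from the folio. */
static struct toy_pte toy_pte_mkwrite(struct toy_pte pte,
				      const struct toy_folio *folio)
{
	assert(toy_pte_anon_exclusive(pte));
	pte.write = true;
	pte.dirty = folio->dirty;
	return pte;
}
```

The model also shows where the audit work comes from: any path that sets
or clears the dirty bit on a write-protected PTE without going through
helpers like these would silently flip PAE.
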
> 
> Any thoughts or other ideas?
> 


-- 
Cheers,

David / dhildenb

Thread overview: 10+ messages
2024-12-11 11:55 David Hildenbrand
2024-12-11 11:56 ` David Hildenbrand [this message]
2024-12-11 13:42   ` Kirill A. Shutemov
2024-12-11 13:48     ` David Hildenbrand
2024-12-11 14:25   ` Ryan Roberts
2024-12-11 14:49     ` David Hildenbrand
2024-12-11 15:45       ` Matthew Wilcox
2024-12-11 15:50         ` David Hildenbrand
2024-12-11 16:11           ` Matthew Wilcox
2024-12-11 16:15             ` David Hildenbrand
