From: David Hildenbrand <david@redhat.com>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: willy@linux.intel.com, Ryan Roberts <ryan.roberts@arm.com>
Subject: The future of PageAnonExclusive
Date: Wed, 11 Dec 2024 12:55:22 +0100
Message-ID: <9c2a17af-4df8-42f0-93c8-83133b6104fd@redhat.com>
Hi,
PageAnonExclusive (PAE) is working very reliably at this point. But
especially in the context of THPs (large folios) we'd like to do better:
(1) For PTE-mapped THP, we have to maintain it per page. We'd like to
avoid per-page flags as much as possible (e.g., waste in "struct
page", touching many cachelines).
(2) We currently have to use atomics to set/clear the flag, even when
working on tail pages. While there would be ways to mitigate this
when modifying the flags of multiple tail pages (bitlock protecting
all tail page flag updates), I'd much rather avoid messing with
page tail flags at all.
In general, the PAE bit can be considered an extended PTE bit that we
currently store in the "struct page" that is mapped by the PTE. Ideally,
we'd just store that information in the PTE, or alongside the PTE:
A writable PTE implies PAE. A write-protected PTE needs additional
information whether it is PAE (-> whether we can just remap it writable,
FOLL_FORCE to it, PIN it ...).
We are out of PTE bits, especially when having to implement it across
*all* architectures. That's one of the reasons we went with PAE back
then. As a nice side-effect it allowed for sanity checks when unpinning
folios (-> PAE must still be set, which is impossible when the
information is stored in the PTE).
There are 3 main approaches I've been looking into:
(A) Make it a per-folio flag. I've spent endless hours trying to get it
conceptually right, but it's just a big pain: as soon as we clear
the flag, we have to make sure that all PTEs are write-protected,
that the folio is not pinned, and that concurrent GUP cannot work.
So far the page table lock protected the PAE bit, but with a per-
folio flag that is not guaranteed for THPs.
fork() with things like VM_DONTCOPY, VM_DONTFORK, early-abort, page
migration/swapout that can happen any time during fork etc. make
this really tough to get right with THPs. My head hurts any time I
think about it.
While I think fork() itself can be handled, the concurrent page
migration / swapout is where it gets extremely tricky.
This can be done somehow I'm sure, but the devil is in the corner
cases when having multiple PTEs mapping a large folio. We'd still
need atomics to set/clear the single folio flag, because of the
nature of concurrent folio flag updates.
(B) Allocate additional metadata (PAE bitmap) for page tables, protected
by the PTL. That is, when we want to *clear* PAE (fork(), KSM), we'd
lazily allocate the bitmap and store it for our page table.
On x86: 512 PTEs -> 512 bits -> 64 bytes
fork() gets a bit more expensive, because we'd have to allocate this
bitmap for the parent and the child page table we are working on, so
we can mark the pages as "!PAE" in both page tables.
This could work, I have not prototyped it. We'd have to support it
on the PTE/PMD/PUD-table level.
One tricky thing is having multiple pagetables per page, but I
assume it can be handled (we should have a single PT lock for all of
them IIRC, and only need to address the bitmap at the right offset).
Another challenge is how to link to this metadata from ptdesc on all
archs. So far, __page_mapping is unused, and could maybe be used to
link to such metadata -- at least page tables can be identified
reliably using the page type.
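
To make the bookkeeping in (B) a bit more concrete, here is a minimal
user-space sketch of a lazily allocated per-page-table bitmap. It is not
kernel code: every toy_* name is hypothetical, the bit polarity (bit set
means !PAE) and the "no bitmap means everything is PAE" default are
assumptions, and the PTL that would protect the bitmap is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 512 entries per page table, as in the x86 example. */
#define TOY_PTRS_PER_PTE 512
#define TOY_BITMAP_WORDS (TOY_PTRS_PER_PTE / 64) /* 8 words = 64 bytes */

/* Hypothetical per-page-table metadata; the real link from ptdesc is open. */
struct toy_pt_meta {
	uint64_t *not_pae; /* NULL until PAE is cleared once; bit set => !PAE */
};

/* Clear PAE for entry idx, allocating the bitmap on first use. */
static inline bool toy_pt_clear_pae(struct toy_pt_meta *meta, unsigned int idx)
{
	assert(idx < TOY_PTRS_PER_PTE);
	if (!meta->not_pae) {
		meta->not_pae = calloc(TOY_BITMAP_WORDS, sizeof(uint64_t));
		if (!meta->not_pae)
			return false; /* caller has to handle allocation failure */
	}
	meta->not_pae[idx / 64] |= UINT64_C(1) << (idx % 64);
	return true;
}

/* Mark entry idx as PAE again; nothing to do if no bitmap exists. */
static inline void toy_pt_set_pae(struct toy_pt_meta *meta, unsigned int idx)
{
	assert(idx < TOY_PTRS_PER_PTE);
	if (meta->not_pae)
		meta->not_pae[idx / 64] &= ~(UINT64_C(1) << (idx % 64));
}

/* Without a bitmap, every entry is treated as PAE (assumption). */
static inline bool toy_pt_test_pae(const struct toy_pt_meta *meta,
				   unsigned int idx)
{
	assert(idx < TOY_PTRS_PER_PTE);
	if (!meta->not_pae)
		return true;
	return !(meta->not_pae[idx / 64] & (UINT64_C(1) << (idx % 64)));
}
```

In this model, fork() would call toy_pt_clear_pae() on both the parent
and the child table, which is where the extra allocation cost mentioned
above would come from.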
(C) Encode it in the PTE.
pte_write() -> PAE
!pte_write() && pte_dirty() -> PAE
!pte_write() && !pte_dirty() -> !PAE
That implies that when wrprotecting a PTE, we'd have to move the
dirty bit to the folio. When wr-unprotecting it, we could mark the
PTE dirty if the folio is dirty.
I suspect that most anon folios are dirty most of the time either
way, and the common case of having them just writable in the PTE
wouldn't change.
The main idea is that nobody (including HW) should ever be marking a
readonly PTE dirty (that is the theory, at least). We have to take good
care whenever we modify/query the dirty bit or modify the writable
bit.
There is quite a bit of code to audit/sanitize. Further, we'd have to
decouple softdirty PTE handling from dirty PTE handling (pte_mkdirty
sets the pte softdirty), and adjust arm64 cont-pte and similar PTE
batching code to respect the per-PTE dirty bit when the PTE is
write-protected.
This would be the most elegant solution, but requires a bit of care
+ sanity checks.
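
The encoding in (C) can be sketched as a small user-space model. This is
only an illustration: the toy_* helpers and the bit positions are
hypothetical (real PTE layouts are per-architecture), the folio is
reduced to a single dirty flag, and only the "drop exclusivity" flavour
of write-protection (as fork() would need) is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions for a toy PTE. */
#define TOY_PTE_WRITE (UINT64_C(1) << 1)
#define TOY_PTE_DIRTY (UINT64_C(1) << 6)

/* Stand-in for the folio: only its dirty state matters here. */
struct toy_folio {
	bool dirty;
};

static inline bool toy_pte_write(uint64_t pte)
{
	return (pte & TOY_PTE_WRITE) != 0;
}

static inline bool toy_pte_dirty(uint64_t pte)
{
	return (pte & TOY_PTE_DIRTY) != 0;
}

/*
 * PAE derived from the PTE alone:
 *   writable                 -> PAE
 *   write-protected + dirty  -> PAE
 *   write-protected + clean  -> !PAE
 */
static inline bool toy_pte_anon_exclusive(uint64_t pte)
{
	return toy_pte_write(pte) || toy_pte_dirty(pte);
}

/* Write-protect and drop exclusivity: the dirty bit moves to the folio. */
static inline uint64_t toy_pte_wrprotect_shared(uint64_t pte,
						struct toy_folio *folio)
{
	if (toy_pte_dirty(pte))
		folio->dirty = true;
	return pte & ~(TOY_PTE_WRITE | TOY_PTE_DIRTY);
}

/* Make writable again; restore the PTE dirty bit if the folio is dirty. */
static inline uint64_t toy_pte_mkwrite(uint64_t pte,
				       const struct toy_folio *folio)
{
	pte |= TOY_PTE_WRITE;
	if (folio->dirty)
		pte |= TOY_PTE_DIRTY;
	return pte;
}
```

The model only holds if nothing ever sets TOY_PTE_DIRTY on a
write-protected PTE for any other reason, which is exactly the
assumption that would need auditing.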
Any thoughts or other ideas?
--
Cheers,
David / dhildenb