From: Vlastimil Babka <vbabka@suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Matthew Wilcox <willy@infradead.org>,
"Paul E . McKenney" <paulmck@kernel.org>,
Jann Horn <jannh@google.com>,
David Hildenbrand <david@redhat.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Muchun Song <muchun.song@linux.dev>,
Richard Henderson <richard.henderson@linaro.org>,
Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
Matt Turner <mattst88@gmail.com>,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
"James E . J . Bottomley" <James.Bottomley@HansenPartnership.com>,
Helge Deller <deller@gmx.de>, Chris Zankel <chris@zankel.net>,
Max Filippov <jcmvbkbc@gmail.com>, Arnd Bergmann <arnd@arndb.de>,
linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org,
Shuah Khan <shuah@kernel.org>,
Christian Brauner <brauner@kernel.org>,
linux-kselftest@vger.kernel.org,
Sidhartha Kumar <sidhartha.kumar@oracle.com>,
Jeff Xu <jeffxu@chromium.org>,
Christoph Hellwig <hch@infradead.org>,
Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH 0/4] implement lightweight guard pages
Date: Fri, 18 Oct 2024 18:10:37 +0200
Message-ID: <e4985328-dbfa-4c47-9cf9-12aa89ba9798@suse.cz>
In-Reply-To: <cover.1729196871.git.lorenzo.stoakes@oracle.com>
+CC linux-api (which should also be CC'd on future revisions)
On 10/17/24 22:42, Lorenzo Stoakes wrote:
> Userland library functions such as allocators and threading implementations
> often require regions of memory to act as 'guard pages' - mappings which,
> when accessed, result in a fatal signal being sent to the accessing
> process.
>
> The current means by which these are implemented is via a PROT_NONE mmap()
> mapping, which provides the required semantics but incurs the overhead of
> a VMA for each such region.
>
> With a great many processes and threads, this can rapidly add up and incur
> a significant memory penalty. It also has the added problem of preventing
> merges that might otherwise be permitted.
>
> This series takes a different approach - an idea suggested by Vlastimil
> Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> provenance becomes a little tricky to ascertain after this - please forgive
> any omissions!) - rather than locating the guard pages at the VMA layer,
> instead placing them in page tables mapping the required ranges.
>
> Early testing of the prototype version of this code suggests a 5 times
> speed up in memory mapping invocations (in conjunction with use of
> process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> system and unoptimised code.
>
> We expect with optimisation and a loaded system with a larger number of
> guard pages this could significantly increase, but in any case these
> numbers are encouraging.
>
> This way, rather than having separate VMAs specifying which parts of a
> range are guard pages, instead we have a VMA spanning the entire range of
> memory a user is permitted to access and including ranges which are to be
> 'guarded'.
>
> After mapping this, a user can specify which parts of the range should
> result in a fatal signal when accessed.
>
> By restricting the ability to specify guard pages to memory mapped by
> existing VMAs, we can rely on the mappings being torn down when the
> mappings are ultimately unmapped and everything works simply as if the
> memory were not faulted in, from the point of view of the containing VMAs.
>
> This mechanism in effect poisons memory ranges similar to hardware memory
> poisoning, only it is an entirely software-controlled form of poisoning.
>
> Any poisoned region of memory can also be 'unpoisoned', that is, have its
> poison markers removed.
>
> The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
> which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
> poisoning.
>
> Poisoning can be performed across multiple VMAs and any existing mappings
> will be cleared, that is zapped, before installing the poisoned page table
> mappings.
>
> There is no concept of 'nested' poisoning; multiple attempts to poison a
> range will, after the first poisoning, have no effect.
>
> Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> memory, so a user can safely unpoison a range of memory and clear only
> poison page table mappings leaving the rest intact.
>
> The actual mechanism by which the page table entries are specified makes
> use of existing logic - PTE markers, which are used for the userfaultfd
> UFFDIO_POISON mechanism.
>
> Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> logic to handle it.
>
> We also extend the generic page walk mechanism to allow for installation of
> PTEs (carefully restricted to memory management logic only to prevent
> unwanted abuse).
>
> We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> specified for a VMA which implies a total removal of memory
> characteristics).
>
> It's important to note that the guard page implementation is emphatically
> NOT a security feature, so a user can remove the poisoning if they wish. We
> simply implement it in such a way as to provide the least surprising
> behaviour.
>
> An extensive set of self-tests is provided which ensures behaviour is as
> expected and additionally self-documents the expected behaviour of poisoned
> ranges.
>
> Suggested-by: Vlastimil Babka <vbabka@suze.cz>
Please fix the domain typo (also in patch 3 :)
Thanks for implementing this,
Vlastimil
> Suggested-by: Jann Horn <jannh@google.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
>
> v1
> * Un-RFC'd as appears no major objections to approach but rather debate on
> implementation.
> * Fixed issue with arches which need mmu_context.h and
> tlbflush.h header imports in pagewalker logic to be able to use
> update_mmu_cache() as reported by the kernel test bot.
> * Added comments in page walker logic to clarify who can use
> ops->install_pte and why as well as adding a check_ops_valid() helper
> function, as suggested by Christoph.
> * Pass false in full parameter in pte_clear_not_present_full() as suggested
> by Jann.
> * Stopped erroneously requiring a write lock for the poison operation as
> suggested by Jann and Suren.
> * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
> consistent with how this is used elsewhere in the kernel as suggested by
> Jann.
> * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
> and duck out if a fatal signal is pending or a conditional reschedule is
> needed, as suggested by Jann.
> * Avoid needlessly splitting huge PUDs and PMDs by specifying
> ACTION_CONTINUE, as suggested by Jann.
>
> RFC
> https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
>
> Lorenzo Stoakes (4):
> mm: pagewalk: add the ability to install PTEs
> mm: add PTE_MARKER_GUARD PTE marker
> mm: madvise: implement lightweight guard page mechanism
> selftests/mm: add self tests for guard page feature
>
> arch/alpha/include/uapi/asm/mman.h | 3 +
> arch/mips/include/uapi/asm/mman.h | 3 +
> arch/parisc/include/uapi/asm/mman.h | 3 +
> arch/xtensa/include/uapi/asm/mman.h | 3 +
> include/linux/mm_inline.h | 2 +-
> include/linux/pagewalk.h | 18 +-
> include/linux/swapops.h | 26 +-
> include/uapi/asm-generic/mman-common.h | 3 +
> mm/hugetlb.c | 3 +
> mm/internal.h | 6 +
> mm/madvise.c | 168 ++++
> mm/memory.c | 18 +-
> mm/mprotect.c | 3 +-
> mm/mseal.c | 1 +
> mm/pagewalk.c | 200 ++--
> tools/testing/selftests/mm/.gitignore | 1 +
> tools/testing/selftests/mm/Makefile | 1 +
> tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> 18 files changed, 1564 insertions(+), 66 deletions(-)
> create mode 100644 tools/testing/selftests/mm/guard-pages.c
>
> --
> 2.46.2
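
For readers following the thread, here is a minimal userspace sketch of the two approaches the cover letter contrasts: the proposed madvise() guard marker and the existing PROT_NONE mapping. It is an illustration only, not code from the series. The helper name map_with_guard() is hypothetical, and the numeric value given for MADV_GUARD_POISON is an assumption about what the series adds to the uapi headers; check the patched mman-common.h before relying on it. On a kernel without the feature the madvise() call is expected to fail, and the sketch falls back to mprotect().

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed value taken from this series' uapi change; verify against the patch. */
#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102
#endif

/*
 * Hypothetical helper: map npages of usable anonymous memory followed by
 * one guard page. Returns the base address, or MAP_FAILED on error.
 * *used_madvise is set to 1 if the guard marker path succeeded and to 0
 * if the PROT_NONE fallback was used.
 */
static void *map_with_guard(size_t npages, int *used_madvise)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = (npages + 1) * page;
	char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (base == MAP_FAILED)
		return MAP_FAILED;

	/* Proposed path: guard the last page in the page tables, keeping one VMA. */
	if (madvise(base + npages * page, page, MADV_GUARD_POISON) == 0) {
		*used_madvise = 1;
		return base;
	}

	/* Existing path: PROT_NONE on the last page splits the mapping into two VMAs. */
	*used_madvise = 0;
	if (mprotect(base + npages * page, page, PROT_NONE) != 0) {
		munmap(base, len);
		return MAP_FAILED;
	}
	return base;
}
```

In either case the first npages remain readable and writable, and a touch of the final page should raise a fatal signal. The difference the series targets is only visible in /proc/self/maps, where the fallback path shows an extra VMA for the guard page.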