linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: James.Bottomley@hansenpartnership.com, Liam.Howlett@oracle.com,
	akpm@linux-foundation.org, arnd@kernel.org, brauner@kernel.org,
	chris@zankel.net, deller@gmx.de, hch@infradead.org,
	jannh@google.com, jcmvbkbc@gmail.com, jeffxu@chromium.org,
	jhubbard@nvidia.com, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mattst88@gmail.com, muchun.song@linux.dev, paulmck@kernel.org,
	richard.henderson@linaro.org, shuah@kernel.org,
	sidhartha.kumar@oracle.com, surenb@google.com,
	tsbogend@alpha.franken.de, vbabka@suse.cz, willy@infradead.org,
	criu@lists.linux.dev, Andrei Vagin <avagin@gmail.com>,
	Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Subject: Re: [PATCH v4 0/5] implement lightweight guard pages
Date: Wed, 19 Mar 2025 15:52:56 +0100
Message-ID: <278393de-2729-4ed0-822c-87f33c7ce27e@redhat.com>
In-Reply-To: <zihwmp67m2lpuxbfktmztvjdyap7suzd75dowlw4eamu6bhjf3@6euydiqowc7h>

On 19.03.25 15:50, Alexander Mikhalitsyn wrote:
> On Mon, Oct 28, 2024 at 02:13:26PM +0000, Lorenzo Stoakes wrote:
>> Userland library functions such as allocators and threading implementations
>> often require regions of memory to act as 'guard pages' - mappings which,
>> when accessed, result in a fatal signal being sent to the accessing
>> process.
>>
>> The current means by which these are implemented is via a PROT_NONE mmap()
>> mapping, which provides the required semantics but incurs the overhead of
>> a VMA for each such region.
>>
>> With a great many processes and threads, this can rapidly add up and incur
>> a significant memory penalty. It also has the added problem of preventing
>> merges that might otherwise be permitted.
>>
>> This series takes a different approach - an idea suggested by Vlastimil
>> Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
>> provenance becomes a little tricky to ascertain after this - please forgive
>> any omissions!)  - rather than locating the guard pages at the VMA layer,
>> instead placing them in page tables mapping the required ranges.
>>
>> Early testing of the prototype version of this code suggests a 5 times
>> speed up in memory mapping invocations (in conjunction with use of
>> process_madvise()) and a 13% reduction in VMAs on an entirely idle android
>> system and unoptimised code.
>>
>> We expect with optimisation and a loaded system with a larger number of
>> guard pages this could significantly increase, but in any case these
>> numbers are encouraging.
>>
>> This way, rather than having separate VMAs specifying which parts of a
>> range are guard pages, instead we have a VMA spanning the entire range of
>> memory a user is permitted to access and including ranges which are to be
>> 'guarded'.
>>
>> After mapping this, a user can specify which parts of the range should
>> result in a fatal signal when accessed.
>>
>> By restricting the ability to specify guard pages to memory mapped by
>> existing VMAs, we can rely on the mappings being torn down when the
>> mappings are ultimately unmapped and everything works simply as if the
>> memory were not faulted in, from the point of view of the containing VMAs.
>>
>> This mechanism in effect poisons memory ranges similar to hardware memory
>> poisoning, only it is an entirely software-controlled form of poisoning.
>>
>> The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL
>> which installs page table-level guard page markers - and
>> MADV_GUARD_REMOVE - which clears them.
>>
>> Guard markers can be installed across multiple VMAs and any existing
>> mappings will be cleared, that is zapped, before installing the guard page
>> markers in the page tables.
>>
>> There is no concept of 'nested' guard markers; multiple attempts to install
>> guard markers in a range will, after the first attempt, have no effect.
>>
>> Importantly, removing guard markers over a range that contains both guard
>> markers and ordinary backed memory has no effect on anything but the guard
>> markers (including leaving huge pages un-split), so a user can safely
>> remove guard markers over a range of memory leaving the rest intact.
>>
>> The actual mechanism by which the page table entries are specified makes
>> use of existing logic - PTE markers, which are used for the userfaultfd
>> UFFDIO_POISON mechanism.
>>
>> Unfortunately PTE_MARKER_POISONED is not suited for the guard page
>> mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
>> handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
>> logic to handle it.
>>
>> We also extend the generic page walk mechanism to allow for installation of
>> PTEs (carefully restricted to memory management logic only to prevent
>> unwanted abuse).
>>
>> We ensure that zapping performed by MADV_DONTNEED and MADV_FREE does not
>> remove guard markers, nor does forking (except when VM_WIPEONFORK is
>> specified for a VMA which implies a total removal of memory
>> characteristics).
>>
>> It's important to note that the guard page implementation is emphatically
>> NOT a security feature, so a user can remove the markers if they wish. We
>> simply implement it in such a way as to provide the least surprising
>> behaviour.
>>
>> An extensive set of self-tests is provided, which ensures behaviour is as
>> expected and additionally documents the expected behaviour of guard
>> ranges.
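
For readers who have not used the interface described in the cover letter
above, here is a minimal sketch of installing and removing a guard region
inside a single anonymous mapping. The helper names (map_region,
guard_install, guard_remove) are hypothetical and not part of the series;
the fallback values 102 and 103 are the ones from the kernel's UAPI
mman-common.h, defined here only in case the libc headers predate them.
On a kernel without the feature, madvise() is expected to fail with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values from the kernel UAPI header. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif
#ifndef MADV_GUARD_REMOVE
#define MADV_GUARD_REMOVE 103
#endif

/*
 * Hypothetical helper: map npages of private anonymous memory as one VMA.
 * Returns NULL on failure.
 */
static void *map_region(size_t npages)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, npages * psz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: mark [addr, addr + len) as a guard region.
 * No new VMA is created; only page table markers are installed.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int guard_install(void *addr, size_t len)
{
	return madvise(addr, len, MADV_GUARD_INSTALL);
}

/*
 * Hypothetical helper: clear guard markers in [addr, addr + len).
 * Ordinary backed memory in the range is left untouched.
 */
static int guard_remove(void *addr, size_t len)
{
	return madvise(addr, len, MADV_GUARD_REMOVE);
}
```

A caller would typically map, say, three pages, call guard_install() on
the middle page, and expect accesses to that page to raise a fatal signal
until guard_remove() is called on it, while the neighbouring pages stay
usable throughout.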
> 
> Dear Lorenzo,
> Dear colleagues,
> 
> sorry for reviving an old thread.
> 
> It looks like this feature is now used in glibc [1], and we noticed failures in CRIU [2]
> CI on Fedora Rawhide userspace. The question now is how we can properly detect such
> "guarded" pages from user space. As far as I can see from the MADV_GUARD_INSTALL implementation,
> it does not modify VMA flags in any way, only page tables. This means that the /proc/<pid>/maps
> and /proc/<pid>/smaps interfaces are useless in this case. (Please correct me if I'm missing
> anything here.)
> 
> I wonder if you have any ideas / suggestions regarding Checkpoint/Restore here. We (CRIU devs) are happy
> to develop some patches to bring some uAPI to expose MADV_GUARDs, but before going into this we decided
> to raise this question in LKML.


See [1] and [2]

[1] https://lkml.kernel.org/r/cover.1740139449.git.lorenzo.stoakes@oracle.com
[2] https://lwn.net/Articles/1011366/
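
As a rough illustration of what user-space detection could look like,
here is a minimal sketch that reads a /proc/self/pagemap entry and tests
one flag bit. It rests on assumptions not stated in this thread: that
guard regions are reported through a pagemap bit, and that the bit is 58
(my reading of the kernel's pagemap documentation; please verify against
your kernel version). The names PM_GUARD_REGION_BIT,
pagemap_entry_is_guard and read_pagemap_entry are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* ASSUMPTION: bit 58 of a pagemap entry flags a guard region. */
#define PM_GUARD_REGION_BIT 58

/* Returns 1 if the assumed guard-region bit is set in entry, else 0. */
static int pagemap_entry_is_guard(uint64_t entry)
{
	return (int)((entry >> PM_GUARD_REGION_BIT) & 1);
}

/*
 * Hypothetical helper: read the 64-bit pagemap entry covering addr in
 * the calling process. Each virtual page has one 8-byte entry, indexed
 * by virtual page number. Returns 0 on success, -1 on failure.
 */
static int read_pagemap_entry(const void *addr, uint64_t *entry)
{
	uintptr_t psz = (uintptr_t)sysconf(_SC_PAGESIZE);
	off_t off = (off_t)((uintptr_t)addr / psz) * (off_t)sizeof(*entry);
	ssize_t n;
	int fd;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;

	n = pread(fd, entry, sizeof(*entry), off);
	close(fd);

	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

A checkpoint tool would do the same against /proc/<pid>/pagemap of the
target process, walking each VMA page by page; on kernels that do not
report guard regions this way the bit simply reads as zero.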


-- 
Cheers,

David / dhildenb




Thread overview: 14+ messages
2024-10-28 14:13 Lorenzo Stoakes
2024-10-28 14:13 ` [PATCH v4 1/5] mm: pagewalk: add the ability to install PTEs Lorenzo Stoakes
2024-10-28 14:13 ` [PATCH v4 2/5] mm: add PTE_MARKER_GUARD PTE marker Lorenzo Stoakes
2024-10-28 14:13 ` [PATCH v4 3/5] mm: madvise: implement lightweight guard page mechanism Lorenzo Stoakes
2024-10-29 10:32   ` Vlastimil Babka
2024-10-28 14:13 ` [PATCH v4 4/5] tools: testing: update tools UAPI header for mman-common.h Lorenzo Stoakes
2024-10-28 14:13 ` [PATCH v4 5/5] selftests/mm: add self tests for guard page feature Lorenzo Stoakes
2024-10-28 18:24 ` [PATCH v4 0/5] implement lightweight guard pages SeongJae Park
2024-10-28 22:22   ` Lorenzo Stoakes
2025-03-19 14:50 ` Alexander Mikhalitsyn
2025-03-19 14:52   ` David Hildenbrand [this message]
2025-03-19 15:02     ` Lorenzo Stoakes
2025-03-19 15:15       ` Aleksandr Mikhalitsyn
2025-03-19 15:08     ` Aleksandr Mikhalitsyn
