From: David Hildenbrand <david@redhat.com>
To: Fares Mehanna <faresx@amazon.de>
Cc: akpm@linux-foundation.org, ardb@kernel.org, arnd@arndb.de,
	bhelgaas@google.com, broonie@kernel.org, catalin.marinas@arm.com,
	james.morse@arm.com, javierm@redhat.com,
	jean-philippe@linaro.org, joey.gouly@arm.com,
	kristina.martsenko@arm.com, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mark.rutland@arm.com, maz@kernel.org, mediou@amazon.de,
	nh-open-source@amazon.com, oliver.upton@linux.dev,
	ptosi@google.com, rdunlap@infradead.org, rkagan@amazon.de,
	rppt@kernel.org, shikemeng@huaweicloud.com,
	suzuki.poulose@arm.com, tabba@google.com, will@kernel.org,
	yuzenghui@huawei.com
Subject: Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it
Date: Fri, 18 Oct 2024 20:52:43 +0200	[thread overview]
Message-ID: <5f9ba14a-909b-4b49-b1de-3dc98b31aee0@redhat.com> (raw)
In-Reply-To: <20241011142547.24447-1-faresx@amazon.de>

On 11.10.24 16:25, Fares Mehanna wrote:
>>>
>>>
>>>> On 11. Oct 2024, at 14:36, Mediouni, Mohamed <mediou@amazon.de> wrote:
>>>>
>>>>
>>>>
>>>>> On 11. Oct 2024, at 14:04, David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 10.10.24 17:52, Fares Mehanna wrote:
>>>>>>>> In a series posted a few years ago [1], a proposal was put forward to allow the
>>>>>>>> kernel to allocate memory local to a mm and thus push it out of reach for
>>>>>>>> current and future speculation-based cross-process attacks.  We still believe
>>>>>>>> this is a nice thing to have.
>>>>>>>>
>>>>>>>> However, in the time passed since that post Linux mm has grown quite a few new
>>>>>>>> goodies, so we'd like to explore possibilities to implement this functionality
>>>>>>>> with less effort and churn leveraging the now available facilities.
>>>>>>>>
>>>>>>>> An RFC was posted a few months back [2] to show the proof of concept and a simple
>>>>>>>> test driver.
>>>>>>>>
>>>>>>>> In this RFC, we're using the same approach of implementing mm-local allocations
>>>>>>>> piggy-backing on memfd_secret(), using regular user addresses but pinning the
>>>>>>>> pages and flipping the user/supervisor flag on the respective PTEs to make them
>>>>>>>> directly accessible from the kernel.
>>>>>>>> In addition to that we are submitting 5 patches to use the secret memory to hide
>>>>>>>> the vCPU gp-regs and fp-regs on arm64 VHE systems.
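
A minimal sketch of the "flip the user/supervisor flag" idea described
above, x86-flavoured for concreteness (apply_to_page_range() and
pte_clear_flags() are existing helpers; the rest is illustrative and not
code from the series):

/*
 * Sketch: walk the PTEs backing a pinned secretmem range in a user VMA
 * and clear _PAGE_USER, so the addresses are only dereferenceable from
 * kernel mode -- and only while this mm is loaded.
 */
static int pte_make_supervisor(pte_t *ptep, unsigned long addr, void *data)
{
	struct mm_struct *mm = data;

	set_pte_at(mm, addr, ptep,
		   pte_clear_flags(ptep_get(ptep), _PAGE_USER));
	return 0;
}

/* Caller, with the pages faulted in and mmap_write_lock(mm) held:
 *
 *	apply_to_page_range(mm, start, len, pte_make_supervisor, mm);
 *	flush_tlb_mm(mm);
 */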
>>>>>>>
>>>>>>> I'm a bit lost on what exactly we want to achieve. The point where we
>>>>>>> start flipping user/supervisor flags confuses me :)
>>>>>>>
>>>>>>> With secretmem, you'd get memory allocated that
>>>>>>> (a) Is accessible by user space -- mapped into user space.
>>>>>>> (b) Is inaccessible by kernel space -- not mapped into the direct map.
>>>>>>> (c) GUP will fail, but copy_from / copy_to user will work.
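
For background, the normal secretmem flow from user space looks roughly
like this (a sketch assuming CONFIG_SECRETMEM, and IIRC the
secretmem.enable= boot parameter; memfd_secret() has no glibc wrapper,
hence the raw syscall, and SYS_memfd_secret needs reasonably recent
headers):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long len = sysconf(_SC_PAGESIZE);
	/* (a): the fd can be mmap()ed into user space like any memfd */
	int fd = syscall(SYS_memfd_secret, 0);

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* (b): on fault, the backing pages leave the direct map */
	p[0] = 42;
	return 0;
}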
>>>>>>>
>>>>>>>
>>>>>>> Another way, without secretmem, would be to consider these "secrets"
>>>>>>> kernel allocations that can be mapped into user space using mmap() of a
>>>>>>> special fd. That is, they wouldn't have their origin in secretmem, but
>>>>>>> in KVM as a kernel allocation. It could be achieved by using VM_MIXEDMAP
>>>>>>> with vm_insert_pages(), manually removing them from the directmap.
>>>>>>>
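Roughly, that alternative could look like the sketch below. Assumptions:
the fd's backing pages are preallocated and stashed in
file->private_data; set_direct_map_invalid_noflush() is the helper
secretmem itself uses; TLB flushing of the direct map is omitted for
brevity:

/*
 * Sketch of the VM_MIXEDMAP alternative: kernel-owned pages, pulled out
 * of the direct map and inserted into the user VMA at mmap() time.
 */
static int secret_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct page **pages = file->private_data;	/* assumed layout */
	unsigned long num = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
	unsigned long i;

	for (i = 0; i < num; i++)
		set_direct_map_invalid_noflush(pages[i]);

	vm_flags_set(vma, VM_MIXEDMAP);
	return vm_insert_pages(vma, vma->vm_start, pages, &num);
}
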
>>>>>>> But, I am not sure who is supposed to access what. Let's explore the
>>>>>>> requirements. I assume we want:
>>>>>>>
>>>>>>> (a) Pages accessible by user space -- mapped into user space.
>>>>>>> (b) Pages inaccessible by kernel space -- not mapped into the direct map
>>>>>>> (c) GUP to fail (no direct map).
>>>>>>> (d) copy_from / copy_to user to fail?
>>>>>>>
>>>>>>> And on top of that, some way to access these pages on demand from kernel
>>>>>>> space? (temporary CPU-local mapping?)
>>>>>>>
>>>>>>> Or how would the kernel make use of these allocations?
>>>>>>>
>>>>>>> -- 
>>>>>>> Cheers,
>>>>>>>
>>>>>>> David / dhildenb
>>>>>> Hi David,
>>>>>
>>>>> Hi Fares!
>>>>>
>>>>>> Thanks for taking a look at the patches!
>>>>>> We're trying to allocate kernel memory that is accessible to the kernel,
>>>>>> but only while the context of the owning process is loaded.
>>>>>> So this is kernel memory that is not needed to operate the kernel itself;
>>>>>> it is for storing and processing data on behalf of a process. The
>>>>>> requirement for this memory is that it is never touched unless the
>>>>>> process is scheduled on this core; any other access will crash the kernel.
>>>>>> So this memory should be directly readable and writable by the kernel,
>>>>>> but only while the process context is loaded. The memory shouldn't be
>>>>>> readable or writable by the owner process at all.
>>>>>> This is basically done by removing those pages from the kernel linear
>>>>>> mapping and attaching them only to the process mm_struct. During a
>>>>>> context switch, the kernel thus loses access to the secret memory of the
>>>>>> process being scheduled out and gains access to that of the process
>>>>>> being scheduled in.
>>>>>> This generally protects against speculation attacks, and against another
>>>>>> process tricking the kernel into leaking data from memory: the kernel
>>>>>> will crash if it tries to access another process's secret memory.
>>>>>> Since this memory is special, in the sense that it is kernel memory that
>>>>>> only makes sense in the context of the owner process, this patch series
>>>>>> explores reusing memfd_secret() to allocate it in the user virtual
>>>>>> address space, manage it in a VMA, and flip the permissions, while
>>>>>> keeping control of the mapping exclusively with the kernel.
>>>>>> Right now it is:
>>>>>> Right now it is:
>>>>>> (a) Pages not accessible by user space -- even though they are mapped into user
>>>>>>      space, the PTEs are marked for kernel usage.
>>>>>
>>>>> Ah, that is the detail I was missing, now I see what you are trying to achieve, thanks!
>>>>>
>>>>> It is a bit architecture specific, because ... imagine architectures that
>>>>> have separate kernel+user space page table hierarchies, and not a simple
>>>>> PTE flag to change access permissions between kernel/user space.
>>>>>
>>>>> IIRC s390 is one such architecture that uses separate page tables for the user-space + kernel-space portions.
>>>>>
>>>>>> (b) Pages accessible by kernel space -- even though they are not mapped into the
>>>>>>      direct map, the PTEs in uvaddr are marked for kernel usage.
>>>>>> (c) copy_from / copy_to user won't fail -- because it is in the user
>>>>>>      range. This can be fixed by allocating a specific range of user
>>>>>>      vaddr space for this feature and checking against that range there
>>>>>>      (see the sketch after this list).
>>>>>> (d) The secret memory vaddr is guessable by the owner process -- that
>>>>>>      can also be fixed by allocating a bigger chunk of user vaddr space
>>>>>>      for this feature and randomly placing the secret memory within it.
>>>>>> (e) The mapping is off-limits to the owner process: the VMA is marked
>>>>>>      locked, sealed and special.
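
The range check in (c) could be as simple as the following, with
MM_SECRET_START/MM_SECRET_END standing in as hypothetical names for
wherever the carved-out user-vaddr window ends up being defined:

/* Hypothetical helper for (c): refuse uaccess into the reserved window. */
static inline bool uaddr_in_mm_secret_range(unsigned long uaddr, size_t len)
{
	return uaddr < MM_SECRET_END && uaddr + len > MM_SECRET_START;
}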
>>>>>
>>>>> Okay, so in this RFC you are jumping through quite some hoops to have a
>>>>> kernel allocation unmapped from the direct map but mapped into a
>>>>> per-process page table only accessible by kernel space. :)
>>>>>
>>>>> So you really don't want this mapped into user space at all (consequently,
>>>>> no GUP, no access, no copy_from_user ...). In this RFC it's mapped but
>>>>> turned inaccessible by flipping the "kernel vs. user" switch.
>>>>>
>>>>>> Another alternative (implemented in the first submission) is to track
>>>>>> these allocations in a non-shared kernel PGD per process, then handle
>>>>>> creating, forking and context-switching that PGD.
>>>>>
>>>>> That sounds like a better approach. So we would remove the pages from the
>>>>> shared kernel direct map and map them into a separate kernel portion in
>>>>> the per-MM page tables?
>>>>>
>>>>> Can you envision that this would also work with architectures like s390x?
>>>>> I assume we would not only need the per-MM user space page table
>>>>> hierarchy, but also a per-MM kernel space page table hierarchy, into which
>>>>> we also map the common/shared-among-all-processes kernel space page tables
>>>>> (e.g., directmap).
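
For context on why that is extra work: today every mm shares the kernel
half of its top-level table. On x86, for instance, pgd allocation simply
copies the kernel entries from swapper_pg_dir (simplified, and quoted
from memory):

/* Simplified from x86's pgd_ctor(): a new mm inherits the shared
 * kernel mappings by copying the kernel half of the top-level entries. */
clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
		swapper_pg_dir + KERNEL_PGD_BOUNDARY,
		KERNEL_PGD_PTRS);

A non-shared kernel PGD means this sharing becomes per-MM, and each
architecture with split user/kernel hierarchies (s390, or arm64 with its
TTBR0/TTBR1 split) would presumably need its own variant of it.
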
>>>> Yes, that’s also applicable to arm64. There’s currently no separate per-mm user space page hierarchy there.
>>> typo, read kernel
>>
>>
>> Okay, thanks. So going into that direction makes more sense.
>>
>> I do wonder if we really have to deal with fork() ... if the primary
>> users don't really have meaning in the forked child (e.g., just like
>> fork() with KVM IIRC) we might just get away with "losing" these
>> allocations in the child process.
>>
>> Happy to learn why fork() must be supported.
> 
> It really depends on the use cases of the kernel secret allocation, but a
> troubling scenario comes to mind:
> 1. Process A had a resource X.
> 2. Kernel decided to keep some data related to resource X in process A secret
>     memory.
> 3. Process A decided to fork; now process B shares the resource X.
> 4. Process B started using resource X. <-- This will crash the kernel, as
>     the kernel page table in use by process B has no mapping for the secret
>     memory backing resource X.
> 
> I haven't tried to trigger this crash myself though.
> 

Right, and if we can rule out any users that are supposed to work after 
fork(), we can just disregard that in the first version.
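
There is even existing machinery for "losing" a VMA across fork():
VM_DONTCOPY (what MADV_DONTFORK sets) makes dup_mmap() skip the VMA
entirely, so the child never sees the range. From memory, kernel/fork.c
does roughly:

/* dup_mmap(), simplified: VM_DONTCOPY VMAs are not duplicated
 * into the child's mm. */
if (mpnt->vm_flags & VM_DONTCOPY) {
	vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
	continue;
}

Since the secret VMA here is fully kernel-controlled, unconditionally
setting VM_DONTCOPY on it might be all a first version needs.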

I never played with this, but let's assume you make use of these 
mm-local allocations in KVM context.

What would happen if you fork() with a KVM fd and try accessing that fd 
from the other process using ioctls? I recall that KVM will not be 
"duplicated".

What would happen if you send that fd over to a completely different 
process and try accessing that fd from the other process using ioctls?

Of course, question being: if you have MM-local allocations in both 
cases and there is suddenly a different MM ... assuming that both cases 
are even possible (if they are not possible, great! :) ).

I think I am supposed to know if these things are possible or not and 
what would happen, but it's late Friday and my brain is begging for some 
Weekend :D
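
If I remember correctly, both cases are rejected today: the VM/vCPU ioctl
paths refuse callers whose mm differs from the one the VM was created
with, roughly via this check in virt/kvm/kvm_main.c (quoted from memory,
worth double-checking):

/* kvm_vm_ioctl() / kvm_vcpu_ioctl(), from memory: bail out when
 * called from a different mm than the creating process. */
if (kvm->mm != current->mm || kvm->vm_dead)
	return -EIO;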

> I haven't thought in depth about this issue yet, but I need to, because
> duplicating the secret memory mappings in the newly forked process is easy
> (to give the kernel access to the secret memory), while tearing them down
> across all forked processes is a bit complicated (to clean up stale
> mappings in parent/child processes). Right now, tearing down the mapping
> only happens on the mm_struct that allocated the secret memory.

If an allocation is MM-local, I would assume that fork() would 
*duplicate* that allocation (leaving CoW out of the picture :D ), but 
that's where the fun begins (see above regarding my confusion about KVM 
and fork() behavior ... ).

-- 
Cheers,

David / dhildenb


