Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it

From: David Hildenbrand <david@redhat.com>
To: Fares Mehanna <faresx@amazon.de>
Cc: akpm@linux-foundation.org, ardb@kernel.org, arnd@arndb.de,
	bhelgaas@google.com, broonie@kernel.org, catalin.marinas@arm.com,
	james.morse@arm.com, javierm@redhat.com,
	jean-philippe@linaro.org, joey.gouly@arm.com,
	kristina.martsenko@arm.com, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mark.rutland@arm.com, maz@kernel.org, nh-open-source@amazon.com,
	oliver.upton@linux.dev, ptosi@google.com, rdunlap@infradead.org,
	rkagan@amazon.de, rppt@kernel.org, shikemeng@huaweicloud.com,
	suzuki.poulose@arm.com, tabba@google.com, will@kernel.org,
	yuzenghui@huawei.com
Subject: Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it
Date: Fri, 11 Oct 2024 14:04:12 +0200
Message-ID: <465ce78b-d023-40e6-b066-5e4a01e266b6@redhat.com>
In-Reply-To: <20241010155210.13321-1-faresx@amazon.de>

On 10.10.24 17:52, Fares Mehanna wrote:
>>> In a series posted a few years ago [1], a proposal was put forward to allow the
>>> kernel to allocate memory local to a mm and thus push it out of reach for
>>> current and future speculation-based cross-process attacks.  We still believe
>>> this is a nice thing to have.
>>>
>>> However, in the time passed since that post Linux mm has grown quite a few new
>>> goodies, so we'd like to explore possibilities to implement this functionality
>>> with less effort and churn leveraging the now available facilities.
>>>
>>> An RFC was posted a few months back [2] to show the proof of concept and a
>>> simple test driver.
>>>
>>> In this RFC, we're using the same approach of implementing mm-local allocations
>>> piggy-backing on memfd_secret(), using regular user addresses but pinning the
>>> pages and flipping the user/supervisor flag on the respective PTEs to make them
>>> directly accessible from kernel.
>>> In addition to that we are submitting 5 patches to use the secret memory to hide
>>> the vCPU gp-regs and fp-regs on arm64 VHE systems.
>>
>> I'm a bit lost on what exactly we want to achieve. The point where we
>> start flipping user/supervisor flags confuses me :)
>>
>> With secretmem, you'd get memory allocated that
>> (a) Is accessible by user space -- mapped into user space.
>> (b) Is inaccessible by kernel space -- not mapped into the direct map
>> (c) GUP will fail, but copy_from / copy_to user will work.
>>
>>
>> Another way, without secretmem, would be to consider these "secrets"
>> kernel allocations that can be mapped into user space using mmap() of a
>> special fd. That is, they wouldn't have their origin in secretmem, but
>> in KVM as a kernel allocation. It could be achieved by using VM_MIXEDMAP
>> with vm_insert_pages(), manually removing them from the directmap.
>>
>> But, I am not sure who is supposed to access what. Let's explore the
>> requirements. I assume we want:
>>
>> (a) Pages accessible by user space -- mapped into user space.
>> (b) Pages inaccessible by kernel space -- not mapped into the direct map
>> (c) GUP to fail (no direct map).
>> (d) copy_from / copy_to user to fail?
>>
>> And on top of that, some way to access these pages on demand from kernel
>> space? (temporary CPU-local mapping?)
>>
>> Or how would the kernel make use of these allocations?
>>
>> -- 
>> Cheers,
>>
>> David / dhildenb
> 
> Hi David,

Hi Fares!

> 
> Thanks for taking a look at the patches!
> 
> We're trying to allocate kernel memory that is accessible to the kernel, but
> only when the context of the owning process is loaded.
> 
> So this is kernel memory that is not needed to operate the kernel itself; it
> is used to store & process data on behalf of a process. The requirement for
> this memory is that it is never touched unless the process is scheduled on
> this core. Any other access will crash the kernel.
> 
> So this memory should only be directly readable and writable by the kernel, but
> only when the process context is loaded. The memory shouldn't be readable or
> writable by the owner process at all.
> 
> This is basically done by removing those pages from the kernel linear mapping
> and attaching them only to the process's mm_struct. So during a context switch
> the kernel loses access to the secret memory of the process being scheduled out
> and gains access to the secret memory of the new process.
> 
> This generally protects against speculation attacks, and against another
> process managing to trick the kernel into leaking data from memory. In that
> case the kernel will crash if it tries to access another process's secret memory.
> 
> Since this memory is special in the sense that it is kernel memory but only
> makes sense in the context of the owner process, I tried in this patch series
> to explore the possibility of reusing memfd_secret() to allocate this memory in
> the user virtual address space, manage it in a VMA, and flip the permissions
> while keeping control of the mapping exclusively with the kernel.
> 
> Right now it is:
> (a) Pages not accessible by user space -- even though they are mapped into user
>      space, the PTEs are marked for kernel usage.

Ah, that is the detail I was missing, now I see what you are trying to 
achieve, thanks!

It is a bit architecture specific, because ... imagine architectures 
that have separate kernel+user space page table hierarchies, and not a 
simple PTE flag to change access permissions between kernel/user space.

IIRC s390 is one such architecture that uses separate page tables for 
the user-space + kernel-space portions.

> (b) Pages accessible by kernel space -- even though they are not mapped into the
>      direct map, the PTEs in uvaddr are marked for kernel usage.
> (c) copy_from / copy_to user won't fail -- because it is in the user range, but
>      this can be fixed by allocating a specific range in user vaddr for this
>      feature and checking against this range there.
> (d) The secret memory vaddr is guessable by the owner process -- that can also
>      be fixed by allocating a bigger chunk of user vaddr for this feature and
>      randomly placing the secret memory there.
> (e) Mapping is off-limits to the owner process by marking the VMA as locked,
>      sealed and special.

Okay, so in this RFC you are jumping through quite some hoops to have a 
kernel allocation unmapped from the direct map but mapped into a 
per-process page table only accessible by kernel space. :)

So you really don't want this mapped into user space at all 
(consequently, no GUP, no access, no copy_from_user ...). In this RFC 
it's mapped but turned inaccessible by flipping the "kernel vs. user" 
switch.
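
To make sure I describe the RFC's behavior correctly, here is a toy 
userspace model of that "kernel vs. user" switch. It is only a sketch: 
the TOY_* bits and the toy_* helpers are made up for illustration, they 
are not any architecture's real PTE layout, and the model ignores 
SMAP/PAN-style restrictions on kernel accesses to user pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE bits for this toy model; not a real PTE layout. */
#define TOY_PTE_PRESENT (1u << 0)
#define TOY_PTE_USER    (1u << 1)

enum toy_mode { TOY_MODE_USER, TOY_MODE_KERNEL };

/* Returns true if an access made in 'mode' through 'pte' would succeed. */
static bool toy_access_ok(uint32_t pte, enum toy_mode mode)
{
	if (!(pte & TOY_PTE_PRESENT))
		return false;
	if (mode == TOY_MODE_USER)
		return (pte & TOY_PTE_USER) != 0;
	/* Kernel mode: any present mapping is reachable in this model. */
	return true;
}

/*
 * Turn a present user mapping into a kernel-only one by clearing the
 * user bit. The virtual address stays in the user range; only the
 * permission changes.
 */
static uint32_t toy_make_kernel_only(uint32_t pte)
{
	assert(pte & TOY_PTE_PRESENT);
	return pte & ~TOY_PTE_USER;
}
```

In this model, a PTE passed through toy_make_kernel_only() is still 
"mapped" at a user address, but toy_access_ok() rejects it for 
TOY_MODE_USER and accepts it for TOY_MODE_KERNEL.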

> 
> Another alternative (that was implemented in the first submission) is to track
> those allocations in a non-shared kernel PGD per process, then handle creating,
> forking and context-switching this PGD.

That sounds like a better approach. So we would remove the pages from 
the shared kernel direct map and map them into a separate kernel-portion 
in the per-MM page tables?

Can you envision that also working with architectures like s390x? I 
assume we would not only need the per-MM user space page table 
hierarchy, but also a per-MM kernel space page table hierarchy, into 
which we also map the common/shared-among-all-processes kernel space 
page tables (e.g., directmap).
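
Roughly what I have in mind, as a toy userspace model rather than 
kernel code. Everything here is hypothetical (the toy_* names, the 
number of top-level entries, the choice of slot); it only illustrates 
"all top-level kernel entries shared, except one per-MM slot".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PGD_ENTRIES 8
#define TOY_SECRET_SLOT 7	/* hypothetical slot for mm-local mappings */

/* Top-level table; each entry points at a lower-level table (opaque here). */
struct toy_pgd {
	const void *entry[TOY_PGD_ENTRIES];
};

/* Kernel entries shared among all processes, e.g. the direct map. */
static struct toy_pgd toy_shared_kernel_pgd;

struct toy_mm {
	struct toy_pgd kernel_pgd;	/* per-MM kernel top level */
};

/*
 * Copy the shared entries and leave the secret slot empty. A real
 * implementation would also have to propagate later changes to the
 * shared entries; this model does not.
 */
static void toy_mm_init(struct toy_mm *mm)
{
	memcpy(&mm->kernel_pgd, &toy_shared_kernel_pgd, sizeof(mm->kernel_pgd));
	mm->kernel_pgd.entry[TOY_SECRET_SLOT] = NULL;
}

/* Install an mm-local table; it is visible only through this MM. */
static void toy_map_secret(struct toy_mm *mm, const void *table)
{
	assert(mm->kernel_pgd.entry[TOY_SECRET_SLOT] == NULL);
	mm->kernel_pgd.entry[TOY_SECRET_SLOT] = table;
}

/* What the kernel can reach in 'slot' while 'active' is the loaded MM. */
static const void *toy_lookup(const struct toy_mm *active, size_t slot)
{
	assert(slot < TOY_PGD_ENTRIES);
	return active->kernel_pgd.entry[slot];
}
```

With two MMs initialized from the same shared table, both resolve the 
shared slots identically, while toy_lookup() on TOY_SECRET_SLOT returns 
the secret table only for the MM that called toy_map_secret() and NULL 
for the other one.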

> 
> What I like about the memfd_secret() approach is the simplicity and being arch
> agnostic. What I don't like is the increased attack surface from using VMAs to
> track those allocations.

Yes, but memfd_secret() was really designed for user space to hold 
secrets. But I can see how you came to this solution.

> 
> I'm thinking of working on a PoC to implement the first approach of using a
> non-shared kernel PGD for secret memory allocations on arm64. This includes
> adding a kernel page table per process where all PGDs are shared but one, which
> will be used for mapping secret allocations, and handling fork & context
> switching (TTBR1 switching(?)) correctly for the secret memory PGD.
> 
> What do you think? I'd really appreciate opinions and possible ways forward.

Naive question: does arm64 rather resemble the s390x model or the x86-64 
model?

-- 
Cheers,

David / dhildenb

