* Re: [RFC PATCH 1/7] mseal: expose interface to seal / unseal user memory ranges
  [not found] <CABi2SkWrQOdxdai7YLoYKKc6GAwxue=jc+bH1=CgE-bKRO-GhA@mail.gmail.com>
@ 2024-10-08 20:31 ` Fares Mehanna
  0 siblings, 0 replies; 4+ messages in thread
From: Fares Mehanna @ 2024-10-08 20:31 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, ardb, arnd, bhelgaas, broonie, catalin.marinas, david,
	faresx, james.morse, javierm, jean-philippe, joey.gouly, keescook,
	kristina.martsenko, kvmarm, liam.howlett, linux-arm-kernel,
	linux-kernel, linux-mm, mark.rutland, maz, nh-open-source,
	oliver.upton, ptosi, rdunlap, rkagan, rppt, shikemeng,
	suzuki.poulose, tabba, will, yuzenghui

Hi Jeff,

> Hi Fares,
>
> Please add me to this series; I'm interested in everything related
> to mseal :-)
>
> I also added Kees, since mseal is a security feature, and Kees is CCed
> on security matters.

Thank you for taking the time to take a look! Sure, I will add you both
to future RFCs about this feature.

> On Wed, Sep 25, 2024 at 8:25 AM Fares Mehanna <faresx@amazon.de> wrote:
> >
> > Hi,
> >
> > Thanks for taking a look, and apologies for my delayed response.
> >
> > > It is not clear from the change log above or the cover letter as to why
> > > you need to go this route instead of using the mmap lock.
> >
> > In the current form of the patches I use memfd_secret() to allocate the
> > pages and remove them from the kernel linear mapping. [1]
> >
> > This allocates pages, maps them at user virtual addresses and tracks them
> > in a VMA.
> >
> > Before flipping the permissions on those pages to be used by the kernel, I
> > need to make sure that those virtual addresses and this VMA are off-limits
> > to the owning process.
> >
> > memfd_secret() pages are locked by default, so they won't be swapped out.
> > I need to seal the VMA to make sure the owner process can't
> > unmap/remap/... or change the protection of this VMA.
> >
> > So before changing the permissions on the secret pages, I make sure the
> > pages are faulted in, locked and sealed, so userspace can't influence this
> > mapping.
> >
> > > We can't use the mseal feature for this; it is supposed to be a one way
> > > transition.
> >
> > For this approach, I need the unseal operation when releasing the memory
> > range.
> >
> > The kernel can be done with the secret pages in one of two scenarios:
> > 1. During the lifecycle of the process.
> > 2. When the process terminates.
> >
> > For the first case, I need to unmap the VMA so it can be reused by the
> > owning process later, so I need the unseal operation. For the second case,
> > however, we don't need that, since the process mm is already destroyed or
> > just about to be destroyed anyway, regardless of sealed/unsealed VMAs. [1]
> >
> > I didn't expose the unseal operation to userspace.
>
> In general, we should avoid having do_unseal, even though the
> operation is restricted to the kernel itself.
>
> However, from what you have described, without looking at your code,
> the case is closer to mseal, except that you need to unmap it within
> the kernel code.
>
> For this, there are two options I can think of right now, posted here
> for discussion:
>
> 1> Add a new flag in vm_flags to allow unmapping while sealed. However,
>    this will not prevent userspace from unmapping the area.
>
> 2> Pass a flag into do_vmi_align_munmap() to skip the sealing checks for
>    your particular call. do_vmi_align_munmap() already takes flags such
>    as unlock.
>
> Will the above work for your case, or have I misunderstood the
> requirement?
Yeah, the second approach is exactly what I'm looking for: just unmapping the
VMA while it is sealed, to free the resources. I'm not sure how complicated it
would be to use, though.

But I got negative feedback on the whole approach of using user virtual
addresses and VMAs to track kernel secret allocations, even if I find a way to
keep the VMA off-limits to the owning process and possibly improve on hiding
the actual location of the secret memory.

We're still thinking of a better approach, but if we go back to the first
approach of using a separate PGD in kernel space, I won't be messing with VMAs
or sealing.

Thanks!
Fares.

> Thanks
> -Jeff

Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
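[Editor's note: as a rough illustration of Jeff's second option, an in-kernel
unmap of a sealed mm-local range could look like the sketch below. The helper
name is made up, and the extra seal-skipping argument to do_vmi_align_munmap()
does not exist upstream, so it is shown only as a trailing comment.]

	/*
	 * Hypothetical helper built on Jeff's option 2: unmap a sealed
	 * mm-local range from kernel code by threading a "skip the
	 * sealing check" flag into do_vmi_align_munmap().  That flag
	 * does not exist upstream today.
	 */
	static int secretmem_unmap_sealed(struct mm_struct *mm,
					  unsigned long start, unsigned long end)
	{
		struct vm_area_struct *vma;
		VMA_ITERATOR(vmi, mm, start);
		int ret = -ENOENT;

		mmap_write_lock(mm);
		vma = vma_find(&vmi, end);
		if (vma)
			ret = do_vmi_align_munmap(&vmi, vma, mm, start, end,
						  NULL /* uf */, false /* unlock */
						  /* , true: skip seal checks */);
		mmap_write_unlock(mm);
		return ret;
	}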
* [RFC PATCH 0/7] support for mm-local memory allocations and use it
@ 2024-09-11 14:33 Fares Mehanna
2024-09-11 14:34 ` [RFC PATCH 1/7] mseal: expose interface to seal / unseal user memory ranges Fares Mehanna
0 siblings, 1 reply; 4+ messages in thread
From: Fares Mehanna @ 2024-09-11 14:33 UTC (permalink / raw)
Cc: nh-open-source, Fares Mehanna, Marc Zyngier, Oliver Upton,
James Morse, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Will Deacon, Andrew Morton, Kemeng Shi, Pierre-Clément Tosi,
Ard Biesheuvel, Mark Rutland, Javier Martinez Canillas,
Arnd Bergmann, Fuad Tabba, Mark Brown, Joey Gouly,
Kristina Martsenko, Randy Dunlap, Bjorn Helgaas,
Jean-Philippe Brucker, Mike Rapoport (IBM),
David Hildenbrand, Roman Kagan,
moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
open list, open list:MEMORY MANAGEMENT
In a series posted a few years ago [1], a proposal was put forward to allow the
kernel to allocate memory local to an mm and thus push it out of reach of
current and future speculation-based cross-process attacks. We still believe
this is a nice thing to have.

However, in the time since that post, Linux mm has grown quite a few new
goodies, so we'd like to explore implementing this functionality with less
effort and churn by leveraging the facilities that are now available.
An RFC was posted a few months back [2] showing a proof of concept and a simple
test driver.
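[Editor's note: for background, here is a minimal userspace sketch of the
memfd_secret() building block the series relies on. The syscall has existed
since Linux 5.14; it has no glibc wrapper and returns ENOSYS unless the kernel
was booted with secretmem.enable=1.]

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		/* No glibc wrapper exists; invoke the raw syscall. */
		int fd = syscall(__NR_memfd_secret, 0);
		if (fd < 0)
			return 1;	/* ENOSYS unless secretmem.enable=1 */

		if (ftruncate(fd, 4096) != 0)
			return 1;

		/*
		 * Once faulted in, these pages are removed from the kernel's
		 * direct map; the mapping is implicitly locked (no swap).
		 */
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		p[0] = 42;

		munmap(p, 4096);
		close(fd);
		return 0;
	}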
In this RFC, we're using the same approach of implementing mm-local allocations
by piggy-backing on memfd_secret(): using regular user addresses, but pinning
the pages and flipping the user/supervisor flag on the respective PTEs to make
them directly accessible from the kernel.
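[Editor's note: a minimal sketch of the PTE flipping described above, assuming
a hypothetical per-arch pte_mkpriv() helper (e.g. clearing PTE_USER on arm64);
the series' real implementation lives in mm/secretmem.c and arch code.]

	/*
	 * Sketch only: make an mm-local range kernel-only by clearing the
	 * user-accessible bit on each PTE.  pte_mkpriv() is hypothetical.
	 */
	static int mmlocal_make_priv_pte(pte_t *pte, unsigned long addr, void *data)
	{
		struct mm_struct *mm = data;

		set_pte_at(mm, addr, pte, pte_mkpriv(ptep_get(pte)));
		return 0;
	}

	static int mmlocal_make_range_kernel(struct mm_struct *mm,
					     unsigned long start, unsigned long len)
	{
		/* Pages must already be faulted in, locked and sealed. */
		int err = apply_to_page_range(mm, start, len,
					      mmlocal_make_priv_pte, mm);

		if (!err)
			flush_tlb_mm(mm);  /* drop stale user-accessible TLB entries */
		return err;
	}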
In addition to that, we are submitting five patches that use the secret memory
to hide the vCPU gp-regs and fp-regs on arm64 VHE systems.
The generic drawbacks of using user virtual addresses mentioned in the previous
RFC [2] still hold, in addition to a more specific one:

- While the user virtual addresses allocated for kernel secret memory are not
  directly accessible by userspace (the PTEs forbid that), copy_from_user()
  and copy_to_user() can still operate on those ranges. Userspace can, for
  example, guess the address and pass it as the target buffer for read(),
  making the kernel overwrite it with user-controlled content. Effectively,
  the secret memory in the current implementation lacks confidentiality and
  integrity guarantees.

In the specific case of vCPU registers this is fine, because the owner process
can read and write them through KVM ioctls anyway. In the general case,
however, it represents a security concern that needs to be addressed.
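[Editor's note: to make the gap concrete, here is a userspace sketch of the
scenario described above. The guessed address is of course hypothetical; the
demo merely assumes the kernel-secret mapping lives somewhere guessable in the
process's address space.]

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		/* Hypothetical guess at the mm-local mapping's address. */
		char *guess = (char *)0x7f0000000000UL;
		int fd = open("/tmp/attacker-controlled", O_RDONLY);

		if (fd < 0)
			return 1;
		/*
		 * The PTEs block direct loads/stores from usermode, but
		 * read(2) writes through copy_to_user(), which only requires
		 * the buffer to be a user address -- so the kernel overwrites
		 * the "secret" pages with attacker-chosen file content.
		 */
		read(fd, guess, 4096);
		close(fd);
		return 0;
	}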
A possible way forward for the arch-agnostic implementation is to limit the
user virtual addresses used for the kernel to a specific range that
copy_from_user() and copy_to_user() can check against. For an arch-specific
implementation, using a separate PGD is the way to go.
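[Editor's note: a rough sketch of what such a uaccess check could look like,
assuming hypothetical per-mm bounds mmlocal_start/mmlocal_end delimiting the
kernel-private range; neither field exists upstream.]

	/* Hypothetical: mmlocal_start/mmlocal_end do not exist upstream. */
	static inline bool range_hits_mmlocal(struct mm_struct *mm,
					      unsigned long uaddr, size_t len)
	{
		return uaddr < mm->mmlocal_end &&
		       uaddr + len > mm->mmlocal_start;
	}

copy_from_user() and copy_to_user() would then reject such ranges up front,
much as access_ok() rejects kernel addresses today.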
[1] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@amazon.de/
[2] https://lore.kernel.org/lkml/20240621201501.1059948-1-rkagan@amazon.de/
Fares Mehanna / Roman Kagan (2):
mseal: expose interface to seal / unseal user memory ranges
mm/secretmem: implement mm-local kernel allocations
Fares Mehanna (5):
arm64: KVM: Refactor C-code to access vCPU gp-registers through macros
KVM: Refactor Assembly-code to access vCPU gp-registers through a
macro
arm64: KVM: Allocate vCPU gp-regs dynamically on VHE and
KERNEL_SECRETMEM enabled systems
arm64: KVM: Refactor C-code to access vCPU fp-registers through macros
arm64: KVM: Allocate vCPU fp-regs dynamically on VHE and
KERNEL_SECRETMEM enabled systems
arch/arm64/include/asm/kvm_asm.h | 50 ++--
arch/arm64/include/asm/kvm_emulate.h | 2 +-
arch/arm64/include/asm/kvm_host.h | 41 +++-
arch/arm64/kernel/asm-offsets.c | 1 +
arch/arm64/kernel/image-vars.h | 2 +
arch/arm64/kvm/arm.c | 90 +++++++-
arch/arm64/kvm/fpsimd.c | 2 +-
arch/arm64/kvm/guest.c | 14 +-
arch/arm64/kvm/hyp/entry.S | 15 ++
arch/arm64/kvm/hyp/include/hyp/switch.h | 6 +-
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 10 +-
.../arm64/kvm/hyp/include/nvhe/trap_handler.h | 2 +-
arch/arm64/kvm/hyp/nvhe/host.S | 20 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 4 +-
arch/arm64/kvm/reset.c | 2 +-
arch/arm64/kvm/va_layout.c | 38 ++++
include/linux/secretmem.h | 29 +++
mm/Kconfig | 10 +
mm/gup.c | 4 +-
mm/internal.h | 7 +
mm/mseal.c | 81 ++++---
mm/secretmem.c | 213 ++++++++++++++++++
22 files changed, 559 insertions(+), 84 deletions(-)
--
2.40.1
* [RFC PATCH 1/7] mseal: expose interface to seal / unseal user memory ranges
  2024-09-11 14:33 [RFC PATCH 0/7] support for mm-local memory allocations and use it Fares Mehanna
@ 2024-09-11 14:34 ` Fares Mehanna
  2024-09-12 16:40   ` Liam R. Howlett
  0 siblings, 1 reply; 4+ messages in thread
From: Fares Mehanna @ 2024-09-11 14:34 UTC (permalink / raw)
  Cc: nh-open-source, Fares Mehanna, Roman Kagan, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Catalin Marinas, Will Deacon, Andrew Morton, Kemeng Shi,
	Pierre-Clément Tosi, Ard Biesheuvel, Mark Rutland,
	Javier Martinez Canillas, Arnd Bergmann, Fuad Tabba, Mark Brown,
	Joey Gouly, Kristina Martsenko, Randy Dunlap, Bjorn Helgaas,
	Jean-Philippe Brucker, Mike Rapoport (IBM), David Hildenbrand,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list, open list:MEMORY MANAGEMENT

To make sure the kernel mm-local mapping is untouched by the user, we will seal
the VMA before changing the protection to be used by the kernel.

This will guarantee that userspace can't unmap or alter this VMA while it is
being used by the kernel.

After the kernel is done with the secret memory, it will unseal the VMA to be
able to unmap and free it.

Unseal operation is not exposed to userspace.

Signed-off-by: Fares Mehanna <faresx@amazon.de>
Signed-off-by: Roman Kagan <rkagan@amazon.de>
---
 mm/internal.h |  7 +++++
 mm/mseal.c    | 81 ++++++++++++++++++++++++++++++++-------------------
 2 files changed, 58 insertions(+), 30 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b4d86436565b..cf7280d101e9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1501,6 +1501,8 @@ bool can_modify_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end);
 bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
 		unsigned long end, int behavior);
+/* mm's mmap write lock must be taken before seal/unseal operation */
+int do_mseal(unsigned long start, unsigned long end, bool seal);
 #else
 static inline int can_do_mseal(unsigned long flags)
 {
@@ -1518,6 +1520,11 @@ static inline bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
 {
 	return true;
 }
+
+static inline int do_mseal(unsigned long start, unsigned long end, bool seal)
+{
+	return -EINVAL;
+}
 #endif
 
 #ifdef CONFIG_SHRINKER_DEBUG
diff --git a/mm/mseal.c b/mm/mseal.c
index 15bba28acc00..aac9399ffd5d 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -26,6 +26,11 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
 	vm_flags_set(vma, VM_SEALED);
 }
 
+static inline void clear_vma_sealed(struct vm_area_struct *vma)
+{
+	vm_flags_clear(vma, VM_SEALED);
+}
+
 /*
  * check if a vma is sealed for modification.
  * return true, if modification is allowed.
@@ -117,7 +122,7 @@ bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long
 
 static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct **prev, unsigned long start,
-		unsigned long end, vm_flags_t newflags)
+		unsigned long end, vm_flags_t newflags, bool seal)
 {
 	int ret = 0;
 	vm_flags_t oldflags = vma->vm_flags;
@@ -131,7 +136,10 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	set_vma_sealed(vma);
+	if (seal)
+		set_vma_sealed(vma);
+	else
+		clear_vma_sealed(vma);
 out:
 	*prev = vma;
 	return ret;
@@ -167,9 +175,9 @@ static int check_mm_seal(unsigned long start, unsigned long end)
 }
 
 /*
- * Apply sealing.
+ * Apply sealing / unsealing.
  */
-static int apply_mm_seal(unsigned long start, unsigned long end)
+static int apply_mm_seal(unsigned long start, unsigned long end, bool seal)
 {
 	unsigned long nstart;
 	struct vm_area_struct *vma, *prev;
@@ -191,11 +199,14 @@ static int apply_mm_seal(unsigned long start, unsigned long end)
 		unsigned long tmp;
 		vm_flags_t newflags;
 
-		newflags = vma->vm_flags | VM_SEALED;
+		if (seal)
+			newflags = vma->vm_flags | VM_SEALED;
+		else
+			newflags = vma->vm_flags & ~(VM_SEALED);
 		tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
-		error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
+		error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags, seal);
 		if (error)
 			return error;
 		nstart = vma_iter_end(&vmi);
@@ -204,6 +215,37 @@ static int apply_mm_seal(unsigned long start, unsigned long end)
 	return 0;
 }
 
+int do_mseal(unsigned long start, unsigned long end, bool seal)
+{
+	int ret;
+
+	if (end < start)
+		return -EINVAL;
+
+	if (end == start)
+		return 0;
+
+	/*
+	 * First pass, this helps to avoid
+	 * partial sealing in case of error in input address range,
+	 * e.g. ENOMEM error.
+	 */
+	ret = check_mm_seal(start, end);
+	if (ret)
+		goto out;
+
+	/*
+	 * Second pass, this should success, unless there are errors
+	 * from vma_modify_flags, e.g. merge/split error, or process
+	 * reaching the max supported VMAs, however, those cases shall
+	 * be rare.
+	 */
+	ret = apply_mm_seal(start, end, seal);
+
+out:
+	return ret;
+}
+
 /*
  * mseal(2) seals the VM's meta data from
  * selected syscalls.
@@ -256,7 +298,7 @@ static int apply_mm_seal(unsigned long start, unsigned long end)
  *
  * unseal() is not supported.
  */
-static int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
+static int __do_mseal(unsigned long start, size_t len_in, unsigned long flags)
 {
 	size_t len;
 	int ret = 0;
@@ -277,33 +319,12 @@ static int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
 		return -EINVAL;
 
 	end = start + len;
-	if (end < start)
-		return -EINVAL;
-
-	if (end == start)
-		return 0;
 
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
-	/*
-	 * First pass, this helps to avoid
-	 * partial sealing in case of error in input address range,
-	 * e.g. ENOMEM error.
-	 */
-	ret = check_mm_seal(start, end);
-	if (ret)
-		goto out;
-
-	/*
-	 * Second pass, this should success, unless there are errors
-	 * from vma_modify_flags, e.g. merge/split error, or process
-	 * reaching the max supported VMAs, however, those cases shall
-	 * be rare.
-	 */
-	ret = apply_mm_seal(start, end);
+	ret = do_mseal(start, end, true);
 
-out:
 	mmap_write_unlock(current->mm);
 	return ret;
 }
@@ -311,5 +332,5 @@ static int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
 SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
 		flags)
 {
-	return do_mseal(start, len, flags);
+	return __do_mseal(start, len, flags);
 }
-- 
2.40.1
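[Editor's note: to illustrate how a kernel-internal caller might use the new
interface, here is an illustrative wrapper that is not part of this patch; per
the header comment, the mmap write lock must be held around do_mseal().]

	/* Illustrative only: seal or unseal an mm-local range from kernel code. */
	static int mmlocal_set_sealed(struct mm_struct *mm, unsigned long start,
				      unsigned long end, bool seal)
	{
		int ret;

		if (mmap_write_lock_killable(mm))
			return -EINTR;
		/* seal == false is reachable only from the kernel, never via mseal(2) */
		ret = do_mseal(start, end, seal);
		mmap_write_unlock(mm);
		return ret;
	}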
* Re: [RFC PATCH 1/7] mseal: expose interface to seal / unseal user memory ranges
  2024-09-11 14:34 ` [RFC PATCH 1/7] mseal: expose interface to seal / unseal user memory ranges Fares Mehanna
@ 2024-09-12 16:40   ` Liam R. Howlett
  2024-09-25 15:25     ` Fares Mehanna
  0 siblings, 1 reply; 4+ messages in thread
From: Liam R. Howlett @ 2024-09-12 16:40 UTC (permalink / raw)
  To: Fares Mehanna
  Cc: nh-open-source, Roman Kagan, Marc Zyngier, Oliver Upton,
	James Morse, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
	Will Deacon, Andrew Morton, Kemeng Shi, Pierre-Clément Tosi,
	Ard Biesheuvel, Mark Rutland, Javier Martinez Canillas,
	Arnd Bergmann, Fuad Tabba, Mark Brown, Joey Gouly,
	Kristina Martsenko, Randy Dunlap, Bjorn Helgaas,
	Jean-Philippe Brucker, Mike Rapoport (IBM), David Hildenbrand,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list, open list:MEMORY MANAGEMENT

* Fares Mehanna <faresx@amazon.de> [240911 10:36]:
> To make sure the kernel mm-local mapping is untouched by the user, we will seal
> the VMA before changing the protection to be used by the kernel.
>
> This will guarantee that userspace can't unmap or alter this VMA while it is
> being used by the kernel.
>
> After the kernel is done with the secret memory, it will unseal the VMA to be
> able to unmap and free it.
>
> Unseal operation is not exposed to userspace.

We can't use the mseal feature for this; it is supposed to be a one way
transition.  Willy describes the feature best here [1].

It is not clear from the change log above or the cover letter as to why
you need to go this route instead of using the mmap lock.

[1] https://lore.kernel.org/lkml/ZS%2F3GCKvNn5qzhC4@casper.infradead.org/

[The remainder of the quoted patch is snipped here; it repeats the posting
above verbatim.]
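[Editor's note: for contrast, a sketch of the mmap-lock approach Liam alludes
to. This hypothetical helper protects the range only while the lock is held,
which suits short, bounded kernel accesses but not the long-lived kernel-owned
mappings this series creates -- hence the sealing approach.]

	/*
	 * Hypothetical: userspace cannot munmap()/mremap()/mprotect() the
	 * range while the mmap lock is held, so a short kernel access is
	 * safe without sealing.  The protection ends when the lock drops.
	 */
	static void mmlocal_access_locked(struct mm_struct *mm, void (*access)(void))
	{
		mmap_read_lock(mm);
		access();	/* touch the mm-local mapping here */
		mmap_read_unlock(mm);
	}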
* Re: [RFC PATCH 1/7] mseal: expose interface to seal / unseal user memory ranges
  2024-09-12 16:40   ` Liam R. Howlett
@ 2024-09-25 15:25     ` Fares Mehanna
  0 siblings, 0 replies; 4+ messages in thread
From: Fares Mehanna @ 2024-09-25 15:25 UTC (permalink / raw)
  To: liam.howlett
  Cc: akpm, ardb, arnd, bhelgaas, broonie, catalin.marinas, david,
	faresx, james.morse, javierm, jean-philippe, joey.gouly,
	kristina.martsenko, kvmarm, linux-arm-kernel, linux-kernel,
	linux-mm, mark.rutland, maz, nh-open-source, oliver.upton, ptosi,
	rdunlap, rkagan, rppt, shikemeng, suzuki.poulose, tabba, will,
	yuzenghui

Hi,

Thanks for taking a look, and apologies for my delayed response.

> It is not clear from the change log above or the cover letter as to why
> you need to go this route instead of using the mmap lock.

In the current form of the patches I use memfd_secret() to allocate the pages
and remove them from the kernel linear mapping. [1]

This allocates pages, maps them at user virtual addresses and tracks them in a
VMA.

Before flipping the permissions on those pages to be used by the kernel, I need
to make sure that those virtual addresses and this VMA are off-limits to the
owning process.

memfd_secret() pages are locked by default, so they won't be swapped out. I
need to seal the VMA to make sure the owner process can't unmap/remap/... or
change the protection of this VMA.

So before changing the permissions on the secret pages, I make sure the pages
are faulted in, locked and sealed, so userspace can't influence this mapping.

> We can't use the mseal feature for this; it is supposed to be a one way
> transition.

For this approach, I need the unseal operation when releasing the memory range.

The kernel can be done with the secret pages in one of two scenarios:
1. During the lifecycle of the process.
2. When the process terminates.

For the first case, I need to unmap the VMA so it can be reused by the owning
process later, so I need the unseal operation. For the second case, however, we
don't need that, since the process mm is already destroyed or just about to be
destroyed anyway, regardless of sealed/unsealed VMAs. [1]

I didn't expose the unseal operation to userspace.

[1] https://lore.kernel.org/linux-arm-kernel/20240911143421.85612-3-faresx@amazon.de/

Thanks!
Fares.
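[Editor's note: put together, the preparation order Fares describes (fault in
and pin, then seal, then flip the PTEs) could look roughly like the sketch
below. The helper name is made up, and note that upstream GUP refuses
secretmem VMAs, which is presumably what the series' mm/gup.c change (see the
cover-letter diffstat) relaxes for kernel-internal users.]

	/*
	 * Rough sketch of the preparation order described above.  Upstream
	 * pin_user_pages_unlocked() rejects secretmem VMAs; the series
	 * touches mm/gup.c, presumably to allow this for in-kernel users.
	 */
	static int secretmem_prepare_range(unsigned long start, size_t len,
					   struct page **pages)
	{
		struct mm_struct *mm = current->mm;
		long nr = len >> PAGE_SHIFT;
		long pinned;
		int ret;

		/* Fault the pages in and pin them (they are VM_LOCKED already). */
		pinned = pin_user_pages_unlocked(start, nr, pages, FOLL_WRITE);
		if (pinned != nr) {
			ret = -EFAULT;
			goto unpin;
		}

		/* Seal so the owner can't unmap/remap/mprotect the VMA. */
		if (mmap_write_lock_killable(mm)) {
			ret = -EINTR;
			goto unpin;
		}
		ret = do_mseal(start, start + len, true);
		mmap_write_unlock(mm);
		if (ret)
			goto unpin;

		return 0;	/* now safe to flip the PTEs to kernel-only */

	unpin:
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		return ret;
	}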